First-token latency is a UX problem, not an infrastructure one.

Every AI product has the same moment. The user clicks a button, and the model starts thinking. Two seconds pass. Three. Five. The spinner turns. The user starts looking away, or refreshing the page, or closing the tab.

Time-to-first-token is the single variable that decides whether an AI product feels fast or slow. It is not about total generation time. A response that takes twelve seconds total but starts streaming in 400 milliseconds feels quick. A response that takes three seconds total but sits blank for two feels broken.

Most AI product teams we audit are still treating this as an infrastructure problem. It is partly infrastructure, but the bigger fix is in the interface. Design has to do the work that the backend cannot.

The latency the user feels

Three numbers to track, roughly in order of how much they matter.

First-token latency. From the user pressing the button to the first character appearing on screen. This is what sets the tempo. Target: under 800 ms for conversational UI, under 300 ms for inline completion. Anything above two seconds feels broken regardless of what happens next.

Streaming throughput. How fast characters arrive after the first token. Modern models stream at 30 to 120 tokens per second. Users read at maybe five to seven tokens per second. As long as the stream stays faster than the eye, it feels instant. If the stream ever stalls for more than 200 ms mid-generation, the illusion breaks and the user starts watching the animation instead of the content.

Total time. The least important of the three, because users judge a response by when it starts and how steadily it arrives, not by when it ends. Most users will accept a long answer if the answer is clearly arriving.

Optimise in order: first-token, then smooth streaming, then total. Teams get this backwards all the time.
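A minimal sketch of how the three numbers could be computed from client-side timestamps. The names (`measureLatency`, `LatencyReport`) are hypothetical, and the defaults simply reuse the 800 ms first-token target and 200 ms stall limit quoted above.

```typescript
// Hypothetical helper, not from any library. All times are milliseconds
// on one client-side clock, for example performance.now().
interface LatencyReport {
  firstTokenMs: number;            // request sent to first token on screen
  totalMs: number;                 // request sent to last token on screen
  tokensPerSecond: number;         // throughput after the first token
  longestStallMs: number;          // largest gap between consecutive tokens
  withinFirstTokenBudget: boolean; // first token arrived inside the budget
  stalled: boolean;                // some gap exceeded the stall threshold
}

function measureLatency(
  sentAt: number,
  tokenTimes: number[],
  firstTokenBudgetMs: number = 800, // conversational target from the text
  stallThresholdMs: number = 200,   // mid-stream stall limit from the text
): LatencyReport | null {
  let first = 0;
  let last = 0;
  let longestStallMs = 0;
  let count = 0;
  for (const t of tokenTimes) {
    if (count === 0) first = t;
    else longestStallMs = Math.max(longestStallMs, t - last);
    last = t;
    count += 1;
  }
  if (count === 0) return null; // nothing arrived, nothing to measure

  const streamMs = last - first;
  const tokensPerSecond = streamMs > 0 ? ((count - 1) * 1000) / streamMs : 0;
  const firstTokenMs = first - sentAt;
  return {
    firstTokenMs,
    totalMs: last - sentAt,
    tokensPerSecond,
    longestStallMs,
    withinFirstTokenBudget: firstTokenMs <= firstTokenBudgetMs,
    stalled: longestStallMs > stallThresholdMs,
  };
}
```

Reading the report in the order the fields are listed mirrors the optimisation order: first token, then stalls and throughput, then total.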

The design patterns that buy time

Six moves we make on almost every AI product.

Stream everything that can stream. If the model supports server-sent events, use them. If the framework does not surface SSE cleanly, switch frameworks or patch around it. A non-streaming UI on a streaming-capable model is wasted product value, full stop.
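For teams that have to patch around a framework, the core of SSE handling is small. The sketch below is a simplified incremental parser for the `data:` field only; the class name `SseDataParser` is hypothetical, and it deliberately ignores the `event:`, `id:` and `retry:` fields and bare-CR line endings that the full format allows.

```typescript
// Simplified sketch: collects `data:` lines and emits one payload per event.
// Assumes LF or CRLF line endings; other SSE fields are skipped.
class SseDataParser {
  private buffer = "";
  private dataLines: string[] = [];

  // Feed one network chunk; returns the payloads of every event it completed.
  push(chunk: string): string[] {
    this.buffer += chunk;
    const completed: string[] = [];
    let newline = this.buffer.indexOf("\n");
    while (newline !== -1) {
      // Take one full line off the buffer, tolerating a trailing CR.
      const line = this.buffer.slice(0, newline).replace(/\r$/, "");
      this.buffer = this.buffer.slice(newline + 1);
      if (line === "") {
        // A blank line ends the current event.
        if (this.dataLines.length > 0) {
          completed.push(this.dataLines.join("\n"));
          this.dataLines = [];
        }
      } else if (line.slice(0, 5) === "data:") {
        const value = line.slice(5);
        // One optional space after the colon is not part of the payload.
        this.dataLines.push(value.charAt(0) === " " ? value.slice(1) : value);
      }
      // Comment lines (starting with ":") and other fields fall through.
      newline = this.buffer.indexOf("\n");
    }
    return completed;
  }
}
```

Because the parser buffers partial lines, it can be fed raw chunks exactly as they come off the network, and each returned payload can go straight to the renderer.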

Optimistic UI. The user's message appears in the conversation the moment they press send, not after the server confirms. The system's response area renders its pending state instantly. Pressing submit should have an immediate visual consequence, even if the network call is still in flight.
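One way to sketch this is a reducer in which the `send` action adds both the user's message and a pending assistant row before any network response exists. The types and the `chatReducer` name are hypothetical, not taken from a particular framework.

```typescript
interface Message {
  id: string;
  role: "user" | "assistant";
  text: string;
  status: "sent" | "pending" | "streaming" | "done";
}

type ChatAction =
  | { type: "send"; requestId: string; text: string }
  | { type: "token"; requestId: string; text: string }
  | { type: "finish"; requestId: string };

function chatReducer(messages: Message[], action: ChatAction): Message[] {
  switch (action.type) {
    case "send":
      // Optimistic step: both rows render before the server has answered.
      return [
        ...messages,
        { id: action.requestId + ":user", role: "user", text: action.text, status: "sent" },
        { id: action.requestId, role: "assistant", text: "", status: "pending" },
      ];
    case "token": {
      const { requestId, text } = action;
      // Append the streamed token to the matching assistant row.
      return messages.map((m): Message =>
        m.id === requestId ? { ...m, text: m.text + text, status: "streaming" } : m,
      );
    }
    case "finish": {
      const { requestId } = action;
      return messages.map((m): Message =>
        m.id === requestId ? { ...m, status: "done" } : m,
      );
    }
  }
}
```

Dispatching `send` synchronously from the submit handler, and only then starting the network call, is what gives the click its immediate visual consequence.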

A skeleton that matches the output shape. If the response is going to be a table, render an empty table skeleton as soon as the response shape is known. If it is going to be a card with three fields, show three grey rectangles. The user's eye adjusts to the target shape before the content arrives. Generic skeletons are worth something; shape-matched skeletons are worth more.
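As a rough illustration, the expected output shape can be mapped to a list of placeholder blocks before any content exists. Everything here is an assumption made for the sketch: the `OutputShape` variants, the `skeletonFor` name, and the pixel heights.

```typescript
type OutputShape =
  | { kind: "table"; rows: number; columns: number }
  | { kind: "card"; fields: number }
  | { kind: "prose"; lines: number };

interface Placeholder {
  widthPercent: number;
  heightPx: number;
}

// Returns grey-rectangle descriptors for the renderer to draw.
function skeletonFor(shape: OutputShape): Placeholder[] {
  const blocks: Placeholder[] = [];
  switch (shape.kind) {
    case "table":
      // One cell-sized block per table cell.
      for (let i = 0; i < shape.rows * shape.columns; i++) {
        blocks.push({ widthPercent: 100 / shape.columns, heightPx: 24 });
      }
      break;
    case "card":
      // One full-width block per field.
      for (let i = 0; i < shape.fields; i++) {
        blocks.push({ widthPercent: 100, heightPx: 32 });
      }
      break;
    case "prose":
      // Full-width lines with a shorter last line, like a real paragraph.
      for (let i = 0; i < shape.lines; i++) {
        const isLast = i === shape.lines - 1;
        blocks.push({ widthPercent: isLast ? 60 : 100, heightPx: 16 });
      }
      break;
  }
  return blocks;
}
```

A card with three fields yields three rectangles, matching the example in the text.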

A stop button, always. If the user is going to wait, they need the option not to. A prominent stop button during generation reduces frustration dramatically. The cost is trivial: one button, one abort signal, a cancel handler on the server. The benefit is that the user feels in control of the process, even when the model is slow. Products that omit the stop button read as arrogant.
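The client side of "one button, one abort signal" can be sketched as a small state holder. `StoppableGeneration` is a hypothetical name; the `cancelOnServer` callback stands in for whatever actually cancels the request, for example calling `abort()` on an `AbortController` attached to the fetch.

```typescript
class StoppableGeneration {
  private text = "";
  private state: "running" | "stopped" | "done" = "running";
  private cancelOnServer: () => void;

  constructor(cancelOnServer: () => void) {
    this.cancelOnServer = cancelOnServer;
  }

  // Called by the transport for every streamed token.
  receive(token: string): void {
    if (this.state !== "running") return; // drop tokens that arrive after stop
    this.text += token;
  }

  // Wired to the stop button. Safe to call more than once.
  stop(): void {
    if (this.state !== "running") return;
    this.state = "stopped";
    this.cancelOnServer(); // tell the server to stop generating
  }

  // Called when the stream ends on its own.
  finish(): void {
    if (this.state === "running") this.state = "done";
  }

  snapshot(): { text: string; state: "running" | "stopped" | "done" } {
    return { text: this.text, state: this.state };
  }
}
```

Keeping the partial text after a stop is a design choice in this sketch: the user sees what was generated so far rather than an empty bubble.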

Reasoning visibility when it earns its keep. For tasks that genuinely take five or more seconds, showing a brief "Thinking about X. Checking Y. Composing answer." line keeps the user oriented. But only when the line is real. Fake reasoning animations that do not reflect what the model is doing read as patronising within one or two uses. We only show reasoning when the underlying agent is in a real tool-call loop the user can track, and we show the actual tool names and results when the user wants to look.
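A sketch of deriving the status line from real agent events rather than a scripted animation. The event shape (`AgentEvent`) and the wording of the lines are assumptions for illustration; the point is that an empty event list produces no line at all.

```typescript
type AgentEvent =
  | { type: "tool_start"; tool: string }
  | { type: "tool_end"; tool: string; ok: boolean }
  | { type: "composing" };

// Returns the line to show, or null when there is nothing real to report.
function currentStatus(events: AgentEvent[]): string | null {
  const latest = events[events.length - 1];
  if (latest === undefined) return null; // no events yet: show no reasoning line
  switch (latest.type) {
    case "tool_start":
      return `Running ${latest.tool}`;
    case "tool_end":
      return latest.ok ? `Finished ${latest.tool}` : `${latest.tool} failed`;
    case "composing":
      return "Composing answer";
  }
}
```

Because the line uses the actual tool name from the event, the same event list can back a detail view for users who want to inspect the tool calls and results.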

Edge proxies for the round trip. A lot of AI latency is the last mile to the provider. Hosting the client-facing proxy close to the user shaves 100 to 300 ms off first-token latency for free. It is a deployment decision, not a design one, but design teams should know to ask for it, because nobody will add it on their own.

The patterns that feel bad

Things we see on poorly built AI products, in rough order of severity.

Indeterminate spinner. The user cannot tell if the system is working or stuck. A spinner carries zero information, so users bail. Replace it with a skeleton and a live character counter, or at least a streamed log of what is happening.

Total-time progress bar. Because model output is variable-length, a progress-to-100-per-cent bar lies most of the time. Either it stalls at 85 per cent for three seconds, or it leaps from 40 per cent to done. Remove it. Stream the output instead. The stream is the progress indicator.

Disabled input during generation. If the user wants to cancel and try something else, let them. The running generation aborts and the new request takes over. Locking the input while the model is working is legacy thinking from 2022 chatbots and should not survive a usability review in 2026.
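The "new request takes over" behaviour can be sketched as a small gate that cancels the older generation and lets the renderer ignore stale tokens. `LatestRequestWins` is a hypothetical name, and the `cancel` callback is assumed to abort the underlying network request.

```typescript
class LatestRequestWins {
  private currentId = 0;
  private cancelCurrent: (() => void) | null = null;

  // Call on every submit, even while a generation is still running.
  begin(cancel: () => void): number {
    if (this.cancelCurrent !== null) this.cancelCurrent(); // abort the older one
    this.cancelCurrent = cancel;
    this.currentId += 1;
    return this.currentId;
  }

  // Check before rendering a token; stale requests get false.
  isCurrent(requestId: number): boolean {
    return requestId === this.currentId;
  }
}
```

The `isCurrent` check matters even with cancellation in place, because tokens already in flight can still arrive after the abort.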

Hidden retries. The model fails silently on the server. The UI shows a spinner for nine seconds. The retry succeeds. Three seconds later, the full answer arrives. The user experienced twelve seconds of unexplained waiting and quietly decided the product is slow. Surface the retry. "Connection slow. Trying again." is better than silence, every time.
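A minimal sketch of a retry wrapper that reports each retry to the UI instead of hiding it. The function name, the injected `sleep` timer, the attempt limit and the linear backoff are all assumptions made for the example.

```typescript
async function withVisibleRetry<T>(
  attempt: () => Promise<T>,
  onStatus: (message: string) => void,   // e.g. updates a status line in the UI
  sleep: (ms: number) => Promise<void>,  // injected timer, so tests can skip waits
  maxAttempts: number = 3,
): Promise<T> {
  let lastError: unknown = new Error("no attempts made");
  for (let n = 1; n <= maxAttempts; n++) {
    try {
      return await attempt();
    } catch (error) {
      lastError = error;
      if (n < maxAttempts) {
        // Surface the retry before waiting, so the user is never left guessing.
        onStatus(`Connection slow. Trying again (attempt ${n + 1} of ${maxAttempts}).`);
        await sleep(250 * n); // linear backoff, an arbitrary choice
      }
    }
  }
  throw lastError; // every attempt failed: let the caller show a real error
}
```

In the twelve-second scenario above, the user would see the status message at the moment the first attempt fails instead of an unexplained spinner.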

The confidence question

A debate that still goes both ways inside our team. Should the UI show the model's confidence alongside the output?

The answer we have landed on: only when the downstream action differs based on confidence. A "file this" action that auto-files on high confidence and asks for confirmation on low confidence should show the confidence, because it explains the ask. A "draft this email" action that surfaces a confidence percentage numerically just makes the user feel worse about a reasonable suggestion. Confidence as a UI element is only useful when it shapes the interaction the user is about to take.
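The rule can be sketched as a function that returns a confidence label only when it explains an ask. The 0.9 threshold, the action names and the label wording are assumptions; showing the label only on the confirmation path is one reading of the rule above.

```typescript
interface Presentation {
  action: "auto_file" | "ask_to_confirm" | "show_draft";
  confidenceLabel: string | null; // null means: do not render confidence
}

function presentSuggestion(
  kind: "file" | "draft_email",
  confidence: number,               // 0 to 1, as reported by the model pipeline
  autoFileThreshold: number = 0.9,  // assumed cut-off, tune per product
): Presentation {
  if (kind === "draft_email") {
    // Confidence does not change what the user does next, so hide it.
    return { action: "show_draft", confidenceLabel: null };
  }
  if (confidence >= autoFileThreshold) {
    return { action: "auto_file", confidenceLabel: null };
  }
  // Low confidence changes the interaction, so the number explains the ask.
  return {
    action: "ask_to_confirm",
    confidenceLabel: `${Math.round(confidence * 100)}% sure`,
  };
}
```

For example, `presentSuggestion("file", 0.6)` asks for confirmation and carries a label, while `presentSuggestion("draft_email", 0.6)` carries none.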

We have watched otherwise good products undermine themselves by sprinkling confidence scores next to every answer. The user's brain converts the score to a mild anxiety, and the feature stops getting used.

One last note on the cost of small improvements

We built and shipped two versions of a client's AI feature. The only difference between them was that one streamed its output and the other did not. Identical model, identical prompt, identical latency on the server. Three weeks in, the streaming version's adoption rate was 47 per cent higher than the non-streaming version's. Same answer, delivered with a different UX, treated by users as a different feature.

That is the scale of the effect. The model is slow. The interface does not have to feel slow. First-token latency, streaming, a stop button, a skeleton, and a visible reasoning step are five design decisions that can make a ten-second response feel like a two-second one. The underlying infrastructure is harder to fix and less load-bearing than most teams think.

Design the waiting. Do not try to design it away.
