From Silence to Stream: Building Low-Latency AI with Node.js, React & WebSockets
A loading spinner is not a UX strategy.
In my previous post, I argued that production AI behaves like a contract, not a conversation. We used Zod to keep LLM outputs from breaking Node.js systems. That was the integrity side.
But even a perfectly validated response that arrives after 15 seconds of silence is still a broken product: users don't experience your architecture; they experience the wait.
So this post shifts the focus from the contract to the flow.
The right mental model: LLMs are unreliable upstreams
Junior implementations treat the model like a fast database call. You fire a request, you wait, you get a result.
Senior implementations treat it like what it actually is: a slow, non-deterministic, expensive upstream dependency that may fail mid-stream, outlive the user's session, or keep generating after the user already left.
Once you internalize that, everything changes.
The backend can't just proxy text. It has to manage lifecycle.
The frontend can't just append strings. It has to synchronize incomplete state under uncertainty.
That's why streaming is not a transport problem. It's a systems problem.
SSE vs. WebSockets: the real tradeoff
For simple one-way token delivery, Server-Sent Events (SSE) are a perfectly valid choice. They are simpler to implement, the browser's EventSource reconnects on its own, and everything works over plain HTTP.
But production AI interfaces are rarely one-way.
Users cancel. They send follow-up context mid-stream. They trigger retries. Future agentic flows will require real-time back-and-forth between client and server during a single generation.
That's why I reach for WebSockets. Not because they always win, but because bidirectional control is what modern AI interaction models actually need. SSE is a great fit for a simple "show me the tokens" use case. WebSockets are a better fit when the user and the server both need to talk while the model is thinking.
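For contrast, here is a minimal sketch of what the one-way SSE path looks like on the wire. The `toSseFrame` helper and the `StreamEvent` shape are assumptions made up for this illustration, not part of any library. Only the `event:` / `data:` framing comes from the SSE format itself.

```typescript
// Hypothetical event shape, used only for this illustration
type StreamEvent = { type: string; requestId: string; chunk?: string };

// Serialize one event as a Server-Sent Events frame.
// JSON.stringify escapes newlines inside the payload, so a single data: line is enough.
function toSseFrame(event: StreamEvent): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}

// Example: one token event, ready to be written to a text/event-stream response
const frame = toSseFrame({ type: 'ai:token', requestId: 'r1', chunk: 'Hi' });
```

Note what is missing: there is no frame the client can send back. With SSE, a cancel has to travel as a separate HTTP request or be inferred from the connection closing, which is exactly the gap WebSockets fill.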
Don't stream raw strings. Design a protocol.
This is the detail most streaming tutorials skip, and it's what separates a demo from a maintainable system.
If you pipe anonymous token chunks from server to client, you have no way to correlate events, handle overlapping requests, or debug what went wrong. The moment you need cancellation, metrics, or multiple concurrent generations, the whole thing falls apart.
Define an explicit wire protocol first:
// The server can send:
type ServerEvent =
| { type: 'ai:start'; requestId: string }
| { type: 'ai:token'; requestId: string; chunk: string }
| { type: 'ai:done'; requestId: string }
| { type: 'ai:error'; requestId: string; message: string };
// The client can send:
type ClientEvent =
| { type: 'ai:generate'; requestId: string; prompt: string }
| { type: 'ai:cancel'; requestId: string };
Now every event is typed, traceable, and tied to a requestId. You can log time-to-first-token per request. You can safely handle concurrent generations. You can evolve the protocol without turning the socket layer into guesswork.
This is where streaming starts to feel like engineering.
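TypeScript types disappear at runtime, so the protocol only holds if incoming frames are checked before they are trusted. The sketch below shows one way to do that for client events with a hand-written guard. `parseClientEvent` is a hypothetical helper name, and a Zod schema would do the same job if you already have Zod in the project.

```typescript
type ClientEvent =
  | { type: 'ai:generate'; requestId: string; prompt: string }
  | { type: 'ai:cancel'; requestId: string };

// Hypothetical helper: returns a typed event, or null when the frame is malformed.
function parseClientEvent(raw: string): ClientEvent | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // Not JSON at all
  }
  if (typeof data !== 'object' || data === null) return null;

  const candidate = data as Record<string, unknown>;
  if (typeof candidate.requestId !== 'string') return null;

  if (candidate.type === 'ai:generate' && typeof candidate.prompt === 'string') {
    return { type: 'ai:generate', requestId: candidate.requestId, prompt: candidate.prompt };
  }
  if (candidate.type === 'ai:cancel') {
    return { type: 'ai:cancel', requestId: candidate.requestId };
  }
  return null; // Unknown event type or missing fields
}
```

Returning null instead of throwing keeps the decision with the caller: the socket handler can drop the frame, log it, or answer with an error event.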
The backend: your job is orchestration, not proxying
The server's real job is to coordinate the lifecycle of a generation safely. That means creating request-scoped state, enabling true cancellation, and cleaning up when things go wrong.
// server/stream-manager.ts
import { WebSocketServer, WebSocket } from 'ws';
import OpenAI from 'openai';
const wss = new WebSocketServer({ port: 8080 });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
const STREAM_TIMEOUT_MS = 30_000; // LLMs can hang; you set the ceiling

// Write only while the socket is open, so a departed client never causes a failed send
function sendEvent(ws: WebSocket, event: Record<string, unknown>) {
  if (ws.readyState === WebSocket.OPEN) ws.send(JSON.stringify(event));
}

wss.on('connection', (ws) => {
  // Track in-flight streams by requestId, scoped to this connection:
  // one client closing its tab must not abort another client's generations
  const activeStreams = new Map<string, AbortController>();

  ws.on('message', async (raw) => {
    let event: any;
    try {
      event = JSON.parse(raw.toString());
    } catch {
      return; // Malformed frame: ignore it rather than crash the handler
    }

    if (event.type === 'ai:generate') {
      const { requestId, prompt } = event;
      const controller = new AbortController();
      activeStreams.set(requestId, controller);

      // Hard timeout: don't let the upstream decide when you give up
      let timedOut = false;
      const timeout = setTimeout(() => {
        timedOut = true;
        controller.abort();
      }, STREAM_TIMEOUT_MS);

      sendEvent(ws, { type: 'ai:start', requestId });

      try {
        const stream = await openai.chat.completions.create(
          {
            model: 'gpt-4o',
            messages: [{ role: 'user', content: prompt }],
            stream: true,
          },
          { signal: controller.signal } // AbortController propagates to the SDK
        );

        for await (const chunk of stream) {
          if (ws.readyState !== WebSocket.OPEN) {
            controller.abort(); // Guard: client may have left, so stop the upstream too
            break;
          }
          const content = chunk.choices[0]?.delta?.content;
          if (content) {
            sendEvent(ws, { type: 'ai:token', requestId, chunk: content });
          }
        }

        // An aborted stream may end the loop quietly instead of throwing,
        // so route it through the same error path as a thrown abort
        if (controller.signal.aborted) throw new Error('aborted');
        sendEvent(ws, { type: 'ai:done', requestId });
      } catch {
        // Exactly one terminal event per request: timeout, cancel, or upstream failure
        const message = timedOut
          ? 'Stream timed out.'
          : controller.signal.aborted
            ? 'Generation cancelled.'
            : 'Upstream streaming failed.';
        sendEvent(ws, { type: 'ai:error', requestId, message });
      } finally {
        clearTimeout(timeout);
        activeStreams.delete(requestId); // Always clean up
      }
    }

    if (event.type === 'ai:cancel') {
      activeStreams.get(event.requestId)?.abort();
    }
  });

  ws.on('close', () => {
    // Orphaned streams: if the user closes the tab, kill the upstream call.
    // Ghost generations are the zombie processes of the AI era: wasted tokens,
    // wasted spend, misleading metrics.
    activeStreams.forEach((controller) => controller.abort());
    activeStreams.clear();
  });
});
Three things here are worth flagging explicitly:
AbortController propagated to the SDK: cancellation is real, not cosmetic. A "Stop" button that only hides the UI is still burning tokens on the backend.
Hard timeout: you decide when to give up, not the model.
ws.readyState guard: writing to a socket that is no longer open fails. Depending on the ws version, the call either throws or the message is quietly dropped. Always check before you write.
The frontend: streaming is a state synchronization problem
This is where most AI demos quietly fall apart.
At first glance it looks simple: token arrives, append token, render text. But streaming UI is not a text problem; it's a state synchronization problem.
Tokens arrive asynchronously. Multiple requests can overlap. Streams can fail after partial output. The UI has to stay stable through all of it.
The key insight: don't model this as "the current answer." Model it as request-scoped generation state.
// client/src/components/AIChat.tsx
import { useEffect, useRef, useState } from 'react';

type MessageState = {
  requestId: string;
  content: string;
  status: 'streaming' | 'done' | 'error';
  error?: string;
};

export function AIChat() {
  const socketRef = useRef<WebSocket | null>(null);
  const [messages, setMessages] = useState<MessageState[]>([]);

  useEffect(() => {
    const ws = new WebSocket('ws://localhost:8080');
    socketRef.current = ws;

    ws.onmessage = ({ data }) => {
      const msg = JSON.parse(data);

      if (msg.type === 'ai:start') {
        setMessages((prev) => [
          ...prev,
          { requestId: msg.requestId, content: '', status: 'streaming' },
        ]);
      }

      if (msg.type === 'ai:token') {
        setMessages((prev) =>
          prev.map((m) =>
            m.requestId === msg.requestId
              ? { ...m, content: m.content + msg.chunk } // Functional update: avoids stale closure
              : m
          )
        );
      }

      if (msg.type === 'ai:done') {
        setMessages((prev) =>
          prev.map((m) =>
            m.requestId === msg.requestId ? { ...m, status: 'done' } : m
          )
        );
      }

      if (msg.type === 'ai:error') {
        setMessages((prev) =>
          prev.map((m) =>
            m.requestId === msg.requestId
              ? { ...m, status: 'error', error: msg.message }
              : m
          )
        );
      }
    };

    return () => ws.close();
  }, []);

  const send = (prompt: string) => {
    const requestId = crypto.randomUUID();
    socketRef.current?.send(JSON.stringify({ type: 'ai:generate', requestId, prompt }));
  };

  const cancel = (requestId: string) => {
    socketRef.current?.send(JSON.stringify({ type: 'ai:cancel', requestId }));
  };

  return (
    <div>
      {messages.map((m) => (
        <div key={m.requestId}>
          <p>{m.content}</p>
          {m.status === 'streaming' && <span className="cursor">▍</span>}
          {m.status === 'streaming' && (
            <button onClick={() => cancel(m.requestId)}>Stop</button>
          )}
          {m.status === 'error' && <p className="error">{m.error}</p>}
        </div>
      ))}
    </div>
  );
}
The blinking cursor (▍) is a small detail that earns its place. Without it, a streaming UI feels broken during the gaps between tokens. It signals life.
The requestId-scoped state is what makes this architecture safe to extend. The moment your product needs retries, multiple open chats, or multi-step agent flows, a single content string falls over immediately.
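One way to keep that request-scoped logic testable is to pull it out of the component into a pure function. The sketch below is an assumption about how you might structure it, not the code from the repo: `applyServerEvent` is a hypothetical name, and ignoring tokens that arrive after a message has finished is a choice I added for safety.

```typescript
type ServerEvent =
  | { type: 'ai:start'; requestId: string }
  | { type: 'ai:token'; requestId: string; chunk: string }
  | { type: 'ai:done'; requestId: string }
  | { type: 'ai:error'; requestId: string; message: string };

type MessageState = {
  requestId: string;
  content: string;
  status: 'streaming' | 'done' | 'error';
  error?: string;
};

// Pure transition: (previous messages, one server event) -> next messages.
// Only the message whose requestId matches the event is touched.
function applyServerEvent(messages: MessageState[], event: ServerEvent): MessageState[] {
  switch (event.type) {
    case 'ai:start':
      return [...messages, { requestId: event.requestId, content: '', status: 'streaming' }];
    case 'ai:token':
      return messages.map((m): MessageState =>
        // Late tokens for a finished or failed message are dropped
        m.requestId === event.requestId && m.status === 'streaming'
          ? { ...m, content: m.content + event.chunk }
          : m
      );
    case 'ai:done':
      return messages.map((m): MessageState =>
        m.requestId === event.requestId ? { ...m, status: 'done' } : m
      );
    case 'ai:error':
      return messages.map((m): MessageState =>
        m.requestId === event.requestId ? { ...m, status: 'error', error: event.message } : m
      );
  }
}
```

In the component, every branch of ws.onmessage then collapses to setMessages((prev) => applyServerEvent(prev, msg)), and overlapping generations can be unit-tested without a socket or a browser.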
One more trap: the re-render storm
Calling setState on every individual token is fine in demos. In production, at 30+ tokens/second, every token re-renders the message list, and that work competes with typing and scrolling on the main thread. The result is jitter and laggy input.
The fix: accumulate chunks in a ref, then flush into state on an animation frame boundary or a short interval. Users still see continuous output; React does less work.
That's the kind of optimization that doesn't show up in tutorials, but does show up in production performance reviews.
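Here is a minimal sketch of that buffering idea, kept framework-agnostic so it can be tested on its own. `createTokenBuffer`, the 50ms default, and the timer-based flush are all assumptions for illustration. In a browser you could swap setTimeout for requestAnimationFrame.

```typescript
type FlushFn = (requestId: string, text: string) => void;

// Collects chunks per requestId and hands them to `flush` at most once per interval.
function createTokenBuffer(flush: FlushFn, intervalMs = 50) {
  const pending = new Map<string, string>();
  let timer: ReturnType<typeof setTimeout> | null = null;

  const flushAll = () => {
    timer = null;
    pending.forEach((text, requestId) => flush(requestId, text));
    pending.clear();
  };

  return {
    // Called for every ai:token event; cheap, no rendering involved
    push(requestId: string, chunk: string) {
      pending.set(requestId, (pending.get(requestId) ?? '') + chunk);
      if (timer === null) timer = setTimeout(flushAll, intervalMs);
    },
    // Called on ai:done or ai:error so no trailing text is left behind
    flushNow() {
      if (timer !== null) clearTimeout(timer);
      flushAll();
    },
  };
}
```

In the React component, the buffer would live in a useRef, and the flush callback would be the only place that calls setMessages for tokens. Thirty token events per second then become roughly twenty state updates at most, however fast the model emits.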
The metrics that actually matter
| Metric | Blocking (res.json) | Streaming (WebSocket) |
|---|---|---|
| Time to First Token | ~8,000ms | ~500ms |
| User experience | 8s blank screen | Immediate feedback |
| Cancellation | ❌ | ✅ (real, backend-level) |
| Orphaned streams | N/A | Handled on ws.close |
| Cost on user abandon | Full token cost | Partial cost only |
Time to First Token (TTFT) is the number that matters most. A 12-second response that starts in 500ms feels faster than a 6-second response that shows nothing until it's done. That's not a trick; it's the actual psychology of waiting.
Track TTFT. Track your cancellation rate. If users are stopping 40% of generations from a specific model or prompt, your architecture is probably fine; the model or prompt is what's failing the product.
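Because every event carries a requestId, both numbers fall out of the protocol almost for free. The tracker below is a rough sketch under my own assumptions: `createStreamMetrics` and its method names are hypothetical, and the clock is injected so the logic can be tested without real time passing.

```typescript
type Outcome = 'done' | 'cancelled' | 'error';

// Records time-to-first-token per request and the share of requests users cancelled.
function createStreamMetrics(now: () => number = Date.now) {
  const startedAt = new Map<string, number>();
  const ttftMs = new Map<string, number>();
  let finished = 0;
  let cancelled = 0;

  return {
    // Call when ai:start is sent
    onStart(requestId: string) {
      startedAt.set(requestId, now());
    },
    // Call on every ai:token; only the first one per request is recorded
    onToken(requestId: string) {
      const start = startedAt.get(requestId);
      if (start === undefined || ttftMs.has(requestId)) return;
      ttftMs.set(requestId, now() - start);
    },
    // Call once per request with its terminal outcome
    onEnd(requestId: string, outcome: Outcome) {
      startedAt.delete(requestId);
      finished += 1;
      if (outcome === 'cancelled') cancelled += 1;
    },
    ttft(requestId: string): number | undefined {
      return ttftMs.get(requestId);
    },
    cancellationRate(): number {
      return finished === 0 ? 0 : cancelled / finished;
    },
  };
}
```

In a real service you would forward these values to your metrics backend and tag them with model and prompt version, since that breakdown is what tells you where the cancellations come from.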
The partial JSON trap
This connects directly to the previous post, and it's worth naming explicitly.
Streaming is straightforward for plain text. But if your downstream logic depends on schema-valid JSON (which it should, per the Zod post), you can't parse mid-stream. Partial JSON is not valid JSON: JSON.parse() throws on it, so TicketSchema.parse() never gets a usable object to validate.
Your architecture needs a clear answer to this: either stream the display layer (text/markdown) and parse the structured output only after ai:done, or use a dedicated structured output endpoint without streaming. Mixing streaming UX with streaming structured data requires extra care.
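The first option can be sketched in a few lines: buffer the chunks per request, and only parse and validate once the stream is complete. `createStructuredCollector` is a hypothetical helper, and its `validate` parameter stands in for something like TicketSchema.parse from the previous post.

```typescript
// Buffers streamed chunks per requestId and parses them only when the stream has ended.
function createStructuredCollector<T>(validate: (value: unknown) => T) {
  const buffers = new Map<string, string>();

  return {
    // Call on every ai:token: accumulate, never parse
    onToken(requestId: string, chunk: string) {
      buffers.set(requestId, (buffers.get(requestId) ?? '') + chunk);
    },
    // Call on ai:done: the JSON is complete now, so parsing is safe.
    // Throws if the text is not valid JSON or fails validation.
    onDone(requestId: string): T {
      const raw = buffers.get(requestId) ?? '';
      buffers.delete(requestId);
      return validate(JSON.parse(raw));
    },
    // Call on ai:error or cancel: discard the partial output
    discard(requestId: string) {
      buffers.delete(requestId);
    },
  };
}
```

The display layer can still render the raw text as it arrives. The structured path simply waits for ai:done, which keeps the contract from the previous post intact.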
Final takeaway
The previous post was about protecting the system from bad AI output.
This one is about protecting the user experience from bad AI latency.
Both matter, because production AI fails in two directions: the model says the wrong thing, or the model takes too long saying the right thing. Zod handles the first. Real-time streaming handles the second.
The line between wrapping an API and shipping an AI product is exactly here: do you manage the lifecycle of the model's behavior, or do you just forward it?
Stop chasing the perfect prompt. Start designing the lifecycle around an imperfect model.
Full source code: github.com/Pawl0/ai-websocket-streaming
