
Streaming Responses in AI Applications: Server-Sent Events Deep Dive

LLM responses are slow. GPT-4 generates tokens at roughly 20-40 tokens per second. For a 500-token response, that means a 12-25 second wait before the user sees anything. Without streaming, your AI-powered feature feels broken. Users stare at a spinner, wonder if the application crashed, and close the tab. With streaming, the first token appears in under a second and the response builds naturally, like watching someone type. The perceived latency drops from 12 seconds to under 1 second, even though the total generation time is identical.

HARBOR SOFTWARE · Engineering Insights

Streaming is not optional for user-facing AI applications. It is a core UX requirement that separates products users love from products users abandon. The technical implementation involves Server-Sent Events (SSE), chunked transfer encoding, and careful state management on both client and server. Here is how we build streaming AI features at Harbor Software, covering the full stack from server to client to infrastructure.

Server-Sent Events: The Right Tool for LLM Streaming

You have three options for real-time server-to-client communication: WebSockets, long polling, and Server-Sent Events. For LLM streaming, SSE is almost always the right choice. The decision is not close.

Why not WebSockets? WebSockets are bidirectional and maintain a persistent connection. LLM streaming is fundamentally unidirectional: the server sends tokens to the client. The client sends a request and then only receives. The bidirectional overhead of WebSockets is unnecessary and adds complexity to your infrastructure. WebSockets also require special handling in load balancers, CDNs, and serverless environments. They do not reconnect automatically on connection loss. And they add a protocol upgrade step that some corporate proxies block.

Why not long polling? Long polling works but wastes resources by creating a new HTTP connection for each chunk. For a 500-token response, that is 500 HTTP connections (or batched, but still wasteful). The latency of creating new connections adds jitter to the token stream.

Why SSE? Server-Sent Events are simple, use standard HTTP, work with existing infrastructure (load balancers, proxies, CDNs), and are natively supported in all modern browsers. The browser’s EventSource API also reconnects automatically on connection loss, although that only applies when you consume the stream through EventSource rather than fetch. SSE uses a single persistent HTTP connection that the server writes to over time.

The protocol is text-based and human-readable, which makes debugging trivial:

: SSE protocol format - each event is a data line followed by a blank line (lines starting with a colon are comments)
data: {"token": "Hello"}

data: {"token": " world"}

data: {"token": ","}

data: {"token": " how"}

data: {"token": " are"}

data: {"token": " you"}

data: {"token": "?"}

data: [DONE]

Each message is prefixed with data: and separated by double newlines. That is the entire protocol. You can debug it with curl and watch the tokens arrive in real time.
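
To make the framing concrete, here is a minimal sketch of serializing events in this format. The helper name formatSseEvent is an assumption for illustration, not part of any library, and it only covers single-line JSON payloads like the ones shown above.

```javascript
// Minimal sketch: serialize one SSE event from a JSON-serializable payload.
// formatSseEvent is a hypothetical helper name, not a library function.
function formatSseEvent(payload) {
  // JSON.stringify without an indent argument escapes newlines inside strings,
  // so the payload always fits on a single "data:" line.
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Build the first two frames of the example stream.
const frames = ['Hello', ' world'].map((token) => formatSseEvent({ token }));
console.log(frames.join(''));
```

Because the blank line is what terminates an event, keeping the serialization in one helper avoids the most common mistake, which is ending a frame with a single newline.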

Server Implementation: Node.js with Express

The server acts as a proxy between your client and the LLM provider. It receives the user’s request, opens a streaming connection to the LLM, and forwards tokens to the client as they arrive. The server also handles authentication, rate limiting, input validation, and logging, none of which should be done on the client.

import express from 'express';
import OpenAI from 'openai';

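const app = express();
app.use(express.json()); // Parse JSON bodies so req.body.messages is defined
const openai = new OpenAI(); // Reads OPENAI_API_KEY from the environment

// authenticate and rateLimit are your own Express middleware (not shown here)
app.post('/api/chat/stream', authenticate, rateLimit, async (req, res) => {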
  const { messages } = req.body;

  // Validate input
  if (!messages || !Array.isArray(messages) || messages.length === 0) {
    return res.status(400).json({ error: 'messages array is required' });
  }

  // Set SSE headers - every single one matters
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache, no-transform',
    'Connection': 'keep-alive',
    'X-Accel-Buffering': 'no',         // Disable Nginx buffering
    'Access-Control-Allow-Origin': '*', // CORS if needed
  });

  // Send initial event to confirm connection
  res.write(`data: ${JSON.stringify({ type: 'start' })}\n\n`);

  let totalTokens = 0; // Counts content chunks, which only approximates the token count

  try {
    const stream = await openai.chat.completions.create({
      model: 'gpt-4',
      messages,
      stream: true,
      max_tokens: 1000,
    });

    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content;
      if (content) {
        totalTokens++;
        res.write(`data: ${JSON.stringify({ type: 'token', content })}\n\n`);
      }
    }

    // Signal completion once the upstream stream ends, whatever the
    // finish_reason ('stop', 'length', ...), so the client always gets a terminal event
    res.write(`data: ${JSON.stringify({
      type: 'done',
      usage: { totalTokens }
    })}\n\n`);
  } catch (error) {
    console.error('Stream error:', error);
    res.write(`data: ${JSON.stringify({
      type: 'error',
      message: 'An error occurred during generation'
    })}\n\n`);
  } finally {
    res.end();
  }
});

Critical details that are easy to miss and will cause debugging headaches if you get them wrong:

  • X-Accel-Buffering: no disables Nginx buffering. Without this, Nginx buffers the entire response before sending it to the client, defeating the purpose of streaming entirely. The user sees nothing for 15 seconds and then the entire response at once. This is the single most common streaming deployment issue.
  • Cache-Control: no-cache, no-transform prevents intermediate proxies from caching the stream. The no-transform directive prevents proxies from compressing the stream, which can introduce buffering delays.
  • Each data: line must end with \n\n (double newline). A single newline does not terminate the event. This is the SSE protocol requirement.
  • Always end the stream with a completion signal ([DONE] or a typed JSON marker) so the client knows when to stop listening and can clean up resources.
  • Send a start event immediately after setting headers. This confirms the connection is established and helps the client distinguish between “waiting for first token” and “connection failed.”

Client Implementation: React with Custom Hooks

On the client side, you need to consume the SSE stream and update the UI incrementally. The native EventSource API only supports GET requests, which is insufficient for chat applications that need to POST message history. Use the fetch API with the ReadableStream interface for manual stream parsing.

import { useState, useCallback, useRef } from 'react';

export function useStreamingChat() {
  const [response, setResponse] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);
  const [error, setError] = useState(null);
  const abortRef = useRef(null);

  const cancel = useCallback(() => {
    if (abortRef.current) {
      abortRef.current.abort();
      abortRef.current = null;
      setIsStreaming(false);
    }
  }, []);

  const sendMessage = useCallback(async (messages) => {
    // Cancel any previous in-flight request
    cancel();

    setResponse('');
    setIsStreaming(true);
    setError(null);

    // Keep a local reference so this call can tell whether it is still the active one
    const controller = new AbortController();
    abortRef.current = controller;

    try {
      const res = await fetch('/api/chat/stream', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ messages }),
        signal: controller.signal,
      });

      if (!res.ok) {
        throw new Error(`HTTP ${res.status}: ${res.statusText}`);
      }

      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      let buffer = '';

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        buffer += decoder.decode(value, { stream: true });

        // Parse SSE events from buffer (events are separated by a blank line)
        const events = buffer.split('\n\n');
        buffer = events.pop() || ''; // Keep incomplete event in buffer

        for (const event of events) {
          if (!event.startsWith('data:')) continue; // Skip comments such as heartbeats
          const data = event.replace(/^data: ?/, '').trim();
          if (!data || data === '[DONE]') continue;

          try {
            const parsed = JSON.parse(data);
            if (parsed.type === 'token' && parsed.content) {
              setResponse(prev => prev + parsed.content);
            }
            if (parsed.type === 'done') {
              setIsStreaming(false);
            }
            if (parsed.type === 'error') {
              setError(parsed.message);
              setIsStreaming(false);
            }
          } catch (e) {
            // Skip events whose payload is not valid JSON
          }
        }
      }
    } catch (err) {
      if (err.name === 'AbortError') return; // User cancelled, not an error
      setError(err.message);
    } finally {
      // Only reset state if a newer request has not replaced this one;
      // otherwise an aborted request would clear the new request's streaming flag
      if (abortRef.current === controller) {
        abortRef.current = null;
        setIsStreaming(false);
      }
    }
  }, [cancel]);

  return { response, isStreaming, error, sendMessage, cancel };
}

The buffer management is the trickiest part of the client implementation. SSE events can be split across multiple reader.read() chunks. A single read might return half of an event, or one and a half events. The buffer accumulates incoming data and only processes complete events (terminated by double newlines). Incomplete events stay in the buffer for the next iteration. Without proper buffering, you will see random JSON parse errors when events are split across chunks.
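
The buffering behavior is easier to see in isolation. The following is a minimal, framework-free sketch; createSseParser is a hypothetical helper name, and it assumes single-line data: events carrying JSON, as in this article. It is fed three chunks that split two events at awkward points.

```javascript
// Minimal sketch of incremental SSE parsing, assuming single-line
// "data:" events with JSON payloads. createSseParser is a hypothetical name.
function createSseParser() {
  let buffer = '';
  return {
    // Feed one decoded text chunk; returns the payloads of every event
    // that is now complete. Partial events wait in the buffer.
    push(chunk) {
      buffer += chunk;
      const parts = buffer.split('\n\n');
      buffer = parts.pop() ?? ''; // last piece is incomplete (or empty)
      return parts
        .filter((part) => part.startsWith('data:')) // ignore comment lines
        .map((part) => part.slice('data:'.length).trim())
        .filter((data) => data !== '' && data !== '[DONE]')
        .map((data) => JSON.parse(data));
    },
  };
}

const parser = createSseParser();
const first = parser.push('data: {"token":"Hel');           // half an event
const second = parser.push('lo"}\n\ndata: {"token":" wor'); // one and a half
const third = parser.push('ld"}\n\n');                      // the remainder
console.log(first, second, third);
```

The first call returns an empty array because no blank line has arrived yet; the second and third calls each return one complete event. Keeping the parser separate from React state also makes it straightforward to unit test.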

Handling Backpressure and Connection Drops

In production, connections drop. Users close tabs mid-stream. Mobile networks switch from WiFi to cellular. Server deployments restart. Your streaming implementation must handle all of these gracefully.

Server-Side: Detect Client Disconnection

app.post('/api/chat/stream', async (req, res) => {
  let aborted = false;

  // Detect client disconnection. Listen on res rather than req: since Node 16,
  // req's 'close' fires once the request has completed, not when the socket drops.
  res.on('close', () => {
    if (!res.writableFinished) {
      aborted = true;
      console.log('Client disconnected, aborting stream');
    }
  });

  res.writeHead(200, { /* SSE headers */ });

  try {
    const stream = await openai.chat.completions.create({
      model: 'gpt-4',
      messages: req.body.messages,
      stream: true,
    });

    for await (const chunk of stream) {
      if (aborted) {
        // Client disconnected, stop processing to save tokens and money
        stream.controller.abort();
        break;
      }

      const content = chunk.choices[0]?.delta?.content;
      if (content) {
        // Wrap write in try-catch in case the socket is already gone
        try {
          res.write(`data: ${JSON.stringify({ type: 'token', content })}\n\n`);
        } catch (writeError) {
          aborted = true;
          break;
        }
      }
    }
  } catch (error) {
    console.error('Stream error:', error);
  } finally {
    res.end();
  }
});

Aborting the upstream LLM stream when the client disconnects is important for cost management. If a user closes the tab after receiving half the response, you do not want to pay for the remaining tokens. For GPT-4, this can save $0.01-0.03 per aborted request. At 10,000 requests per day with a 5% abort rate, that is 500 aborted requests and $5-15/day in savings, or $150-450/month.
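
The savings estimate is simple arithmetic: 10,000 requests at a 5% abort rate is 500 aborted requests per day, each saving $0.01-0.03. A small sketch makes it easy to plug in your own traffic numbers. The function name and parameter names are assumptions for illustration, and it uses whole percents and cents to keep the arithmetic exact.

```javascript
// Rough estimate of money saved by aborting upstream generation.
// estimateAbortSavings is a hypothetical helper; inputs are whole
// percents and cents so the results are exact.
function estimateAbortSavings({ requestsPerDay, abortPercent, savedCentsPerAbort, daysPerMonth = 30 }) {
  const abortedPerDay = (requestsPerDay * abortPercent) / 100;
  const dollarsPerDay = (abortedPerDay * savedCentsPerAbort) / 100;
  return { abortedPerDay, dollarsPerDay, dollarsPerMonth: dollarsPerDay * daysPerMonth };
}

// Figures from the paragraph above: 10,000 requests/day, 5% aborted,
// 1 to 3 cents saved per aborted request.
const low = estimateAbortSavings({ requestsPerDay: 10000, abortPercent: 5, savedCentsPerAbort: 1 });
const high = estimateAbortSavings({ requestsPerDay: 10000, abortPercent: 5, savedCentsPerAbort: 3 });
console.log(low.dollarsPerDay, high.dollarsPerDay);     // 5 15
console.log(low.dollarsPerMonth, high.dollarsPerMonth); // 150 450
```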

Infrastructure Considerations

SSE streaming introduces specific infrastructure requirements that differ from standard request-response APIs. Getting these wrong is the most common cause of “streaming works locally but not in production.”

Nginx Configuration

Most Nginx configurations buffer responses by default. This completely breaks streaming. You need explicit configuration:

# Nginx configuration for SSE
location /api/chat/stream {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_set_header Connection '';
    proxy_buffering off;           # Critical: disable response buffering
    proxy_cache off;               # Disable caching for this route
    proxy_read_timeout 120s;       # Allow 2-minute streams (GPT-4 can be slow)
    proxy_send_timeout 120s;
    chunked_transfer_encoding on;
    
    # Also disable gzip for streaming routes
    gzip off;
}

Serverless Deployment

Streaming on serverless platforms requires platform-specific configurations. Each major platform has a different approach:

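// Next.js App Router with Edge Runtime (Vercel)
import OpenAI from 'openai';

export const runtime = 'edge';

const openai = new OpenAI(); // Reads OPENAI_API_KEY from the environment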

export async function POST(req) {
  const { messages } = await req.json();

  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages,
    stream: true,
  });

  const encoder = new TextEncoder();

  const readable = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of stream) {
          const content = chunk.choices[0]?.delta?.content;
          if (content) {
            controller.enqueue(
              encoder.encode(`data: ${JSON.stringify({ type: 'token', content })}\n\n`)
            );
          }
        }
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ type: 'done' })}\n\n`));
      } catch (error) {
        controller.enqueue(
          encoder.encode(`data: ${JSON.stringify({ type: 'error', message: error.message })}\n\n`)
        );
      } finally {
        controller.close();
      }
    },
  });

  return new Response(readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache, no-transform',
      'Connection': 'keep-alive',
    },
  });
}

AWS Lambda supports streaming with the RESPONSE_STREAM invoke mode. The managed Node.js runtimes support it natively; other languages require the Lambda Web Adapter or a custom runtime. Cloudflare Workers support SSE natively through the standard Response constructor with a ReadableStream body. Google Cloud Functions do not support streaming as of this writing, so you need Cloud Run instead.

Rendering Performance

Streaming tokens to the client is half the problem. Rendering them efficiently is the other half. Naively appending each token triggers a React re-render, which at 20-40 tokens per second means 20-40 re-renders per second. This causes visible jank on complex UIs, especially when you are also rendering markdown, syntax highlighting, or other formatting.

Solutions we have implemented and measured:

  • Batch state updates. Buffer tokens for 50-100ms before updating React state. The user will not notice the delay (it is well below the perception threshold of ~200ms), but your render count drops from 40/second to 10-20/second. This alone eliminated all visible jank in our applications.
  • Use requestAnimationFrame to synchronize updates with the browser’s paint cycle. This ensures you never update state more often than the browser can render.
  • Defer markdown parsing. If you are using a markdown renderer (react-markdown, marked), only re-parse when a paragraph break is detected or when streaming ends. Parsing markdown on every token is expensive and unnecessary since the user cannot read that fast anyway.
  • Virtualize long responses. For very long streaming responses (1000+ tokens), only render the visible portion of the text using a virtualized list. This keeps DOM operations constant regardless of response length.

// Batched state updates using requestAnimationFrame
import { useCallback, useRef, useState } from 'react';

function useTokenBatcher() {
  const tokenBuffer = useRef([]);
  const frameRef = useRef(null);
  const [text, setText] = useState('');

  const addToken = useCallback((token) => {
    tokenBuffer.current.push(token);

    if (!frameRef.current) {
      frameRef.current = requestAnimationFrame(() => {
        const batch = tokenBuffer.current.join('');
        tokenBuffer.current = [];
        frameRef.current = null;
        setText(prev => prev + batch);
      });
    }
  }, []);

  // Flush remaining tokens when streaming ends
  const flush = useCallback(() => {
    if (frameRef.current) {
      cancelAnimationFrame(frameRef.current);
      frameRef.current = null;
    }
    if (tokenBuffer.current.length > 0) {
      const batch = tokenBuffer.current.join('');
      tokenBuffer.current = [];
      setText(prev => prev + batch);
    }
  }, []);

  return { text, addToken, flush };
}
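
The "defer markdown parsing" idea from the list above can be sketched as a pure function that splits the streamed text at the last paragraph break. The name splitAtLastParagraph is an assumption for illustration. The intent is that only the stable part goes through the markdown renderer, while the pending tail is shown as plain text until the next break arrives or streaming ends.

```javascript
// Sketch: split streamed markdown at the last paragraph break so that only
// completed paragraphs are re-parsed. splitAtLastParagraph is a hypothetical
// helper. Caveat: it does not understand fenced code blocks, which can
// contain blank lines, so a real implementation would need to track fences.
function splitAtLastParagraph(text) {
  const idx = text.lastIndexOf('\n\n');
  if (idx === -1) {
    return { stable: '', pending: text }; // no complete paragraph yet
  }
  return {
    stable: text.slice(0, idx + 2),  // completed paragraphs, safe to parse
    pending: text.slice(idx + 2),    // paragraph still being streamed
  };
}

const { stable, pending } = splitAtLastParagraph('# Title\n\nFirst para.\n\nSecond pa');
console.log(JSON.stringify(stable));  // "# Title\n\nFirst para.\n\n"
console.log(JSON.stringify(pending)); // "Second pa"
```

Because stable only changes when a new paragraph break arrives, memoizing the rendered output on that value (for example with useMemo) means the markdown parser runs once per paragraph rather than once per token.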

Testing Streaming Endpoints

Testing streaming endpoints requires different approaches than testing regular REST endpoints. Here are the patterns we use:

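// Integration test for streaming endpoint
// BASE_URL is an assumption: point it at wherever the server under test is listening
// (fetch in Node requires an absolute URL)
const BASE_URL = process.env.TEST_BASE_URL ?? 'http://localhost:3000';

it('should stream tokens correctly', async () => {
  const response = await fetch(`${BASE_URL}/api/chat/stream`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [{ role: 'user', content: 'Say hello' }] })
  });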

  expect(response.status).toBe(200);
  expect(response.headers.get('content-type')).toBe('text/event-stream');

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const events = [];
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n\n');
    buffer = lines.pop() || '';
    for (const line of lines) {
      const data = line.replace(/^data: /, '').trim();
      if (data) events.push(JSON.parse(data));
    }
  }

  expect(events[0].type).toBe('start');
  expect(events[events.length - 1].type).toBe('done');
  expect(events.filter(e => e.type === 'token').length).toBeGreaterThan(0);
});

Conclusion

Streaming is table stakes for AI-powered user interfaces. The implementation involves SSE on the transport layer, careful buffer management on the client, abort handling for cost control, infrastructure configuration for long-lived connections, and render optimization for smooth display.

The Vercel AI SDK (ai npm package) handles much of this complexity if you are in the Next.js ecosystem. For custom setups, the patterns in this post cover the essential pieces.

A few additional production considerations worth noting. First, implement request deduplication: if a user clicks the send button twice rapidly, you should not open two streams. Use the AbortController pattern shown above to cancel the previous request before starting a new one. Second, add a heartbeat mechanism for very long streams. Send a comment event (: heartbeat) every 15 seconds during long LLM generations to keep the connection alive through proxies and load balancers that have idle timeout settings. Third, consider implementing reconnection logic on the client side: if the connection drops mid-stream, attempt to resume from the last received token by resending the request with the partial response as context.
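
The heartbeat suggestion can be sketched as a small helper. The name startHeartbeat and the injectable timers parameter are assumptions for illustration (the parameter exists only to make the helper testable); res is anything with a write method, such as the Express response used earlier.

```javascript
// Sketch of an SSE heartbeat: periodically write a comment line so proxies
// and load balancers with idle timeouts keep the connection open.
// startHeartbeat and its `timers` parameter are hypothetical names.
function startHeartbeat(res, intervalMs = 15000, timers = { setInterval, clearInterval }) {
  const { setInterval: schedule, clearInterval: cancel } = timers;
  const id = schedule(() => {
    // Lines starting with a colon are SSE comments; clients ignore them.
    res.write(': heartbeat\n\n');
  }, intervalMs);
  // Return a stop function; call it before res.end().
  return () => cancel(id);
}

// Usage inside a streaming handler (sketch):
//   const stopHeartbeat = startHeartbeat(res);
//   try { /* stream tokens */ } finally { stopHeartbeat(); res.end(); }
```

Note that a hand-written client parser has to skip these comment lines rather than pass them to JSON.parse; the browser's EventSource does so automatically.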

The most common production issues are Nginx buffering (add proxy_buffering off), incomplete event parsing (use the buffer pattern), render jank (batch token updates with requestAnimationFrame), and corporate proxy interference (some enterprise proxies buffer SSE connections regardless of headers, requiring WebSocket fallback for those environments). Get the first three right and your streaming implementation will be solid for 95%+ of users. The fourth is an edge case you address when enterprise customers report issues.

Finally, consider the user experience beyond the technical implementation. Streaming a wall of text is not inherently good UX. Think about how the streamed content is presented: use a typing indicator before the first token arrives, fade in text rather than snapping it into existence, scroll smoothly to keep the latest content visible, and provide a clear visual signal when generation is complete. The technical streaming infrastructure enables good UX, but the presentation layer is what users actually experience. A well-implemented streaming backend with poor rendering can feel worse than a non-streaming response that appears all at once after a polished loading state. The goal is perceived responsiveness, and that comes from the combination of fast first-token delivery and smooth, progressive rendering on the client.
