How I built a conversational AI assistant that runs entirely in the browser, with no server, no API keys, and no backend, using IBM Granite, Transformers.js, and the Vercel AI SDK.
Most portfolio chatbots either hit an external LLM API (OpenAI, Anthropic, etc.) requiring API keys, billing, and a backend, or they're just glorified FAQ accordions with no real intelligence.
I wanted something different: a real conversational model running entirely in the browser. No server calls, no API keys, no cold starts. The model loads once, gets cached by the browser, and every subsequent visit is instant.
The Value Proposition
| Concern | Server-Side AI | Client-Side AI |
|---|---|---|
| Infrastructure | API keys, servers, billing | Zero, static hosting only |
| Privacy | Data leaves the device | Conversations stay local |
| Latency | Network round-trips | Direct GPU inference |
| Offline | Requires connectivity | Works after first download |
| Cost | Per-token pricing | Free after initial load |
This approach works well for portfolio sites, documentation assistants, and any context where the knowledge base is bounded and the interaction is conversational.
Architecture Overview
The entire pipeline runs client-side. There are six main pieces, and each one has a specific job:
Chatbot (React Component) is the top-level UI. It renders a floating popover anchored to the bottom-right of every page with a chat bubble trigger, a message list, and an input form. It handles form submission, maps over messages to render the right content for each part type (download progress, text, reasoning), and shows a bouncing writing indicator while the model is thinking. It doesn't know anything about AI. It just consumes the API that useChatbot exposes.
useChatbot (React Hook) is the bridge between the UI and the AI engine. It checks browser support at runtime via doesBrowserSupportTransformersJS(), creates the model and transport once using useMemo, and wires everything into the AI SDK's useChat hook. It exposes a clean surface: messages, status, sendMessage, stop, and a boolean isBotLoadingResponse so the UI can disable the input during generation. If the browser doesn't support WebGPU, the hook returns supportsTransformerJs: false and the component renders nothing.
ChatbotTransport (Custom Transport) is where the real work happens. The Vercel AI SDK expects a ChatTransport implementation that knows how to send messages to a model and stream responses back. This class fulfills that contract for a browser-side model. It converts UI messages into model-compatible format, checks if the model needs downloading (and streams progress as custom message parts if so), runs streamText with reasoning middleware to separate <think> tags from visible output, and merges the result into a UIMessageStream. The transport also injects the system prompt, which is a long, detailed description of the portfolio owner that turns a generic 350M model into a personalized assistant.
worker.ts (Web Worker) keeps inference off the main thread. It's only three lines: it imports TransformersJSWorkerHandler, creates an instance, and wires it to self.onmessage. All the heavy lifting (ONNX Runtime initialization, WebGPU session management, tokenization, streaming generation) is handled internally by the handler. The main thread never blocks.
Response (Markdown Renderer) takes the raw text from the model and renders it as HTML using markdown-it with syntax highlighting (markdown-it-highlightjs) and copy-to-clipboard buttons (markdown-it-code-copy). It's wrapped in React.memo with a custom comparator that only triggers re-renders when responseText changes, which matters a lot during streaming when the parent re-renders on every token batch.
Conversation (Scroll Container) wraps the use-stick-to-bottom library to handle streaming-aware scroll behavior. It auto-sticks to the bottom while tokens arrive, smoothly scrolls new messages into view, and shows a floating arrow button when the user scrolls up to read earlier messages. This replaces the naive scrollTop = scrollHeight approach that causes janky jumps during streaming.
How They Connect
- User types a question and submits the React form inside Chatbot
- useChatbot forwards it to the AI SDK's useChat, which calls ChatbotTransport.sendMessages()
- ChatbotTransport converts the messages, checks model availability, and starts streamText with the system prompt
- The Web Worker runs Granite inference on the GPU and streams tokens back
- Tokens flow through the UIMessageStream back into useChat, which updates React state
- Chatbot re-renders with new message parts, delegating to Response for markdown and Conversation for scroll behavior
The important thing here is that this architecture reuses the same useChat hook and ChatTransport interface that server-backed AI applications use. If you ever want to swap to a server-side model, you only need to change the transport layer. The UI code stays the same.
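To make that swap concrete, here is a minimal sketch using a deliberately simplified, hypothetical transport interface (the real AI SDK ChatTransport contract carries more options per call). The point it illustrates is that the consuming code depends only on the interface, so a browser transport and a server transport are interchangeable:

```typescript
// Simplified sketch of the transport contract. This is a hypothetical,
// reduced shape; the real AI SDK ChatTransport interface has more options.
interface MinimalChatTransport {
  sendMessages(messages: string[]): Promise<string>;
}

// Client-side transport: in the real app this would run the model
// in-browser via a Web Worker.
class BrowserTransport implements MinimalChatTransport {
  async sendMessages(messages: string[]): Promise<string> {
    return `local model reply to: ${messages[messages.length - 1]}`;
  }
}

// Server-side transport: in the real app this would POST to an API route.
class ServerTransport implements MinimalChatTransport {
  async sendMessages(messages: string[]): Promise<string> {
    return `server reply to: ${messages[messages.length - 1]}`;
  }
}

// The "UI" layer only knows the interface, so swapping transports
// requires no changes here.
async function ask(transport: MinimalChatTransport, question: string): Promise<string> {
  return transport.sendMessages([question]);
}
```

Swapping from client-side to server-side inference is then a one-line change at the construction site, exactly as the article describes.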
The AI Engine: Granite on WebGPU
The chatbot uses IBM Granite 4.0 350M, a compact instruction-tuned model optimized for browser inference. Model selection matters a lot here: too large and visitors wait forever, too small and the responses become incoherent.
const MODEL_ID = "onnx-community/granite-4.0-350m-ONNX-web";
const model = transformersJS(MODEL_ID, {
device: "webgpu", // GPU-accelerated inference
dtype: "fp16", // Half-precision for speed/memory balance
worker: new Worker(
new URL("@/components/Chatbot/util/worker.ts", import.meta.url),
{ type: "module" }
),
});
Why Granite 350M?
- 350M parameters: small enough to download quickly and run on consumer GPUs
- ONNX-Web optimized: pre-converted for browser runtimes via the onnx-community pipeline
- Instruction-tuned: follows system prompts well despite its compact size
- WebGPU acceleration: uses the visitor's GPU for fast token generation
- FP16 precision: halves memory usage vs FP32 with minimal quality loss
The model is loaded through @browser-ai/transformers-js, which wraps Hugging Face's Transformers.js with a higher-level API compatible with the Vercel AI SDK.
First Visit vs Return Visit
On the first visit, the model downloads. The chatbot displays an inline progress bar during this process (more on that in the transport section). On subsequent visits, the browser's Cache Storage serves the model instantly with no network request at all. This caching behavior is automatic and requires zero configuration.
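If you want to verify this yourself, you can inspect Cache Storage from the console. The sketch below assumes a cache named "transformers-cache" (the name I'd expect Transformers.js to use by default, but check DevTools > Application > Cache Storage in your own build to confirm):

```typescript
// Sketch: check whether the browser has a Cache Storage entry for the
// model files. The cache name "transformers-cache" is an assumption;
// verify the actual name in your build's DevTools.
async function isModelLikelyCached(cacheName = "transformers-cache"): Promise<boolean> {
  // Cache Storage only exists in browsers and service workers;
  // outside of them, report "not cached" rather than throwing.
  const cs = (globalThis as { caches?: { keys(): Promise<string[]> } }).caches;
  if (!cs) return false;
  const names = await cs.keys();
  return names.includes(cacheName);
}
```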
Custom Chat Transport: The Heart of the System
The Vercel AI SDK (ai package) is designed for server-side LLMs. To use it client-side, you need a custom ChatTransport, which is the interface that tells the SDK how to send messages to the model and how to stream responses back. This is where the real architectural work happens.
export class ChatbotTransport implements ChatTransport<TransformersUIMessage> {
private readonly model: TransformersJSLanguageModel;
private readonly systemPrompt: string;
constructor(model: TransformersJSLanguageModel, options: ChatbotTransportOptions) {
this.model = model;
this.systemPrompt = options.systemPrompt;
}
async sendMessages(options): Promise<ReadableStream<UIMessageChunk>> {
const { messages, abortSignal } = options;
const prompt = await convertToModelMessages(messages);
return createUIMessageStream<TransformersUIMessage>({
execute: async ({ writer }) => {
// 1. Check model availability
// 2. Show download progress if needed
// 3. Run streaming inference with reasoning middleware
// 4. Pipe tokens to the UI message stream
},
});
}
}
Download Progress as Message Parts
Instead of a separate loading overlay, download progress is written as custom data-modelDownloadProgress message parts. The progress bar lives inside the chat conversation, which keeps the user oriented within the chat context:
let downloadProgressId: string | undefined;
const availability = await model.availability();
if (availability !== "available") {
await model.createSessionWithProgress((progress) => {
const percent = Math.round(progress * 100);
if (progress >= 1) {
writer.write({
type: "data-modelDownloadProgress",
id: downloadProgressId,
data: { status: "complete", progress: 100, message: "Model ready!" },
});
return;
}
if (!downloadProgressId) {
downloadProgressId = `download-${Date.now()}`;
}
writer.write({
type: "data-modelDownloadProgress",
id: downloadProgressId,
data: {
status: "downloading",
progress: percent,
message: `Downloading AI model... ${percent}%`,
},
});
});
}
The nice thing about this approach is that the download progress is treated as just another message part, so the AI SDK's message state management handles it automatically. No separate loading state, no extra React context, no coordination logic.
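On the rendering side, the UI just needs to map that part's data to visual state. Here is an illustrative pure helper (the shape mirrors the data written by the transport above; the helper itself is a sketch, not the actual component code):

```typescript
// Shape of the data-modelDownloadProgress part's payload,
// mirroring what the transport writes above.
type DownloadProgressData = {
  status: "downloading" | "complete";
  progress: number;
  message: string;
};

// Sketch: map a download-progress part to the UI state the chat renders.
// Keeping this pure makes the streaming UI trivial to unit-test.
function renderDownloadPart(data: DownloadProgressData): {
  icon: "spinner" | "check";
  label: string;
  width: string;
} {
  const done = data.status === "complete";
  return {
    icon: done ? "check" : "spinner",
    label: data.message,
    // Clamp so a malformed progress value never breaks the bar.
    width: `${Math.min(100, Math.max(0, data.progress))}%`,
  };
}
```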
Reasoning Middleware
The transport wraps the model with extractReasoningMiddleware to parse <think> tags. If the model reasons before answering (chain-of-thought), those tokens are separated from the visible response and can be rendered in a collapsible section:
const result = streamText({
model: wrapLanguageModel({
model,
middleware: extractReasoningMiddleware({ tagName: "think" }),
}),
system: this.systemPrompt,
messages: prompt,
abortSignal,
});
writer.merge(result.toUIMessageStream({ sendStart: false }));
This middleware integration is one of the perks of building on the AI SDK: reasoning extraction works the same way whether the model runs on a server or in the browser.
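To build intuition for what the middleware does, here is a toy version of the tag split. The real extractReasoningMiddleware operates incrementally on the token stream and handles tags split across chunks; this sketch only handles complete text:

```typescript
// Toy illustration of reasoning extraction: split a completed response
// into <think> content and visible text. The real middleware works on
// streamed chunks and copes with partial tags; this does not.
function splitReasoning(
  text: string,
  tagName = "think"
): { reasoning: string; visible: string } {
  const re = new RegExp(`<${tagName}>([\\s\\S]*?)</${tagName}>`, "g");
  const reasoning: string[] = [];
  const visible = text
    .replace(re, (_match, inner: string) => {
      reasoning.push(inner.trim()); // collect hidden chain-of-thought
      return ""; // strip it from the visible response
    })
    .trim();
  return { reasoning: reasoning.join("\n"), visible };
}
```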
System Prompt Injection
The transport injects a detailed system prompt that gives the model full context about the portfolio owner: background, experience, communication style, and how to adapt responses for different audiences (recruiters, developers, casual visitors). The Granite 350M model is small. It doesn't know everything. But it doesn't need to. The system prompt provides all the context, and the model's job is to be a conversational interface to that context, not a knowledge base.
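For a sense of the structure (the actual prompt on the site is much longer and personal; everything below is made-up, illustrative content):

```typescript
// Illustrative shape of a portfolio system prompt. The persona, facts,
// and style rules here are invented examples, not the real prompt.
const SYSTEM_PROMPT = `
You are the portfolio assistant for Jane Doe, a frontend engineer.

Facts you may use:
- 8 years of experience with React and TypeScript.
- Open to senior frontend roles.

Style:
- For recruiters: concise, outcome-focused answers.
- For developers: include technical specifics.
- If you don't know something, say so; never invent facts.
`.trim();
```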
The Web Worker: Keeping the UI Alive
AI inference is computationally expensive. Running it on the main thread would freeze the entire page: no scrolling, no animations, no interaction. The solution is a dedicated Web Worker that handles all model operations off-thread:
// worker.ts
import { TransformersJSWorkerHandler } from "@browser-ai/transformers-js";
const handler = new TransformersJSWorkerHandler();
self.onmessage = (msg: MessageEvent) => handler.onmessage(msg);
That's the entire worker file. The TransformersJSWorkerHandler abstracts away all the complexity:
- ONNX Runtime Web initialization
- WebGPU session management
- Model loading and caching
- Tokenization and inference
- Streaming token generation back to the main thread
The worker communicates with the main thread through the standard postMessage API, but all of this is abstracted by the transformersJS() model wrapper. From the transport's perspective, the model is just an async function that yields tokens.
The React Hook: useChatbot
The useChatbot hook is the bridge between React and the AI engine. It handles three concerns: browser capability detection, transport initialization, and exposing a clean API to the UI component:
export function useChatbot() {
// 1. Runtime feature detection
const supportsTransformerJs = useMemo(
() => doesBrowserSupportTransformersJS(), []
);
// 2. Create transport (model + system prompt), only once
const chatTransport = useMemo(() => {
if (!supportsTransformerJs) return null;
const model = transformersJS(MODEL_ID, {
device: "webgpu",
dtype: "fp16",
worker: new Worker(
new URL("@/components/Chatbot/util/worker.ts", import.meta.url),
{ type: "module" }
),
});
return new ChatbotTransport(model, { systemPrompt: SYSTEM_PROMPT });
}, [supportsTransformerJs]);
// 3. Wire up the AI SDK's useChat
const { error, status, messages, sendMessage, stop } =
useChat<TransformersUIMessage>({
transport: chatTransport!,
experimental_throttle: 50,
id: "chatbot-pakybot",
});
// 4. Expose a clean API
return {
supportsTransformerJs,
messages,
status,
isBotLoadingResponse: status === "submitted" || status === "streaming",
sendMessage: (text: string) => sendMessage({ text }),
stop,
};
}
Notable Implementation Details
- experimental_throttle: 50 batches React state updates every 50ms instead of on every single token. Without this, streaming a response triggers hundreds of re-renders per second, causing visible jank.
- doesBrowserSupportTransformersJS() is runtime feature detection for WebGPU + WASM. If the browser doesn't support these APIs (rare in 2026, but possible), the chatbot silently hides itself instead of crashing.
- Memoized transport means the model and worker are created once and reused across messages. The useMemo dependency on supportsTransformerJs ensures we never create a worker in unsupported browsers.
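A feature-detection helper along these lines might look as follows (a sketch; the actual doesBrowserSupportTransformersJS may perform additional checks). WebGPU is exposed as navigator.gpu, and WASM as the global WebAssembly object, so both checks fail safely outside a capable browser:

```typescript
// Sketch of runtime feature detection for browser inference.
// The real helper's checks may differ; this shows the idea.
function supportsBrowserInference(): boolean {
  // WebGPU: exposed as navigator.gpu in supporting browsers.
  const nav = (globalThis as { navigator?: { gpu?: unknown } }).navigator;
  const hasWebGPU = !!nav && "gpu" in nav && nav.gpu != null;

  // WASM: the global WebAssembly object with a usable instantiate().
  const wasm = (globalThis as { WebAssembly?: { instantiate?: unknown } }).WebAssembly;
  const hasWasm = !!wasm && typeof wasm.instantiate === "function";

  return hasWebGPU && hasWasm;
}
```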
The UI: A Streaming Popover Chat
The chatbot lives in a Popover component (built on Radix UI) anchored to the bottom-right corner of every page. It's rendered inside the Footer, always available but never intrusive:
<Popover>
<PopoverTrigger asChild className="fixed bottom-4 right-1 z-50">
<button className="btn btn-ghost btn-circle">
<BsFillChatQuoteFill size={40} />
</button>
</PopoverTrigger>
<PopoverContent className="w-96">
{/* Title + description */}
<Conversation className="h-[200px] no-scrollbar">
<ConversationContent>
{/* Message rendering */}
</ConversationContent>
<ConversationScrollButton />
</Conversation>
{/* Input form */}
</PopoverContent>
</Popover>
Message Rendering
Each message is rendered based on its parts array, which is the AI SDK's structured message format:
- data-modelDownloadProgress: inline progress bar with spinner
- text: markdown-rendered response via the <Response /> component
- reasoning: collapsible thinking section (when the model uses <think> tags)
- User messages: plain text in a chat bubble
The UI disables the input while the model is generating (status !== "ready"), preventing double-submissions. A bouncing writing icon (TbWriting) appears while the model processes the initial request before tokens start streaming.
Scroll Behavior: Sticky Streaming
Streaming chat interfaces need scroll behavior that feels natural. The naive approach of setting scrollTop = scrollHeight on every render causes jarring jumps and fights with user scrolling. The use-stick-to-bottom library (the same one used by ChatGPT-like interfaces) solves this nicely:
<Conversation className="h-[200px] no-scrollbar">
<ConversationContent>
{/* messages */}
</ConversationContent>
<ConversationScrollButton className="btn-sm animate-bounce" />
</Conversation>
This gives you three behaviors out of the box:
- Auto-stick during streaming: as new tokens arrive, the scroll stays pinned to the bottom
- Smooth scroll on new messages: user messages and bot responses scroll into view with animation
- Scroll-to-bottom button: if the user scrolls up to read earlier messages, a floating arrow button appears to jump back down
The Conversation component wraps StickToBottom with sensible defaults (initial="smooth", resize="smooth", role="log" for accessibility), while ConversationScrollButton uses the useStickToBottomContext hook to conditionally render the jump-to-bottom affordance.
Markdown Rendering with Syntax Highlighting
Bot responses are rendered as markdown using markdown-it with two plugins for syntax highlighting and a copy-to-clipboard button on code blocks:
const responseContent = md()
.use(highlightjs)
.use(codecopy, {
buttonClass: 'btn btn-ghost btn-circle btn-sm',
iconClass: 'fa fa-solid fa-clone text-neutral-content'
})
.render(responseText);
The <Response /> component is memoized with a custom comparator so it only re-renders when responseText actually changes:
export const Response = memo(
function Response({ responseText, className }) {
const responseContent = useMemo(
() => md().use(highlightjs).use(codecopy, { /* config */ }).render(responseText),
[responseText]
);
return (
<div
className={cn('prose', className)}
dangerouslySetInnerHTML={{ __html: responseContent }}
/>
);
},
(prev, next) => prev.responseText === next.responseText,
);
This matters during streaming: the component receives the same text with a few more characters appended each time. Without memoization, every intermediate render would re-parse the entire markdown string. The useMemo on responseText ensures the markdown-to-HTML conversion only runs when the text actually changes, not on every parent re-render.
What Makes This Special
Zero Infrastructure
There is no server. No API. No database. No billing. No rate limits. No cold starts. The entire AI pipeline runs in the visitor's browser. The portfolio is a static Next.js export hosted on GitHub Pages, and the chatbot adds intelligence without adding infrastructure.
Privacy by Design
User conversations never leave the device. There are no analytics on chat content, no logging, no telemetry. The model runs locally: what you ask stays on your machine.
Offline Capable
After the first model download, the chatbot works offline. The ONNX model files are cached in the browser's Cache Storage. Combined with a service worker, the entire portfolio (including the chatbot) could function without an internet connection.
Standard AI SDK Interface
Despite running client-side, the chatbot uses the same useChat hook and ChatTransport interface as server-backed AI applications. This means:
- Swapping to a server-side model requires changing only the transport
- Tool calling can be added by extending the transport
- The streaming UI code is identical to what you'd write for any AI SDK application
Progressive Enhancement
The chatbot checks for WebGPU support at runtime. If the visitor's browser doesn't support it, the chatbot simply doesn't render. No error, no broken UI, no console spam. The feature enhances the experience for capable browsers without degrading it for others.
Limitations and Trade-offs
Model Size vs Quality
Granite 350M is impressive for its size, but it's not GPT-4. Responses can be repetitive on complex questions, occasionally off-topic, and limited in reasoning depth. This is a deliberate trade-off: a larger model would mean a longer download and higher GPU memory usage, which would exclude many visitors.
First-Visit Download
The initial model download is ~350MB. On slow connections, this can take a while. The inline progress bar helps set expectations, but it's still a significant payload for a portfolio website. After the first download, it's cached and instant.
Browser Support
WebGPU is required. As of 2026, this covers Chrome, Edge, and Firefox on desktop, plus Chrome on Android. Safari support is still experimental. Visitors on unsupported browsers simply won't see the chatbot, which is the whole point of the progressive enhancement approach.
No Conversation Persistence
Messages are stored in React state. Closing the popover or refreshing the page resets the conversation. This is intentional: the chatbot is designed for quick questions, not long sessions.
Conclusion
Building a fully client-side AI chatbot is not just a technical exercise. It's a statement about how AI can be deployed without the overhead of servers, API keys, or recurring costs. The combination of WebGPU for hardware acceleration, Transformers.js for model inference, and the Vercel AI SDK for the streaming chat interface creates a stack that is both powerful and portable.
The key takeaways:
- The AI SDK's transport abstraction is the architectural linchpin. It lets you swap between server-side and client-side models without touching the UI.
- Web Workers are non-negotiable for inference. Running models on the main thread will freeze your application.
- Progressive enhancement is the right pattern for cutting-edge browser APIs. Check support at runtime, enhance when available, degrade silently when not.
- System prompts do the heavy lifting in small models. The 350M parameter model is effective because the prompt provides all the domain context it needs.
Happy coding!