Case Study · April 2026

h1nt: Building a Tiger Parent AI Interview Coach

The Core Bet

Most interview prep is passive consumption. You watch a walkthrough, nod along, and build false confidence. The actual skill — articulating a solution under pressure, defending tradeoffs, adapting when an interviewer pushes back — only develops through active retrieval. You have to be forced to produce an answer, not just recognize one.

h1nt is built around that constraint. You get a personalized question bank, a streaming AI tutor that evaluates your answers and asks follow-ups, and a mock interview locked behind a Readiness Score. The tutor's persona is the "Tiger Parent" — demanding, specific, not satisfied with hand-wavy answers. If you say "I think this is O(n log n)," it asks you to prove it.

System Design: Five Stages, One Direction

The application has a strict linear flow: onboarding → study plan → tutor → mock → results. State flows forward — there's no branching between stages mid-session.

I chose Zustand for state management and persisted each store to localStorage. Session state is split across three stores, each owning the slice of data for one part of the flow.

The linear flow meant I could keep each store relatively flat — no store needs to know about the others' internal state, only the data it hands forward to the next stage. In hindsight, the session lifecycle would benefit from an explicit state machine (the onboarding → tutor → mock transitions have implicit guards that a state machine would make clearer), but Zustand was the right call for moving fast on v1.

The Question Priority Queue

Questions aren't served randomly. On each call to "next question," the selector runs a priority function over the full question bank:

  1. Unseen questions in identified weak-area topics, sorted by difficulty ascending
  2. Unseen questions in all other topics, sorted by difficulty ascending
  3. Previously failed questions, sorted by difficulty ascending

"Weak areas" are derived lazily — after every three answered questions, the system scans failedQuestionIds for topic patterns and updates the weak area set in the store. No server-side ML, no separate inference step. It's a scoring function over localStorage state.
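The selector and the weak-area derivation can be sketched as two pure functions over store state. Type names (`Question`, `SessionState`, `pickNextQuestion`, `deriveWeakTopics`) and the two-failure threshold are illustrative, not the app's actual code:

```typescript
// Illustrative types; the real store shapes may differ.
type Question = { id: string; topic: string; difficulty: number };
type SessionState = {
  seenIds: Set<string>;
  failedQuestionIds: string[];
  weakTopics: Set<string>;
};

function pickNextQuestion(bank: Question[], s: SessionState): Question | undefined {
  const byDifficulty = (a: Question, b: Question) => a.difficulty - b.difficulty;
  const unseen = bank.filter((q) => !s.seenIds.has(q.id));

  // Tier 1: unseen questions in weak-area topics, easiest first
  const tier1 = unseen.filter((q) => s.weakTopics.has(q.topic)).sort(byDifficulty);
  if (tier1.length) return tier1[0];

  // Tier 2: unseen questions in all other topics, easiest first
  const tier2 = unseen.filter((q) => !s.weakTopics.has(q.topic)).sort(byDifficulty);
  if (tier2.length) return tier2[0];

  // Tier 3: previously failed questions, easiest first
  return bank
    .filter((q) => s.failedQuestionIds.includes(q.id))
    .sort(byDifficulty)[0];
}

// Re-derived after every three answers: flag any topic with repeated
// failures (the >= 2 threshold here is an assumption).
function deriveWeakTopics(bank: Question[], failedIds: string[]): Set<string> {
  const counts = new Map<string, number>();
  for (const q of bank) {
    if (failedIds.includes(q.id)) {
      counts.set(q.topic, (counts.get(q.topic) ?? 0) + 1);
    }
  }
  return new Set([...counts].filter(([, n]) => n >= 2).map(([t]) => t));
}
```

The whole thing stays synchronous and local, which is what makes "no server-side ML" viable here.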

This is simpler than it sounds and works better than I expected. Most users have real weak areas, and surfacing them early creates the "it knows me" feeling that makes the product feel smart.

The Streaming Tutor: Request Flow and Context Management

The tutor lives at /api/tutor — a Next.js 16 App Router route that streams a response via the Vercel AI SDK v6 with Claude Sonnet 4.6 as the model. Each request carries the current question, the user's latest answer, the tutor mode, and a trimmed slice of recent conversation.

I deliberately don't send the full conversation history — only the last few turns. Sending everything grows the context window fast, increases latency, and doesn't improve response quality for a Q&A tutor (the model doesn't need to remember what happened five questions ago, just the current exchange). The tradeoff is that the tutor can't reference earlier questions by memory, but I haven't found that to be a real UX gap.
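The trim itself is a one-liner before the request goes out. The `Turn` shape and the `MAX_TURNS` value here are assumptions for illustration:

```typescript
// Minimal sketch of the context trim: keep only the last few turns.
type Turn = { role: "user" | "assistant"; content: string };

const MAX_TURNS = 6; // roughly three user/assistant exchanges (assumed value)

function buildRequestMessages(history: Turn[], userAnswer: string): Turn[] {
  // slice(-MAX_TURNS) drops everything older than the recent window
  return [...history.slice(-MAX_TURNS), { role: "user", content: userAnswer }];
}
```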

Evaluate vs. Teach Mode

The route handles two distinct behaviors. In evaluate mode, the prompt instructs the model to assess the answer: grade it, explain what was right and wrong, push back if the reasoning is incomplete, but don't just give the answer away. In teach mode (triggered by "I Don't Know"), the prompt switches: explain the concept fully and clearly, since the user has already admitted they don't know it.

I tried a single prompt with a conditional instruction ("if the user says they don't know, explain instead of evaluate") and it worked sometimes, but the mode boundary was fuzzy. Splitting into explicit mode-specific prompts made the behavior reliable.
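The reliable version is just a lookup keyed by an explicit mode. The prompt text below is paraphrased from this post, not the app's real system prompts:

```typescript
// Explicit mode split: one prompt per behavior, selected up front.
type TutorMode = "evaluate" | "teach";

const SYSTEM_PROMPTS: Record<TutorMode, string> = {
  evaluate:
    "Grade the answer. Explain what was right and wrong. Push back on " +
    "incomplete reasoning. Do not give the answer away.",
  teach:
    "The user has said they don't know. Explain the concept fully and clearly.",
};

function systemPromptFor(mode: TutorMode): string {
  return SYSTEM_PROMPTS[mode];
}
```

The point is that the branch happens in code, where it's deterministic, instead of inside the prompt, where it's probabilistic.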

The Empty-Stream Bug

For a stretch in development, the tutor response streamed in as empty. The API route was firing, Claude was responding (visible in server logs), but the client rendered nothing.

The Vercel AI SDK v6 has two streaming response helpers: toUIMessageStreamResponse() and toTextStreamResponse(). I was using the former, which wraps the stream in a structured message envelope — metadata headers, message IDs, role annotations. That format is designed for the SDK's built-in client hooks (useChat, etc.) which parse the envelope automatically.

My client was reading the stream as raw text with a ReadableStream decoder. It read the envelope prefix, couldn't parse it as content, and produced an empty string. Switching to toTextStreamResponse() — which emits raw token chunks with no wrapper — fixed it. One line change after about two hours of debugging.

The deeper lesson: streaming APIs have implicit contracts between server response format and client reader. If you're managing message state yourself instead of using the SDK's hooks, use the simplest format. The structured message stream is powerful for the SDK's use case; it's friction for everyone else.
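The "raw text" side of that contract is simple enough to show. This is a sketch of the decode step a hand-rolled client performs on the chunks `toTextStreamResponse()` emits — shown over an array of chunks for testability; in the app they arrive via `response.body.getReader()`:

```typescript
// Decode raw token chunks into a growing string, as a streaming
// client would. stream: true lets TextDecoder handle UTF-8 sequences
// split across chunk boundaries.
function decodeTextChunks(chunks: Uint8Array[]): string {
  const decoder = new TextDecoder();
  let text = "";
  for (const chunk of chunks) {
    text += decoder.decode(chunk, { stream: true });
  }
  return text + decoder.decode(); // flush any trailing bytes
}
```

A reader like this assumes every byte is content. Point it at a structured message envelope instead and it faithfully decodes metadata it can't use, which is exactly the empty-render failure described above.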

Monaco Editor and the SSR Problem

Coding and LeetCode-style questions render a Monaco editor — the VS Code engine. The UX reason is straightforward: if someone is practicing a coding interview, the input should feel like an IDE, not a textarea. Syntax highlighting and keyboard shortcuts matter.

The integration problem is that Monaco manipulates the DOM directly on load and has no server-side rendering path. In Next.js, a naive import fails during server rendering. The fix is dynamic import with ssr: false:

const MonacoEditor = dynamic(
  () => import('@monaco-editor/react'),
  { ssr: false }
);

This defers the Monaco bundle entirely to the client. The question type system ('text' | 'coding' | 'leetcode') determines which input renders. Text questions get a <textarea>; coding questions get Monaco initialized with the user's preferred language from the interview store.
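The render branch reduces to a small pure function over the question type union (name is illustrative):

```typescript
// Map question type to input kind: only plain text questions get a
// textarea; both coding variants get the Monaco editor.
type QuestionType = "text" | "coding" | "leetcode";

function inputKindFor(type: QuestionType): "textarea" | "monaco" {
  return type === "text" ? "textarea" : "monaco";
}
```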

The Readiness Score Gate

The mock interview is locked until the Readiness Score hits 70/100. The score is a weighted sum: each question carries a topic weight and a difficulty multiplier; passing adds points, failing deducts proportionally. Score is normalized to 0–100 and updated every three questions.
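The scoring shape described above can be sketched as follows. The weights, multipliers, and normalization are illustrative; as noted later, the real values are calibrated by feel:

```typescript
// Weighted readiness score: passing adds points, failing deducts the
// same amount, and the running total maps onto 0-100.
type Attempt = { topicWeight: number; difficultyMultiplier: number; passed: boolean };

function readinessScore(attempts: Attempt[]): number {
  let score = 0;
  let maxScore = 0;
  for (const a of attempts) {
    const points = a.topicWeight * a.difficultyMultiplier;
    maxScore += points;
    score += a.passed ? points : -points; // failing deducts proportionally
  }
  if (maxScore === 0) return 0;
  // Map the range [-maxScore, +maxScore] onto 0-100
  return Math.round(((score + maxScore) / (2 * maxScore)) * 100);
}

const MOCK_UNLOCK_THRESHOLD = 70;

const mockUnlocked = (attempts: Attempt[]) =>
  readinessScore(attempts) >= MOCK_UNLOCK_THRESHOLD;
```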

The gate is designed to prevent a specific failure mode: users who skip the tutor, go straight to the mock, fail, and don't come back. Users who've put in tutor time arrive at the mock with real confidence and better outcomes. Gating by demonstrated readiness — a behavioral signal — is more accurate than gating by time-on-page.

Mock mode enforces deliberate constraints: no hints, no "I Don't Know" option, timed responses. The final report breaks down scores by topic and links back to the tutor for targeted review.

Question History as a Derived View

The left panel of the tutor UI shows a list of answered questions with pass/fail indicators. Clicking one loads a read-only view of that question's full exchange.

The implementation is clean because of one early decision: every TutorMessage in the Zustand store is tagged with the questionId it belongs to. The review view is just a filter on the flat message array. There's no separate "review" store, no additional API calls — the conversation array is the source of truth and the question list is a derived view of it.

Tagging messages with question IDs seems obvious in hindsight. The alternative I almost went with — a parallel review store that duplicates the relevant messages — would have created a sync problem with no upside. Worth noting: the right data structure decision is the one that makes derived state cheap to compute.

What I'd Do Differently

Model the session as a state machine from day one. The five stages — onboarding, study plan, tutor, mock, results — have real transition guards (you can't enter mock without a 70+ readiness score; you can't enter tutor without a generated study plan). Right now those guards are scattered across component-level conditionals. An explicit state machine with typed transitions would make the flow much easier to test and extend.
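A minimal version of that machine, using the stages and guards the post describes (study plan before tutor, 70+ score before mock); the `Session` shape and function names are assumptions:

```typescript
// Explicit linear state machine with typed transition guards.
type Stage = "onboarding" | "studyPlan" | "tutor" | "mock" | "results";
type Session = { hasStudyPlan: boolean; readinessScore: number };

// Guard: may the session *enter* this stage?
const GUARDS: Record<Stage, (s: Session) => boolean> = {
  onboarding: () => true,
  studyPlan: () => true,
  tutor: (s) => s.hasStudyPlan,
  mock: (s) => s.readinessScore >= 70,
  results: () => true,
};

const ORDER: Stage[] = ["onboarding", "studyPlan", "tutor", "mock", "results"];

function advance(current: Stage, session: Session): Stage {
  const next = ORDER[ORDER.indexOf(current) + 1];
  if (!next) return current; // results is terminal
  return GUARDS[next](session) ? next : current; // guard failed: stay put
}
```

With the guards centralized like this, the component-level conditionals collapse into calls to `advance`, and each transition is testable in isolation.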

Prototype the Tiger Parent prompt in isolation. I iterated the system prompt maybe a dozen times before it felt right. Earlier versions were either too passive (restating the question), too harsh (penalizing correct-but-imprecise answers), or too verbose (writing paragraphs when two sentences would do). If I were starting over, I'd prototype the persona against a fixed set of test questions in a standalone script before embedding it in the app. Faster feedback loop.

The readiness score weights need real data. Right now they're calibrated by feel, not against actual interview outcomes. The scoring formula is directionally right but the specific weights are arbitrary. Instrumenting pass/fail data at the mock stage and running even a simple regression would give the formula a real empirical basis.

h1nt is live at h1nt.vercel.app. Source on GitHub. If you use it and have thoughts, I want to hear them.
