
Every LLM conversation eventually hits a wall. History grows, explanations lengthen, users paste additional snippets, and tool outputs add noise. Yet models have fixed limits on how many tokens they can process. Naively sending the entire history every time means exceeding token limits, spiking costs, or suffering abrupt truncations that drop the very details your user cares about. For BittsAI, an AI-powered learning platform, I needed a more adaptive system. I built a context manager that keeps the “working set” small and relevant, preserving long-term continuity without sacrificing performance during deep study sessions.
With BittsAI, I wanted context management to be predictable, easy to implement, and user-first. In practice, that meant three core guarantees: treat context as a budget rather than a hard limit, compress proactively before that limit is reached, and degrade gracefully when anything fails.
Instead of letting the model hit a hard limit, I treated the entire context as a dynamic budget. This allowed me to proactively compress older parts of the conversation into a running incremental summary, managing memory gracefully.
At its core, the ContextManager transforms an unbounded chat history into a predictable, right-sized payload.
Lifecycle of a request
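Before getting into the design choices, it helps to see the shape of a request end to end: count tokens, compare against a threshold, then take either the fast path or the optimization path. The sketch below illustrates that flow only; the class and method names, the naive length-based token counter, and the placeholder optimizer are simplifications for this article, not the actual BittsAI implementation.

```typescript
// Illustrative lifecycle sketch: names and the crude token counter are
// assumptions for this article, not the real BittsAI code.
interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
}

interface ContextWindow {
  messages: ChatMessage[];
  summary: string;
  totalTokens: number;
  truncatedCount: number;
}

class ContextManagerSketch {
  constructor(private contextWindow: number, private threshold = 0.8) {}

  buildContext(messages: ChatMessage[], systemPromptTokens: number): ContextWindow {
    const totalTokens =
      systemPromptTokens +
      messages.reduce((sum, m) => sum + this.countTokens(m.content), 0);

    // Fast path: everything fits under the summarization threshold
    if (totalTokens <= Math.floor(this.contextWindow * this.threshold)) {
      return { messages, summary: '', totalTokens, truncatedCount: 0 };
    }
    // Optimization path: enforce the budget and summarize what doesn't fit
    return this.optimize(messages, totalTokens);
  }

  private countTokens(text: string): number {
    // Crude stand-in for a real tokenizer such as gpt-tokenizer
    return Math.ceil(text.length / 4);
  }

  private optimize(messages: ChatMessage[], totalTokens: number): ContextWindow {
    // Placeholder: the real path enforces the budget and summarizes overflow
    const kept = messages.slice(-4);
    return {
      messages: kept,
      summary: '(incremental summary placeholder)',
      totalTokens,
      truncatedCount: messages.length - kept.length,
    };
  }
}
```

The single entry point is what makes the behavior predictable: every request takes exactly one of two paths, and the cheap path is the default.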
When building the ContextManager, RAG (Retrieval Augmented Generation) was an option. RAG uses vector embeddings and similarity search to retrieve relevant messages, which can surface specific facts from long conversations. For BittsAI, I chose incremental summarization for simplicity and fit. Here is why:
RAG would make sense if:
- users need specific facts retrieved verbatim from deep in very long conversation histories
- the product already runs embedding and vector-search infrastructure it can reuse

When summarization works better:
- narrative flow and a coherent running context matter more than exact recall
- you want to avoid the operational overhead and cost of vector databases
- responses must stay fast and cheap enough for real-time chat
For BittsAI, summarization provides the right balance of simplicity and effectiveness. The system maintains coherent context without the operational overhead of vector infrastructure, while still preserving the essential information needed for meaningful learning conversations.
Accurate token counting is essential for budget decisions, but re-tokenizing every message on every request is slow. The system avoids this by caching token counts when messages are saved. When a user sends a message, the route handler calculates inputTokens using gpt-tokenizer and stores it in message.metadata.inputTokens. When the model finishes streaming, the AI SDK's onFinish event provides usage data including outputTokens, inputTokens, totalTokens, reasoningTokens, and cachedInputTokens, which are persisted to the assistant message's metadata. The ContextManager then uses a priority system: it trusts these cached token counts first, and only falls back to re-tokenization when metadata is missing. This keeps most requests fast while staying accurate when cached counts aren't available.
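A minimal sketch of that priority order, assuming a metadata shape like the one described above (the helper name and the injected tokenizer parameter are illustrative):

```typescript
interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
  // Persisted at save time (user messages) or via onFinish (assistant messages)
  metadata?: { inputTokens?: number; outputTokens?: number };
}

// Trust cached counts first; re-tokenize only when metadata is missing.
function getMessageTokens(
  message: ChatMessage,
  tokenize: (text: string) => number // e.g. gpt-tokenizer's encode(...).length
): number {
  const cached = message.metadata?.inputTokens ?? message.metadata?.outputTokens;
  if (typeof cached === 'number' && cached > 0) {
    return cached; // fast path: no tokenization needed
  }
  return tokenize(message.content); // slow path: cache miss
}
```

Injecting the tokenizer also makes the fallback easy to test with a cheap stand-in.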
The ContextManager triggers summarization at 80% of the model's context window, not 100%. The 20% safety margin leaves headroom for the model's response and accounts for token-counting variance. If the total token count (messages + system prompt) is under this threshold, the system returns everything as-is, the fastest path, avoiding database queries and summarization calls. Once over 80%, it switches to the optimization path: enforcing the budget, summarizing older messages, and building the compressed context. This threshold balances keeping full context when possible and proactively compressing before hitting limits.
Here's the threshold calculation:
```typescript
/**
 * Calculate the threshold at which we trigger summarization.
 * Set to 80% of the context window to leave room for the response,
 * avoid hard cutoffs, and account for token-counting variance.
 */
private calculateSummaryThreshold(): number {
  return Math.floor(this.config.contextWindow * SUMMARY_THRESHOLD_PERCENTAGE);
}
```

And the fast path check:
```typescript
// If under threshold, return all messages without summarization
if (totalTokens <= summaryThreshold) {
  return {
    originalMessages: allMessages,
    messages: allMessages,
    summary: '',
    totalTokens,
    truncatedCount: 0,
  };
}
```

Once the threshold is crossed, the system enforces the token budget before summarizing. The enforceTokenBudget method starts with the existing summary and the new user message as the minimum required payload, then works backwards from the newest recent messages, adding each one if it fits. Messages that don't fit are marked as "dropped" and will be included in the summary. This ensures nothing is lost: dropped recent messages are preserved in the summary rather than discarded.
Here's the budget calculation:
```typescript
/**
 * Calculates available token budget for messages.
 * Budget = contextWindow - maxOutputTokens - systemPromptTokens
 */
private getAvailableBudget(): number {
  return this.config.contextWindow - this.config.maxOutputTokens - this.config.systemPromptTokens;
}
```

And the enforcement logic:
```typescript
private enforceTokenBudget(
  recentMessages: ChatMessage[],
  summaryTokens: number,
  newMessage: ChatMessage
): { fitting: ChatMessage[]; dropped: ChatMessage[] } {
  const availableBudget = this.getAvailableBudget();
  const newMessageTokens = this.getMessageTokens(newMessage);

  // Start with the minimum required payload: summary + new message
  let currentTokens = summaryTokens + newMessageTokens;

  // If even the minimum doesn't fit, return all as dropped
  if (currentTokens > availableBudget) {
    return { fitting: [], dropped: recentMessages };
  }

  // Add recent messages from newest to oldest until budget exhausted
  const fitting: ChatMessage[] = [];
  const dropped: ChatMessage[] = [];
  for (let i = recentMessages.length - 1; i >= 0; i--) {
    const message = recentMessages[i];
    const messageTokens = this.getMessageTokens(message);
    if (currentTokens + messageTokens <= availableBudget) {
      fitting.push(message);
      currentTokens += messageTokens;
    } else {
      // This message and all older ones don't fit;
      // collect dropped messages (from index 0 to i inclusive)
      for (let j = 0; j <= i; j++) {
        dropped.push(recentMessages[j]);
      }
      break;
    }
  }

  // Restore order, oldest first
  fitting.reverse();
  return { fitting, dropped };
}
```

The key insight is that budget enforcement happens first, before summarization. This determines which recent messages fit in context, and those that don't fit are combined with older unsummarized messages for summarization. This preserves continuity: even if a recent message is too large to keep verbatim, its content flows into the summary.
After budget enforcement, the system merges older unsummarized messages with any dropped recent messages and generates the summary incrementally. If a summary already exists, the model updates it, integrating new information and removing outdated content. This keeps the summary compact (targeting 200–400 tokens) while preserving important context. After generation, the summary is persisted to Supabase.
The summary generation uses fast and affordable models with a prompt optimized for educational conversations. It prioritizes study topics, knowledge levels, and key clarifications while excluding pleasantries and redundant information. The incremental approach means each summary update builds on the previous one, keeping the token count manageable even as conversations grow.
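The exact prompt isn't reproduced here, but its incremental structure can be sketched as follows; the wording below is an illustrative stand-in for the real BittsAI prompt, not a copy of it:

```typescript
interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
}

// Build the summarization prompt. When a previous summary exists, the model
// is asked to update it rather than summarize from scratch.
function buildSummaryPrompt(existingSummary: string, toSummarize: ChatMessage[]): string {
  const transcript = toSummarize
    .map((m) => `${m.role}: ${m.content}`)
    .join('\n');
  const base = existingSummary
    ? `Update this summary by integrating new information and removing outdated content:\n${existingSummary}\n\n`
    : '';
  return (
    `${base}Summarize the conversation below in 200-400 tokens. ` +
    `Prioritize study topics, the learner's knowledge level, and key ` +
    `clarifications; omit pleasantries and redundant detail.\n\n${transcript}`
  );
}
```

Because each update folds the previous summary back in, the prompt stays roughly constant in size no matter how long the conversation gets.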
After generating a summary, the system persists it to Supabase and marks messages as summarized. This two-step process can drift if one step fails: messages could be marked as summarized while the summary isn't saved, or vice versa. To reduce drift, the summary upsert retries once on failure. If it still fails after retry, the error is logged but doesn't block the response; the next request will regenerate the summary if needed. When marking messages as summarized, the system uses batched parallel updates with Promise.allSettled to handle partial failures gracefully.
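The retry-and-mark flow can be sketched like this; saveSummary and markSummarized are hypothetical stand-ins for the real Supabase calls, which aren't shown here:

```typescript
// Sketch of the two-step persistence: upsert the summary (retrying once),
// then mark messages as summarized, tolerating partial failures.
async function persistSummary(
  saveSummary: () => Promise<void>,
  markSummarized: (id: string) => Promise<void>,
  messageIds: string[]
): Promise<void> {
  // Step 1: upsert the summary, retrying once before giving up
  try {
    await saveSummary();
  } catch {
    try {
      await saveSummary(); // single retry
    } catch (err) {
      console.error('Summary upsert failed after retry', err);
      return; // don't block the response; the next request can regenerate
    }
  }
  // Step 2: mark messages as summarized; allSettled tolerates partial failures
  const results = await Promise.allSettled(messageIds.map((id) => markSummarized(id)));
  const failed = results.filter((r) => r.status === 'rejected').length;
  if (failed > 0) console.warn(`${failed} messages not marked as summarized`);
}
```

Returning early when the upsert ultimately fails is deliberate: it prevents the worst drift case, where messages are marked summarized but no summary was saved.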
Even with retries and careful error handling, failures can still occur: database timeouts, summarization API errors, or edge cases where the summary plus the new message alone exceeds the budget. The ContextManager includes a fallback path that ensures the user always gets a response. If the optimized path fails at any point, it falls back to simpleRecencyTruncation: a straightforward algorithm that works backwards from the most recent messages, adding each one to the context until the budget is exhausted. This fallback preserves any existing summary if it fits, and uses cached data when available to avoid redundant database calls. The result is a response that may lose some older context, but still provides a coherent reply based on recent conversation.
```typescript
/**
 * Simple recency-based truncation fallback.
 * Keeps as many recent messages as fit within the token budget.
 * Used when optimized context building fails.
 */
simpleRecencyTruncation(messages: ChatMessage[], existingSummary?: string): ContextWindow {
  const availableBudget = this.getAvailableBudget();
  let currentTokens = 0;
  const selectedMessages: ChatMessage[] = [];

  // Account for summary tokens if we have one
  const summaryTokens = this.getSummaryTokens(existingSummary);
  currentTokens += summaryTokens;

  // Work backwards from most recent, adding messages while under budget
  for (let i = messages.length - 1; i >= 0; i--) {
    const message = messages[i];
    const messageTokens = this.getMessageTokens(message);
    if (currentTokens + messageTokens <= availableBudget) {
      selectedMessages.push(message);
      currentTokens += messageTokens;
    } else {
      break;
    }
  }

  selectedMessages.reverse();
  return {
    originalMessages: messages,
    messages: selectedMessages,
    summary: existingSummary || '',
    totalTokens: currentTokens + this.config.systemPromptTokens,
    truncatedCount: messages.length - selectedMessages.length,
  };
}
```

The fallback is triggered in two scenarios: when any step of the optimized path throws (a database timeout or a summarization API error, for example), and when even the minimum payload of summary plus new message exceeds the available budget.
This ensures that even when the sophisticated optimization path fails, the user still gets a coherent response based on the most recent conversation context.
Building BittsAI's ContextManager showed that managing finite LLM context is a systems problem, not just prompt engineering. By treating context as a budget, counting tokens, prioritizing recent messages, and incrementally summarizing older content, the system keeps conversations coherent even as they grow to hundreds of messages. The key insight is enforcing the budget first, then summarizing what doesn't fit, ensuring nothing is lost. Combined with careful persistence logic that prevents drift, the system degrades gracefully even when things go wrong.
Choosing summarization over RAG prioritized simplicity and fit. For conversational learning apps, maintaining narrative flow matters more than retrieving specific facts verbatim. The result is a system that's easier to maintain, cheaper to run, and fast enough for real-time chat without the operational overhead of vector databases. For anyone building LLM-powered applications, context management isn't optional. Whether you choose summarization, RAG, or a hybrid approach, having a strategy for managing finite context windows is essential for building production-ready AI applications.