Context Window
The context window is the maximum span of tokens (words, sub-words, or bytes) that a language model can attend to when generating its next token. In a Transformer, this span covers the prompt plus the tokens the model has generated so far, capped by a limit fixed during training (e.g., 4 k tokens for the original GPT-3.5 Turbo, 1 M for Gemini 1.5 Pro). Anything beyond that cutoff is effectively "forgotten," so developers trim, chunk, or summarize long documents to fit, as sketched below.

Retrieval-Augmented Generation (RAG) works around the limit by injecting only the top-k most relevant passages into the prompt, while sliding-window and recurrent attention extend effective memory in streaming applications. The tokenizer also affects capacity: Byte-Pair Encoding merges frequent character sequences into single tokens, so it packs more words into the same token budget than character-level tokenization.

Choosing a window size balances latency, cost, and hallucination risk: larger windows reduce truncation but demand more GPU memory and can dilute attention over the content that actually matters.
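To make the truncation step concrete, here is a minimal sketch of token-budget trimming. The tiktoken package is assumed only as a convenient BPE tokenizer; any tokenizer exposing encode/decode works the same way:

```python
# A minimal sketch of token-budget truncation using tiktoken (an assumption;
# any tokenizer with encode/decode would do).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a BPE vocabulary used by several OpenAI models

def trim_to_budget(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens tokens of text."""
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])

long_document = "Lorem ipsum dolor sit amet. " * 5_000  # stand-in for a long document
prompt = trim_to_budget(long_document, max_tokens=4_096)
```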
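The RAG pattern can be sketched as follows. Here `search` is a hypothetical stand-in for a vector-store query function and `k` the number of passages to retrieve; no specific library's API is implied:

```python
# A hedged sketch of RAG prompt assembly: retrieve the top-k passages and
# splice them into the prompt ahead of the question, so only the most
# relevant text consumes the context window.
from typing import Callable

def rag_prompt(question: str,
               search: Callable[[str, int], list[str]],  # hypothetical retriever
               k: int = 3) -> str:
    passages = search(question, k)            # top-k retrieved chunks
    context = "\n\n".join(passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```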
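Sliding-window attention can be illustrated with a plain NumPy mask. This is a sketch of the masking idea only, not any particular model's implementation:

```python
# An illustrative sliding-window attention mask: position i may attend only
# to itself and the previous window - 1 positions, so the attended span stays
# constant however long the stream grows.
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask; entry [i, j] is True where query i may attend to key j."""
    i = np.arange(seq_len)[:, None]      # query positions (rows)
    j = np.arange(seq_len)[None, :]      # key positions (columns)
    return (j <= i) & (j > i - window)   # causal and within the window

print(sliding_window_mask(6, 3).astype(int))
```

Because each row of the mask admits at most `window` keys, memory and compute per generated token stay bounded regardless of stream length, which is why the pattern suits streaming applications.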
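The tokenizer's effect on capacity is easy to measure directly, again assuming tiktoken's BPE vocabulary as a representative example:

```python
# A quick comparison of BPE versus character-level token counts for the
# same text, using tiktoken's BPE as an assumed example tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Context windows are measured in tokens, not characters."

bpe_count = len(enc.encode(text))   # BPE: one token can cover several characters
char_count = len(text)              # character tokenizer: one token per character
print(f"BPE: {bpe_count} tokens vs. character-level: {char_count} tokens")
```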