Context Window

Wojciech Achtelik
AI Engineer Lead
Published: July 3, 2025
Glossary Category: RAG

Context Window is the maximum span of recent tokens (words, sub-words, or bytes) that a language model can attend to when generating its next token. In a Transformer, this span covers the input sequence plus the model's own outputs so far, capped by a token limit set during training (e.g., 4 k–16 k tokens for GPT-3.5 Turbo, 1 M for Gemini 1.5 Pro). Anything beyond that cutoff is "forgotten," so developers trim, chunk, or summarize long documents to fit.

Retrieval-Augmented Generation (RAG) works around the limit by injecting only the top-k most relevant retrieved passages into the prompt, while sliding-window and recurrent attention extend effective memory in streaming applications. Tokenizers also affect capacity: Byte-Pair Encoding (BPE) packs several characters into each token, so the same window holds far more text than character-level tokenization would.

Choosing the right window size balances latency, cost, and hallucination risk: larger windows reduce truncation, but they demand more GPU memory and can dilute the model's attention to the content that actually matters.
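To make the token arithmetic concrete, here is a minimal sketch using the open-source tiktoken tokenizer. The cl100k_base encoding and the helper name truncate_to_budget are illustrative assumptions, not part of any model's API; pick whatever encoding matches your target model. It counts tokens, truncates to a budget, and shows how BPE packs more text into the same window than character-level tokenization would:

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the BPE encoding used by GPT-3.5/GPT-4-era models
# (assumption: swap in the encoding that matches your model).
enc = tiktoken.get_encoding("cl100k_base")

def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens tokens of text (blunt truncation)."""
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])

sample = "Context windows are measured in tokens, not characters."
print(len(sample))              # character count (= token count under character tokenization)
print(len(enc.encode(sample)))  # BPE token count: far smaller for the same text

doc = sample * 2000
clipped = truncate_to_budget(doc, 4096)
print(len(enc.encode(clipped)))  # <= 4096; everything past the cutoff is "forgotten"
```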
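Chunking is the usual alternative to blunt truncation. The sketch below splits a document into overlapping token windows so every chunk fits a retrieval index's size limit; chunk_tokens, chunk_size, and overlap are hypothetical names for illustration, not a library API:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_tokens(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping token windows so every chunk fits the budget.

    The overlap keeps a sentence that straddles a boundary visible in two
    adjacent chunks, at the cost of some duplicated tokens in the index.
    """
    assert 0 <= overlap < chunk_size
    tokens = enc.encode(text)
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the tail is already covered; avoid redundant slivers
    return chunks
```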
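On the RAG side, the top-k injection described above amounts to packing ranked passages into the prompt until the window budget, minus room reserved for the model's answer, runs out. A hypothetical build_rag_prompt helper sketches this; window_limit, reserve_for_answer, and the prompt template are assumptions for illustration:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def build_rag_prompt(question: str, ranked_chunks: list[str],
                     window_limit: int = 8192,
                     reserve_for_answer: int = 1024) -> str:
    """Greedily pack the highest-ranked chunks until the window budget is spent."""
    header = "Answer using only the context below.\n\nContext:\n"
    footer = f"\n\nQuestion: {question}\nAnswer:"
    budget = window_limit - reserve_for_answer - len(enc.encode(header + footer))
    picked = []
    for chunk in ranked_chunks:              # assumed sorted best-first (top-k)
        cost = len(enc.encode(chunk)) + 1    # +1 for the joining newline
        if cost > budget:
            break                            # window is full; drop lower-ranked hits
        picked.append(chunk)
        budget -= cost
    return header + "\n".join(picked) + footer
```

Reserving tokens for the answer up front matters because the window covers the prompt and the generated output together; a prompt that fills the entire window leaves the model no room to respond.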

Want to learn how these AI concepts work in practice?

Understanding AI is one thing. Explore how we apply these AI principles to build scalable, agentic workflows that deliver real ROI and value for organizations.

Last updated: July 28, 2025