Context Window
The context window is the maximum span of tokens (words, sub-words, or bytes) that a language model can attend to when generating its next token. In a Transformer, this span covers the prompt plus the tokens the model has generated so far, capped by a limit fixed during training (e.g., 4 k tokens for the original GPT-3.5 Turbo, 1 M for Gemini 1.5 Pro). Anything beyond that cutoff is effectively "forgotten," so developers trim, chunk, or summarize long documents to fit, as sketched below.

Retrieval-Augmented Generation (RAG) works around the limit by injecting only the top-k most relevant passages into the prompt, while sliding-window and recurrent attention extend effective memory in streaming applications. The tokenizer also affects capacity: Byte-Pair Encoding merges frequent character sequences into single tokens, so it packs more words into the same token budget than character-level tokenization.

Choosing a window size balances latency, cost, and hallucination risk: larger windows reduce truncation but demand more GPU memory and can dilute attention over the content that actually matters.
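To make the truncation step concrete, here is a minimal sketch of token-budget trimming. The tiktoken package is assumed only as a convenient BPE tokenizer; any tokenizer exposing encode/decode works the same way:

```python
# A minimal sketch of token-budget truncation using tiktoken (an assumption;
# any tokenizer with encode/decode would do).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a BPE vocabulary used by several OpenAI models

def trim_to_budget(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens tokens of text."""
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])

long_document = "Lorem ipsum dolor sit amet. " * 5_000  # stand-in for a long document
prompt = trim_to_budget(long_document, max_tokens=4_096)
```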
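The RAG pattern can be sketched as follows. Here `search` is a hypothetical stand-in for a vector-store query function and `k` the number of passages to retrieve; no specific library's API is implied:

```python
# A hedged sketch of RAG prompt assembly: retrieve the top-k passages and
# splice them into the prompt ahead of the question, so only the most
# relevant text consumes the context window.
from typing import Callable

def rag_prompt(question: str,
               search: Callable[[str, int], list[str]],  # hypothetical retriever
               k: int = 3) -> str:
    passages = search(question, k)            # top-k retrieved chunks
    context = "\n\n".join(passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```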
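Sliding-window attention can be illustrated with a plain NumPy mask. This is a sketch of the masking idea only, not any particular model's implementation:

```python
# An illustrative sliding-window attention mask: position i may attend only
# to itself and the previous window - 1 positions, so the attended span stays
# constant however long the stream grows.
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask; entry [i, j] is True where query i may attend to key j."""
    i = np.arange(seq_len)[:, None]      # query positions (rows)
    j = np.arange(seq_len)[None, :]      # key positions (columns)
    return (j <= i) & (j > i - window)   # causal and within the window

print(sliding_window_mask(6, 3).astype(int))
```

Because each row of the mask admits at most `window` keys, memory and compute per generated token stay bounded regardless of stream length, which is why the pattern suits streaming applications.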
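The tokenizer's effect on capacity is easy to measure directly, again assuming tiktoken's BPE vocabulary as a representative example:

```python
# A quick comparison of BPE versus character-level token counts for the
# same text, using tiktoken's BPE as an assumed example tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Context windows are measured in tokens, not characters."

bpe_count = len(enc.encode(text))   # BPE: one token can cover several characters
char_count = len(text)              # character tokenizer: one token per character
print(f"BPE: {bpe_count} tokens vs. character-level: {char_count} tokens")
```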