Context Window Management and Long Context Strategies

Guide to LLM context window limits, long context handling, chunking strategies, summarization, and context compression techniques.

Veni AI Technical Team | December 30, 2024 | 5 min read

The context window is the maximum number of tokens an LLM can process in a single request, counting both the prompt and the generated output. How that budget is spent directly affects the cost, latency, and answer quality of AI applications.

Context Window Limits

Model Comparison

| Model | Context Length (tokens) | Approx. Words |
|---|---|---|
| GPT-3.5 Turbo | 16K | 12,000 |
| GPT-4 Turbo | 128K | 96,000 |
| Claude 3 Opus | 200K | 150,000 |
| Gemini 1.5 Pro | 1M | 750,000 |
| Llama 3 | 8K-128K | 6,000-96,000 |
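
To make the table's rule of thumb concrete, the sketch below converts a word count into an approximate token count using the roughly 0.75 words-per-token ratio implied by the table, then checks whether a document fits once some room is reserved for the model's reply. The function names, the `reserved_output_tokens` parameter, and the dictionary of limits are illustrative assumptions; actual limits depend on the exact model version.

```python
# Approximate context limits taken from the table above (in tokens).
# "K" is treated as 1,000 for simplicity; real limits vary by model version.
CONTEXT_LIMITS = {
    "GPT-3.5 Turbo": 16_000,
    "GPT-4 Turbo": 128_000,
    "Claude 3 Opus": 200_000,
    "Gemini 1.5 Pro": 1_000_000,
}

WORDS_PER_TOKEN = 0.75  # rule of thumb for English text, implied by the table


def words_to_tokens(word_count: int) -> int:
    """Estimate the token count for a given number of English words."""
    return round(word_count / WORDS_PER_TOKEN)


def fits_in_context(word_count: int, model: str, reserved_output_tokens: int = 1_000) -> bool:
    """Check whether a document fits, leaving room for the model's reply."""
    budget = CONTEXT_LIMITS[model] - reserved_output_tokens
    return words_to_tokens(word_count) <= budget


print(words_to_tokens(12_000))                   # 16000
print(fits_in_context(12_000, "GPT-3.5 Turbo"))  # False: no room left for output
print(fits_in_context(12_000, "GPT-4 Turbo"))    # True
```

A 12,000-word document nominally matches a 16K window, but it only fits in practice if nothing is reserved for the response, which is why the check subtracts an output budget first.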

Token Calculation

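```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens exactly with the tokenizer that matches the model."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a general-purpose encoding
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

def estimate_tokens(text: str) -> int:
    """Quick estimate without a tokenizer: ~4 characters per token (English)."""
    return len(text) // 4
```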

Chunking Strategies

Fixed-Size Chunking

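```python
def fixed_size_chunk(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # Sizes are measured in characters, not tokens
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Avoid a trailing chunk that only repeats the overlap
        start = end - overlap

    return chunks
```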
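
As a rough aid for sizing, the sketch below computes the chunk boundaries for a given text length and compares the result with a closed-form count. It assumes a variant of fixed-size chunking that stops as soon as a chunk reaches the end of the text; the helper names `chunk_spans` and `expected_chunk_count` are hypothetical and introduced only for this illustration.

```python
import math


def chunk_spans(text_len: int, chunk_size: int = 1000, overlap: int = 200) -> list[tuple[int, int]]:
    """Return (start, end) offsets of overlapping fixed-size chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    spans = []
    start = 0
    while start < text_len:
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:  # the last chunk reached the end of the text
            break
        start += step
    return spans


def expected_chunk_count(text_len: int, chunk_size: int = 1000, overlap: int = 200) -> int:
    """Closed-form count matching chunk_spans."""
    if text_len == 0:
        return 0
    if text_len <= chunk_size:
        return 1
    return 1 + math.ceil((text_len - chunk_size) / (chunk_size - overlap))


spans = chunk_spans(2600)
print(spans)  # [(0, 1000), (800, 1800), (1600, 2600)]
print(len(spans) == expected_chunk_count(2600))  # True
```

Each chunk after the first contributes only `chunk_size - overlap` new characters, so a larger overlap raises the number of chunks, and with it the number of downstream LLM or embedding calls.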

Semantic Chunking

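```python
# In newer LangChain releases this class is imported from the
# langchain_text_splitters package instead.
from langchain.text_splitter import RecursiveCharacterTextSplitter

def semantic_chunk(text: str, chunk_size: int = 1000) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,  # Measured in characters because length_function=len
        chunk_overlap=200,
        # Try paragraph breaks first, then lines, sentences, words, characters
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len
    )

    return splitter.split_text(text)
```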
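
For readers who want to see the idea without a LangChain dependency, here is a minimal standard-library sketch that splits on sentence boundaries and packs whole sentences into chunks up to a character budget. The function name `sentence_chunk` and the regular expression are illustrative assumptions, not part of LangChain, and a sentence longer than the budget is kept whole rather than split.

```python
import re


def sentence_chunk(text: str, chunk_size: int = 1000) -> list[str]:
    """Pack whole sentences into chunks of at most chunk_size characters."""
    # Split after ., ! or ? followed by whitespace (a rough heuristic).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size or not current:
            # Fits, or it is an oversized sentence starting an empty chunk.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


sample = "First sentence. Second one is here! A third? Yes."
print(sentence_chunk(sample, chunk_size=30))
# ['First sentence.', 'Second one is here! A third?', 'Yes.']
```

Unlike fixed-size chunking, no sentence is cut in half, at the price of chunks with uneven lengths.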

Document Structure Aware

```python
def structure_aware_chunk(document: str) -> list[dict]:
    """Split a Markdown document into one chunk per header section."""
    chunks = []
    current_section = ""
    current_header = ""

    for line in document.split("\n"):
        # Header detection (Markdown-style "#" headers)
        if line.startswith("#"):
            # Flush the previous section, even if it has a header but no body
            if current_header or current_section.strip():
                chunks.append({
                    "header": current_header,
                    "content": current_section.strip()
                })
            current_header = line
            current_section = ""
        else:
            current_section += line + "\n"

    if current_header or current_section.strip():
        chunks.append({
            "header": current_header,
            "content": current_section.strip()
        })

    return chunks
```

Context Compression

Summarization

```python
from openai import OpenAI

client = OpenAI()  # Reads the API key from the OPENAI_API_KEY environment variable

def compress_context(text: str, max_tokens: int = 2000) -> str:
    current_tokens = count_tokens(text)

    if current_tokens <= max_tokens:
        return text

    # Summarize with LLM. The limit in the prompt is only a request;
    # the model may return a longer summary.
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": f"Summarize the following text in under {max_tokens} tokens. "
                           "Preserve important information."
            },
            {"role": "user", "content": text}
        ]
    )

    return response.choices[0].message.content
```

Extractive Compression

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def extractive_compress(text: str, ratio: float = 0.3) -> str:
    # Naive sentence split; drop empty fragments
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    if len(sentences) <= 1:
        return text

    # Find important sentences with TF-IDF
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences)

    # Importance score of each sentence: sum of its TF-IDF weights
    scores = np.array(tfidf_matrix.sum(axis=1)).flatten()

    # Select most important sentences
    num_sentences = max(1, int(len(sentences) * ratio))
    top_indices = np.argsort(scores)[-num_sentences:]
    top_indices = sorted(top_indices)  # Preserve original order

    return ". ".join(sentences[i] for i in top_indices)
```

Sliding Window

Conversation History Management

```python
class SlidingWindowMemory:
    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _trim(self):
        while self._total_tokens() > self.max_tokens and len(self.messages) > 2:
            # Preserve the system message; delete the oldest user/assistant message
            if self.messages[0]["role"] == "system":
                self.messages.pop(1)
            else:
                self.messages.pop(0)

    def _total_tokens(self) -> int:
        return sum(count_tokens(m["content"]) for m in self.messages)

    def get_messages(self) -> list:
        return self.messages.copy()
```
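
To see the trimming behaviour without a tokenizer or an API call, the same policy can be sketched as a standalone function. This version uses the 4-characters-per-token estimate instead of an exact count, and the function name `trim_history` is an assumption made for this illustration.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token for English text."""
    return len(text) // 4


def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest non-system messages until the history fits the budget."""
    trimmed = list(messages)  # work on a copy; the input list is left unchanged

    def total() -> int:
        return sum(estimate_tokens(m["content"]) for m in trimmed)

    while total() > max_tokens and len(trimmed) > 2:
        # Index 0 is protected when it holds the system message.
        drop_at = 1 if trimmed[0]["role"] == "system" else 0
        trimmed.pop(drop_at)
    return trimmed


history = [
    {"role": "system", "content": "You are a helpful assistant."},  # 28 chars -> 7 tokens
    {"role": "user", "content": "a" * 400},       # 100 tokens
    {"role": "assistant", "content": "b" * 400},  # 100 tokens
    {"role": "user", "content": "c" * 200},       # 50 tokens
]

kept = trim_history(history, max_tokens=160)
print([m["role"] for m in kept])  # ['system', 'assistant', 'user']
```

The history starts at 257 estimated tokens; removing the oldest user message brings it to 157, which fits the 160-token budget, so trimming stops there. Note that this policy can leave an assistant reply without the user message that prompted it.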

Document Processing Window

```python
def process_long_document(document: str, query: str, window_size: int = 8000):
    # window_size is in characters, because semantic_chunk measures length with len()
    chunks = semantic_chunk(document, chunk_size=window_size)
    results = []

    for i, chunk in enumerate(chunks):
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {
                    "role": "system",
                    "content": "Analyze the given text chunk."
                },
                {
                    "role": "user",
                    "content": f"Text:\n{chunk}\n\nQuestion: {query}"
                }
            ]
        )

        results.append({
            "chunk_index": i,
            "response": response.choices[0].message.content
        })

    # Combine results. synthesize_results is not defined in this article;
    # it stands for any function that merges the per-chunk responses.
    return synthesize_results(results, query)
```
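
The article does not show `synthesize_results`, so the following is only one possible shape for it, not the original implementation. The sketch separates prompt construction from the model call: `build_synthesis_prompt` and the `complete` parameter are hypothetical names, and `complete` stands for any function that sends a prompt to an LLM and returns the text reply.

```python
from typing import Callable


def build_synthesis_prompt(results: list[dict], query: str) -> str:
    """Merge per-chunk responses into a single prompt for a final LLM call."""
    parts = [
        f"[Chunk {r['chunk_index']}]\n{r['response'].strip()}"
        for r in sorted(results, key=lambda r: r["chunk_index"])
        if r["response"] and r["response"].strip()  # skip empty responses
    ]
    findings = "\n\n".join(parts)
    return (
        f"Findings from individual chunks:\n{findings}\n\n"
        f"Question: {query}\n\n"
        "Combine the findings into one answer and note any contradictions."
    )


def synthesize_results(results: list[dict], query: str,
                       complete: Callable[[str], str]) -> str:
    """Hypothetical helper: `complete` sends a prompt to an LLM and returns text."""
    if not results:
        return "No chunks were processed."
    return complete(build_synthesis_prompt(results, query))


# Stand-in for a real LLM call so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return f"(model reply to a {len(prompt)}-character prompt)"


demo = [
    {"chunk_index": 1, "response": "Revenue grew in Q2."},
    {"chunk_index": 0, "response": "Revenue was flat in Q1."},
]
print(build_synthesis_prompt(demo, "How did revenue change?"))
print(synthesize_results(demo, "How did revenue change?", fake_llm))
```

To use this with `process_long_document`, `complete` could be a small wrapper around `client.chat.completions.create`, passed as the third argument. Keeping the model call injectable also makes the merging logic testable without network access.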

Map-Reduce Pattern

Long Document QA

```python
def map_reduce_qa(document: str, question: str):
    chunks = semantic_chunk(document, chunk_size=4000)

    # Map: Analyze each chunk separately
    partial_answers = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {
                    "role": "user",
                    "content": f"Text:\n{chunk}\n\nQuestion: {question}\n\n"
                               "Answer based on this text chunk. "
                               "If no information, say 'No information in this chunk'."
                }
            ]
        )
        partial_answers.append(response.choices[0].message.content)

    # Reduce: Combine answers
    combined = "\n\n".join([
        f"Source {i+1}: {ans}"
        for i, ans in enumerate(partial_answers)
    ])

    final_response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "user",
                "content": f"Information from different sources:\n{combined}\n\n"
                           f"Question: {question}\n\n"
                           "Provide a comprehensive answer by synthesizing all information."
            }
        ]
    )

    return final_response.choices[0].message.content
```

Retrieval Augmented Context

Smart Context Selection

```python
def select_relevant_context(query: str, documents: list, max_tokens: int = 4000):
    # get_embedding and cosine_similarity are assumed helpers (an embedding
    # API call and a vector similarity function); neither is defined here.

    # Embedding-based relevance
    query_embedding = get_embedding(query)

    scored_docs = []
    for doc in documents:
        doc_embedding = get_embedding(doc["content"])
        score = cosine_similarity(query_embedding, doc_embedding)
        scored_docs.append({"doc": doc, "score": score})

    # Sort by relevance
    scored_docs.sort(key=lambda x: x["score"], reverse=True)

    # Add documents until the token limit is reached
    selected = []
    current_tokens = 0

    for item in scored_docs:
        doc_tokens = count_tokens(item["doc"]["content"])
        if current_tokens + doc_tokens <= max_tokens:
            selected.append(item["doc"])
            current_tokens += doc_tokens
        else:
            break  # Stop at the first document that does not fit

    return selected
```
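
The selection logic can be tried offline with toy vectors. The sketch below is a standard-library illustration, not the article's helpers: it implements cosine similarity directly and, as a deliberate variation, skips a document that does not fit and keeps looking for smaller ones, with an optional `min_score` cutoff. The name `select_by_budget`, the two-dimensional embeddings, and the token counts are all made up for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_by_budget(query_vec, docs, max_tokens, min_score=0.0):
    """Rank docs by similarity, then add those that fit the token budget."""
    scored = [(cosine_similarity(query_vec, d["embedding"]), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    selected, used = [], 0
    for score, doc in scored:
        if score < min_score:
            break  # everything after this is even less relevant
        if used + doc["tokens"] <= max_tokens:
            selected.append(doc["id"])
            used += doc["tokens"]
    return selected


# Toy 2-dimensional "embeddings" and made-up token counts for illustration.
docs = [
    {"id": "pricing", "embedding": [1.0, 0.0], "tokens": 300},
    {"id": "refunds", "embedding": [0.8, 0.6], "tokens": 900},
    {"id": "careers", "embedding": [0.0, 1.0], "tokens": 200},
]
query_vec = [1.0, 0.1]

print(select_by_budget(query_vec, docs, max_tokens=600))                 # ['pricing', 'careers']
print(select_by_budget(query_vec, docs, max_tokens=600, min_score=0.5))  # ['pricing']
```

The first call shows the trade-off: skipping the oversized document fills the budget better, but it also lets in a weakly related one, which is why a relevance threshold is worth considering alongside the token limit.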

Long Context Best Practices

1. Prompt Positioning

```python
def optimize_prompt_position(context: str, query: str) -> str:
    """Keep the first and last chunks intact and compress the middle.

    Models tend to recall information at the start and end of a long
    prompt better than information in the middle ("Lost in the Middle").
    """

    chunks = semantic_chunk(context)

    # Preserve first and last chunks
    if len(chunks) > 2:
        middle = chunks[1:-1]
        compressed_middle = compress_context(" ".join(middle))
        context = f"{chunks[0]}\n\n{compressed_middle}\n\n{chunks[-1]}"

    return f"Context:\n{context}\n\n---\n\nQuestion: {query}"
```
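
A related way to act on the "Lost in the Middle" observation, when the context consists of retrieved passages already ranked by relevance, is to reorder them so the strongest ones sit at the edges of the prompt. The sketch below is an illustrative alternative rather than the approach used above, and `reorder_for_edges` is a hypothetical name.

```python
def reorder_for_edges(docs_by_relevance: list[str]) -> list[str]:
    """Place the most relevant items at the start and end of the context.

    Input is sorted from most to least relevant. Items are dealt alternately
    to the front and the back, so the least relevant end up in the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # A is the most relevant
print(reorder_for_edges(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

Here the two most relevant passages, A and B, land in the first and last positions, while the least relevant one, E, ends up in the middle. No LLM call is needed, so this costs nothing extra compared with compressing the middle.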

2. Hierarchical Processing

```python
def hierarchical_summarize(document: str, levels: int = 2):
    """Hierarchical summarization: summarize chunks, then summarize the summaries."""

    current_text = document

    for level in range(levels):
        chunks = semantic_chunk(current_text, chunk_size=4000)

        summaries = []
        for chunk in chunks:
            summary = compress_context(chunk, max_tokens=500)
            summaries.append(summary)

        current_text = "\n\n".join(summaries)

    return current_text
```

3. Attention Sinks

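```python
def add_attention_anchors(prompt: str) -> str:
    """Wrap the first and last 500 characters of a prompt in marker tags."""
    if len(prompt) <= 1000:
        # Too short to split into start, middle, and end without overlap
        return prompt

    return f"""
[IMPORTANT START]
{prompt[:500]}
[/IMPORTANT]

{prompt[500:-500]}

[IMPORTANT END]
{prompt[-500:]}
[/IMPORTANT]
"""
```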

Monitoring and Debugging

```python
from datetime import datetime

import numpy as np

class ContextMonitor:
    def __init__(self):
        self.logs = []

    def log_request(self, messages: list, model: str):
        total_tokens = sum(count_tokens(m["content"]) for m in messages)

        self.logs.append({
            "timestamp": datetime.now(),
            "model": model,
            "input_tokens": total_tokens,
            "message_count": len(messages)
        })

        # Alerts
        if total_tokens > 100000:
            print(f"⚠️ High token count: {total_tokens}")

    def get_stats(self):
        # Guard against an empty log, where max() would raise
        if not self.logs:
            return {"avg_tokens": 0, "max_tokens": 0, "total_requests": 0}

        return {
            "avg_tokens": np.mean([entry["input_tokens"] for entry in self.logs]),
            "max_tokens": max(entry["input_tokens"] for entry in self.logs),
            "total_requests": len(self.logs)
        }
```

Conclusion

Context window management is critical for the scalability and cost of LLM applications. You can work effectively with long documents using chunking, compression, and smart retrieval strategies.

At Veni AI, we develop long context AI solutions.
