# Context Window Management and Long-Context Strategies

A context window is the maximum number of tokens an LLM can process at once. How well an application manages that budget directly affects its performance.

## Context Window Limits

### Model Comparison
| Model | Context length (tokens) | Approx. words |
|---|---|---|
| GPT-3.5 Turbo | 16K | 12,000 |
| GPT-4 Turbo | 128K | 96,000 |
| Claude 3 Opus | 200K | 150,000 |
| Gemini 1.5 Pro | 1M | 750,000 |
| Llama 3 | 8K-128K | 6,000-96,000 |
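
The word counts in the table follow a rule of thumb of roughly 0.75 words per token. The sketch below applies that ratio and checks whether a prompt fits a given window. It assumes the window must hold both the prompt and the generated reply; the dictionary keys, `approx_words`, `fits_in_window`, and the 1,000-token output reserve are illustrative choices, not part of any API.

```python
# A few context sizes copied from the table above. The keys are display labels,
# not API model identifiers.
CONTEXT_WINDOWS = {
    "GPT-3.5 Turbo": 16_000,
    "Claude 3 Opus": 200_000,
    "Gemini 1.5 Pro": 1_000_000,
}

# Ratio implied by the table (16K tokens is listed as about 12,000 words).
WORDS_PER_TOKEN = 0.75


def approx_words(tokens: int) -> int:
    """Convert a token count to an approximate English word count."""
    return round(tokens * WORDS_PER_TOKEN)


def fits_in_window(model: str, prompt_tokens: int,
                   reserved_output_tokens: int = 1_000) -> bool:
    """Return True if the prompt plus the reserved reply budget fits the window."""
    return prompt_tokens + reserved_output_tokens <= CONTEXT_WINDOWS[model]


if __name__ == "__main__":
    print(approx_words(CONTEXT_WINDOWS["Claude 3 Opus"]))
    print(fits_in_window("GPT-3.5 Turbo", prompt_tokens=15_500))
```

Reserving part of the window for the reply is a design choice: a prompt that fills the window exactly leaves the model no room to answer.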
### Token Counting
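```python
import tiktoken


def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens exactly with the tokenizer that matches `model`."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a general-purpose encoding
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))


def estimate_tokens(text: str) -> int:
    """Quick estimate: roughly 4 characters per token for English text."""
    return len(text) // 4
```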
## Chunking Strategies

### Fixed-Size Chunking
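```python
def fixed_size_chunk(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of `chunk_size` characters that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        # An overlap >= chunk_size would never advance and loop forever
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Last chunk reached; avoids emitting a redundant tail chunk
        start = end - overlap  # Step back so neighbouring chunks share context

    return chunks
```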
### Semantic Chunking
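```python
# Newer LangChain releases expose this class from the `langchain_text_splitters` package
from langchain.text_splitter import RecursiveCharacterTextSplitter


def semantic_chunk(text: str, chunk_size: int = 1000) -> list[str]:
    """Split on paragraph, line, sentence, then word boundaries, in that order."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,  # Measured in characters because length_function=len
        chunk_overlap=200,
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len,
    )

    return splitter.split_text(text)
```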
### Document-Structure-Aware Chunking
```python
def structure_aware_chunk(document: str) -> list[dict]:
    """Split a Markdown document into one chunk per header section."""
    chunks = []
    current_section = ""
    current_header = ""

    for line in document.split("\n"):
        # Header detection: a Markdown header line starts with "#"
        if line.startswith("#"):
            # Flush the section collected under the previous header
            if current_section.strip():
                chunks.append({
                    "header": current_header,
                    "content": current_section.strip()
                })
            current_header = line
            current_section = ""
        else:
            current_section += line + "\n"

    # Flush the final section
    if current_section.strip():
        chunks.append({
            "header": current_header,
            "content": current_section.strip()
        })

    return chunks
```
## Context Compression

### Summarization
```python
from openai import OpenAI

# Assumption: the article never defines `client`; the calls below match the OpenAI SDK.
client = OpenAI()  # Reads the API key from the OPENAI_API_KEY environment variable


def compress_context(text: str, max_tokens: int = 2000) -> str:
    """Summarize `text` with an LLM when it exceeds `max_tokens`."""
    current_tokens = count_tokens(text)

    if current_tokens <= max_tokens:
        return text

    # Summarize with LLM
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": f"Summarize the following text under {max_tokens} tokens. "
                           "Preserve important information."
            },
            {"role": "user", "content": text}
        ]
    )

    return response.choices[0].message.content
```
### Extractive Compression
```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np


def extractive_compress(text: str, ratio: float = 0.3) -> str:
    """Keep the highest-scoring `ratio` of sentences, ranked by summed TF-IDF weight."""
    # Naive sentence split; drop empty fragments so the vectorizer gets real text
    sentences = [s for s in text.split(". ") if s.strip()]
    if not sentences:
        return text

    # Find important sentences with TF-IDF
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences)

    # Importance score of each sentence
    scores = np.array(tfidf_matrix.sum(axis=1)).flatten()

    # Select most important sentences
    num_sentences = max(1, int(len(sentences) * ratio))
    top_indices = np.argsort(scores)[-num_sentences:]
    top_indices = sorted(top_indices)  # Preserve original order

    return ". ".join(sentences[i] for i in top_indices)
```

## Sliding Window

### Conversation History Management

```python
class SlidingWindowMemory:
    """Keep a conversation under a token budget by dropping the oldest turns."""

    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _trim(self):
        while self._total_tokens() > self.max_tokens and len(self.messages) > 2:
            # Preserve the system message, delete the oldest user/assistant message
            if self.messages[0]["role"] == "system":
                self.messages.pop(1)
            else:
                self.messages.pop(0)

    def _total_tokens(self) -> int:
        return sum(count_tokens(m["content"]) for m in self.messages)

    def get_messages(self) -> list:
        return self.messages.copy()
```
### Windowed Document Processing
```python
def process_long_document(document: str, query: str, window_size: int = 8000):
    """Ask the same question of each window, then merge the per-window answers."""
    # window_size is in characters, because semantic_chunk measures length with len()
    chunks = semantic_chunk(document, chunk_size=window_size)
    results = []

    for i, chunk in enumerate(chunks):
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {
                    "role": "system",
                    "content": "Analyze the given text chunk."
                },
                {
                    "role": "user",
                    "content": f"Text:\n{chunk}\n\nQuestion: {query}"
                }
            ]
        )

        results.append({
            "chunk_index": i,
            "response": response.choices[0].message.content
        })

    # Combine results. synthesize_results is not defined in this article; it is
    # assumed to merge the per-chunk responses into one answer to `query`.
    return synthesize_results(results, query)
```
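
`process_long_document` calls `synthesize_results`, which the article does not define. The following is one hypothetical way it could look, not the author's implementation. It assumes each result is a dict with `chunk_index` and `response` keys, as built in the loop above, and it takes the LLM call as a `complete` callable (an assumption made here so the sketch runs without an API key). `build_synthesis_prompt` is likewise an illustrative helper name.

```python
from typing import Callable


def build_synthesis_prompt(results: list[dict], query: str) -> str:
    """Format per-chunk responses into a single prompt for a final LLM call."""
    # Sort by chunk index so the findings appear in document order
    ordered = sorted(results, key=lambda r: r["chunk_index"])
    findings = "\n\n".join(
        f"Chunk {r['chunk_index'] + 1}: {r['response']}" for r in ordered
    )
    return (
        f"Findings from individual chunks:\n{findings}\n\n"
        f"Question: {query}\n\n"
        "Combine these findings into one answer."
    )


def synthesize_results(results: list[dict], query: str,
                       complete: Callable[[str], str]) -> str:
    """Hypothetical helper: send the combined findings through `complete`."""
    if not results:
        return ""  # Nothing to merge
    return complete(build_synthesis_prompt(results, query))


if __name__ == "__main__":
    demo = [
        {"chunk_index": 0, "response": "The contract starts in March."},
        {"chunk_index": 1, "response": "The contract runs for two years."},
    ]
    # A stand-in for a real model call: it simply echoes the prompt it receives
    print(synthesize_results(demo, "How long is the contract?", complete=lambda p: p))
```

To match the two-argument call in `process_long_document`, `complete` could be bound in advance (for example with `functools.partial`) to a function that wraps the chat completion request.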
## Map-Reduce Pattern

### Question Answering over Long Documents
```python
def map_reduce_qa(document: str, question: str) -> str:
    """Answer a question by querying each chunk, then merging the partial answers."""
    chunks = semantic_chunk(document, chunk_size=4000)

    # Map: Analyze each chunk separately
    partial_answers = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {
                    "role": "user",
                    "content": f"Text:\n{chunk}\n\nQuestion: {question}\n\n"
                               "Answer based on this text chunk. "
                               "If no information, say 'No information in this chunk'."
                }
            ]
        )
        partial_answers.append(response.choices[0].message.content)

    # Reduce: Combine answers
    combined = "\n\n".join([
        f"Source {i+1}: {ans}"
        for i, ans in enumerate(partial_answers)
    ])

    final_response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "user",
                "content": f"Information from different sources:\n{combined}\n\n"
                           f"Question: {question}\n\n"
                           "Provide a comprehensive answer by synthesizing all information."
            }
        ]
    )

    return final_response.choices[0].message.content
```

## Retrieval Augmented Context

### Smart Context Selection

```python
def select_relevant_context(query: str, documents: list, max_tokens: int = 4000) -> list:
    """Pick the most relevant documents that fit within `max_tokens`."""
    # Embedding-based relevance. get_embedding and cosine_similarity are not
    # defined in this article; they are assumed helpers supplied elsewhere.
    query_embedding = get_embedding(query)

    scored_docs = []
    for doc in documents:
        doc_embedding = get_embedding(doc["content"])
        score = cosine_similarity(query_embedding, doc_embedding)
        scored_docs.append({"doc": doc, "score": score})

    # Sort by relevance, most relevant first
    scored_docs.sort(key=lambda x: x["score"], reverse=True)

    # Add documents until the token limit is reached
    selected = []
    current_tokens = 0

    for item in scored_docs:
        doc_tokens = count_tokens(item["doc"]["content"])
        if current_tokens + doc_tokens <= max_tokens:
            selected.append(item["doc"])
            current_tokens += doc_tokens
        else:
            break

    return selected
```
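
`select_relevant_context` relies on `get_embedding` and `cosine_similarity`, neither of which is defined in the article. Below is a minimal sketch of the similarity helper only, assuming embeddings are plain sequences of floats. `get_embedding` is left out because it depends on the embedding provider in use, and the toy vectors are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings", invented for this example
query_vec = [1.0, 0.0, 0.0]
doc_vecs = {"doc_a": [0.9, 0.1, 0.0], "doc_b": [0.0, 1.0, 0.0]}

# Rank document names by similarity to the query, most similar first
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)

if __name__ == "__main__":
    print(ranked)
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for ranking embeddings of texts with very different sizes.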
## Best Practices for Long Contexts

### 1. Prompt Placement
```python
def optimize_prompt_position(context: str, query: str) -> str:
    """Put important information at start and end (Lost in the Middle)"""

    chunks = semantic_chunk(context)

    # Preserve first and last chunks verbatim; compress only the middle
    if len(chunks) > 2:
        middle = chunks[1:-1]
        compressed_middle = compress_context(" ".join(middle))
        context = f"{chunks[0]}\n\n{compressed_middle}\n\n{chunks[-1]}"

    return f"Context:\n{context}\n\n---\n\nQuestion: {query}"
```
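
A related way to act on the same "lost in the middle" idea, assuming the chunks have already been ranked by relevance (for example by a retrieval step), is to reorder them so the strongest ones sit at the edges of the prompt and the weakest in the middle. `reorder_for_long_context` is a hypothetical helper name and is not part of the article's code.

```python
def reorder_for_long_context(ranked_chunks: list[str]) -> list[str]:
    """Place the most relevant chunks at the start and end of the list.

    `ranked_chunks` must be sorted from most to least relevant.
    """
    front: list[str] = []
    back: list[str] = []

    for position, chunk in enumerate(ranked_chunks):
        # Alternate: ranks 1, 3, 5, ... fill the front; ranks 2, 4, ... fill the back
        if position % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)

    # Reverse the back half so the second-best chunk ends up last
    return front + back[::-1]


if __name__ == "__main__":
    # "a" is the most relevant chunk, "e" the least relevant
    print(reorder_for_long_context(["a", "b", "c", "d", "e"]))
```

Unlike compressing the middle, this approach keeps every chunk verbatim and only changes the order, so it costs no extra model calls.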
### 2. Hierarchical Processing
```python
def hierarchical_summarize(document: str, levels: int = 2) -> str:
    """Hierarchical summarization: summarize chunks, then summarize the summaries."""

    current_text = document

    for level in range(levels):
        chunks = semantic_chunk(current_text, chunk_size=4000)

        summaries = []
        for chunk in chunks:
            summary = compress_context(chunk, max_tokens=500)
            summaries.append(summary)

        # The joined summaries become the input for the next level
        current_text = "\n\n".join(summaries)

    return current_text
```
### 3. Attention Sinks
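```python
def add_attention_anchors(prompt: str) -> str:
    """Wrap the first and last 500 characters of the prompt in marker tags."""
    # In a short prompt the head and tail slices would overlap and duplicate
    # text, so leave it unchanged
    if len(prompt) <= 1000:
        return prompt

    return f"""
[IMPORTANT START]
{prompt[:500]}
[/IMPORTANT]

{prompt[500:-500]}

[IMPORTANT END]
{prompt[-500:]}
[/IMPORTANT]
"""
```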
## Monitoring and Debugging
```python
from datetime import datetime

import numpy as np


class ContextMonitor:
    """Record token usage per request and warn on unusually large prompts."""

    def __init__(self):
        self.logs = []

    def log_request(self, messages: list, model: str):
        total_tokens = sum(count_tokens(m["content"]) for m in messages)

        self.logs.append({
            "timestamp": datetime.now(),
            "model": model,
            "input_tokens": total_tokens,
            "message_count": len(messages)
        })

        # Alerts
        if total_tokens > 100000:
            print(f"⚠️ High token count: {total_tokens}")

    def get_stats(self):
        # Guard against an empty log: max() would raise on an empty sequence
        if not self.logs:
            return {"avg_tokens": 0, "max_tokens": 0, "total_requests": 0}

        return {
            "avg_tokens": np.mean([entry["input_tokens"] for entry in self.logs]),
            "max_tokens": max(entry["input_tokens"] for entry in self.logs),
            "total_requests": len(self.logs)
        }
```
## Conclusion

Context window management is critical to the scalability and cost of LLM applications. You can work effectively with long documents by combining chunking, compression, and smart retrieval strategies.

At Veni AI, we develop AI solutions for long contexts.
