# Context Window Management and Long-Context Strategies

The context window is the maximum number of tokens an LLM can process at once. Effective context management directly affects the performance of AI applications.
## Context Window Limits

### Model Comparison
| Model | Context Length (tokens) | Approx. Words |
|---|---|---|
| GPT-3.5 Turbo | 16K | 12,000 |
| GPT-4 Turbo | 128K | 96,000 |
| Claude 3 Opus | 200K | 150,000 |
| Gemini 1.5 Pro | 1M | 750,000 |
| Llama 3 | 8K-128K | 6,000-96,000 |
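The word counts in the table are consistent with a common rule of thumb of roughly 0.75 English words per token. A minimal sketch of that conversion follows. `WORDS_PER_TOKEN` and `approx_words` are illustrative names, not part of any library, and the ratio is only an approximation that varies with language and tokenizer.

```python
WORDS_PER_TOKEN = 0.75  # assumption: rough average for English text

def approx_words(context_tokens: int) -> int:
    """Estimate how many English words fit in a context window."""
    return int(context_tokens * WORDS_PER_TOKEN)

for name, tokens in [("GPT-3.5 Turbo", 16_000), ("Claude 3 Opus", 200_000)]:
    print(f"{name}: ~{approx_words(tokens):,} words")
```

Text in other languages, source code, and heavily formatted data usually tokenize less efficiently, so treat these figures as upper bounds for planning.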
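### Token Counting

```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens exactly, using the tokenizer that matches the model."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a general-purpose encoding
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

def estimate_tokens(text: str) -> int:
    # Quick estimate: ~4 characters per token (English text)
    return len(text) // 4
```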
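Token counts are most useful when compared against a budget. The sketch below checks whether a prompt fits in a model's window while leaving room for the reply. It uses the ~4 characters per token heuristic so it runs without a tokenizer. `fits_in_window`, `context_limit`, and `reserved_output` are hypothetical names, and the default values are illustrative only.

```python
def estimate_tokens(text: str) -> int:
    # Same rough heuristic as above: ~4 characters per token for English
    return len(text) // 4

def fits_in_window(prompt: str, context_limit: int = 16_000,
                   reserved_output: int = 1_000) -> bool:
    """Return True if the prompt leaves at least `reserved_output` tokens for the reply."""
    return estimate_tokens(prompt) + reserved_output <= context_limit

print(fits_in_window("word " * 1_000))    # 5,000 chars, about 1,250 tokens
print(fits_in_window("word " * 20_000))   # 100,000 chars, about 25,000 tokens
```

For anything close to the limit, swapping the estimate for an exact tokenizer count is the safer choice, since the heuristic can be off by a wide margin on code or non-English text.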
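## Chunking Strategies

### Fixed-Size Chunking

```python
def fixed_size_chunk(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # Sizes are in characters. The window must advance on every step,
    # otherwise the loop below would never terminate.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # this window already reaches the end; avoid a redundant tail chunk
        start = end - overlap  # step back so consecutive chunks share `overlap` characters

    return chunks
```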
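### Semantic Chunking

```python
# Newer LangChain releases expose this class from the `langchain_text_splitters` package
from langchain.text_splitter import RecursiveCharacterTextSplitter

def semantic_chunk(text: str, chunk_size: int = 1000) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=200,
        # Try paragraph breaks first, then lines, sentences, words, and finally characters
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len,  # sizes are measured in characters, not tokens
    )

    return splitter.split_text(text)
```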
### Document-Structure-Aware Chunking

```python
def structure_aware_chunk(document: str) -> list[dict]:
    chunks = []
    current_section = ""
    current_header = ""

    for line in document.split("\n"):
        # Header detection: a Markdown heading starts a new section
        if line.startswith("#"):
            if current_section.strip():
                chunks.append({
                    "header": current_header,
                    "content": current_section.strip()
                })
            current_header = line
            current_section = ""
        else:
            current_section += line + "\n"

    # Flush the last section (text before the first heading gets an empty header)
    if current_section.strip():
        chunks.append({
            "header": current_header,
            "content": current_section.strip()
        })

    return chunks
```
## Context Compression

### Summarization

```python
from openai import OpenAI

# Assumption: `client` is an OpenAI SDK (v1+) client; it reads OPENAI_API_KEY from the environment
client = OpenAI()

def compress_context(text: str, max_tokens: int = 2000) -> str:
    current_tokens = count_tokens(text)

    # Already within budget: nothing to compress
    if current_tokens <= max_tokens:
        return text

    # Summarize with an LLM
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": f"Summarize the following text under {max_tokens} tokens. "
                           "Preserve important information."
            },
            {"role": "user", "content": text}
        ]
    )

    return response.choices[0].message.content
```
### Extractive Compression

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def extractive_compress(text: str, ratio: float = 0.3) -> str:
    # Naive sentence split; drop empty fragments so the vectorizer gets real text
    sentences = [s for s in text.split(". ") if s.strip()]
    if not sentences:
        return text

    # Find important sentences with TF-IDF
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences)

    # Importance score of each sentence: the sum of its TF-IDF weights
    scores = np.array(tfidf_matrix.sum(axis=1)).flatten()

    # Select the most important sentences
    num_sentences = max(1, int(len(sentences) * ratio))
    top_indices = np.argsort(scores)[-num_sentences:]
    top_indices = sorted(top_indices)  # Preserve the original order

    return ". ".join(sentences[i] for i in top_indices)
```

## Sliding Window

### Conversation History Management

```python
class SlidingWindowMemory:
    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _trim(self):
        while self._total_tokens() > self.max_tokens and len(self.messages) > 2:
            # Preserve the system message; delete the oldest user/assistant message
            if self.messages[0]["role"] == "system":
                self.messages.pop(1)
            else:
                self.messages.pop(0)

    def _total_tokens(self) -> int:
        return sum(count_tokens(m["content"]) for m in self.messages)

    def get_messages(self) -> list:
        return self.messages.copy()
```
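The trimming rule can also be written as a pure function, which makes it easy to test without a tokenizer. This is a sketch, not the class above: `trim_history` and its `count` parameter are hypothetical, and the default counter is the rough 4-characters-per-token estimate rather than an exact count.

```python
from typing import Callable

Message = dict[str, str]

def trim_history(messages: list[Message], max_tokens: int,
                 count: Callable[[str], int] = lambda s: len(s) // 4) -> list[Message]:
    """Drop the oldest non-system messages until the history fits the budget."""
    kept = list(messages)  # work on a copy so the caller's list is untouched

    def total() -> int:
        return sum(count(m["content"]) for m in kept)

    while total() > max_tokens and len(kept) > 2:
        # Index 0 is skipped when it holds the system message
        oldest = 1 if kept[0]["role"] == "system" else 0
        kept.pop(oldest)
    return kept

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "x" * 400},       # ~100 tokens
    {"role": "assistant", "content": "y" * 400},  # ~100 tokens
    {"role": "user", "content": "z" * 400},       # ~100 tokens
]
trimmed = trim_history(history, max_tokens=150)
print([m["role"] for m in trimmed])  # ['system', 'user']
```

Passing the counter as an argument lets the same logic run with an exact tokenizer in production and a cheap estimate in tests.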
### Document Processing Window

```python
def process_long_document(document: str, query: str, window_size: int = 8000):
    # window_size is in characters, because semantic_chunk measures length with len()
    chunks = semantic_chunk(document, chunk_size=window_size)
    results = []

    for i, chunk in enumerate(chunks):
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {
                    "role": "system",
                    "content": "Analyze the given text chunk."
                },
                {
                    "role": "user",
                    "content": f"Text:\n{chunk}\n\nQuestion: {query}"
                }
            ]
        )

        results.append({
            "chunk_index": i,
            "response": response.choices[0].message.content
        })

    # Combine results. `synthesize_results` is not defined in this article;
    # it is assumed to merge the per-chunk answers into one final answer.
    return synthesize_results(results, query)
```
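`synthesize_results` is called above but never defined in this article. One possible shape for it, offered only as an assumption: join the per-chunk answers into a single prompt and ask a model for the final answer. To keep the sketch runnable offline, the model call is passed in as `complete`, a hypothetical callable that takes a prompt string and returns text.

```python
from typing import Callable

def synthesize_results(results: list[dict], query: str,
                       complete: Callable[[str], str]) -> str:
    """Merge per-chunk answers into one final answer via a single model call."""
    # Keep chunk order so the model sees findings in document order
    ordered = sorted(results, key=lambda r: r["chunk_index"])
    findings = "\n\n".join(
        f"Chunk {r['chunk_index'] + 1}: {r['response']}" for r in ordered
    )
    prompt = (
        f"Findings from each part of the document:\n{findings}\n\n"
        f"Question: {query}\n\n"
        "Combine the findings into one answer and ignore parts with no relevant information."
    )
    return complete(prompt)

# Stand-in for a real model call, used here only to show the prompt that gets built
def echo(prompt: str) -> str:
    return prompt

print(synthesize_results(
    [{"chunk_index": 1, "response": "B"}, {"chunk_index": 0, "response": "A"}],
    "What changed?",
    complete=echo,
))
```

To match the two-argument call used above, `complete` could be bound once with `functools.partial` or replaced by a thin wrapper around the chat completion request.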
## Map-Reduce Pattern

### QA over Long Documents

```python
def map_reduce_qa(document: str, question: str):
    chunks = semantic_chunk(document, chunk_size=4000)

    # Map: analyze each chunk separately
    partial_answers = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[
                {
                    "role": "user",
                    "content": f"Text:\n{chunk}\n\nQuestion: {question}\n\n"
                               "Answer based on this text chunk. "
                               "If no information, say 'No information in this chunk'."
                }
            ]
        )
        partial_answers.append(response.choices[0].message.content)

    # Reduce: combine the partial answers
    combined = "\n\n".join([
        f"Source {i+1}: {ans}"
        for i, ans in enumerate(partial_answers)
    ])

    final_response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "user",
                "content": f"Information from different sources:\n{combined}\n\n"
                           f"Question: {question}\n\n"
                           "Provide a comprehensive answer by synthesizing all information."
            }
        ]
    )

    return final_response.choices[0].message.content
```

## Context-Augmented Retrieval

### Intelligent Context Selection

```python
def select_relevant_context(query: str, documents: list, max_tokens: int = 4000):
    # Embedding-based relevance. `get_embedding` and `cosine_similarity` are not
    # defined in this article; they are assumed to be provided elsewhere.
    query_embedding = get_embedding(query)

    scored_docs = []
    for doc in documents:
        doc_embedding = get_embedding(doc["content"])
        score = cosine_similarity(query_embedding, doc_embedding)
        scored_docs.append({"doc": doc, "score": score})

    # Sort by relevance, most relevant first
    scored_docs.sort(key=lambda x: x["score"], reverse=True)

    # Add documents until the token limit is reached
    selected = []
    current_tokens = 0

    for item in scored_docs:
        doc_tokens = count_tokens(item["doc"]["content"])
        if current_tokens + doc_tokens <= max_tokens:
            selected.append(item["doc"])
            current_tokens += doc_tokens
        else:
            break  # stop at the first document that does not fit

    return selected
```
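`select_relevant_context` relies on two helpers that the article does not define. `get_embedding` would wrap whichever embeddings API you use. `cosine_similarity` is plain vector math and can be sketched in pure Python. The version below is an assumption about what that helper does, not the author's code, and the zero-vector guard is an added design choice.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

With many documents, computing each document embedding once and caching it avoids repeating the embedding call on every query.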
## Best Practices for Long Contexts

### 1. Prompt Positioning

```python
def optimize_prompt_position(context: str, query: str) -> str:
    """Put important information at the start and end (Lost in the Middle)."""

    chunks = semantic_chunk(context)

    # Preserve the first and last chunks verbatim; compress everything in between
    if len(chunks) > 2:
        middle = chunks[1:-1]
        compressed_middle = compress_context(" ".join(middle))
        context = f"{chunks[0]}\n\n{compressed_middle}\n\n{chunks[-1]}"

    return f"Context:\n{context}\n\n---\n\nQuestion: {query}"
```
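When chunks are already ranked by relevance (for example by an embedding score), an alternative to compressing the middle is to reorder them so the strongest ones sit at the two ends of the context. The sketch below shows one such reordering; it is a different technique from the function above. `reorder_for_edges` is a hypothetical name, and the input is assumed to be sorted from most to least relevant.

```python
def reorder_for_edges(ranked: list[str]) -> list[str]:
    """Place the most relevant items at the start and end, the least relevant in the middle."""
    front = ranked[0::2]        # ranks 1, 3, 5, ... fill the context from the start
    back = ranked[1::2][::-1]   # ranks 2, 4, 6, ... fill it from the end, reversed
    return front + back

chunks_by_relevance = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(reorder_for_edges(chunks_by_relevance))
# ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```

The top two chunks end up first and last, while the weakest chunk lands in the middle, where models tend to pay the least attention.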
### 2. Hierarchical Processing

```python
def hierarchical_summarize(document: str, levels: int = 2):
    """Summarize in passes: each level summarizes the summaries of the previous one."""

    current_text = document

    for _ in range(levels):
        chunks = semantic_chunk(current_text, chunk_size=4000)

        summaries = []
        for chunk in chunks:
            # compress_context returns the chunk unchanged if it is already short enough
            summary = compress_context(chunk, max_tokens=500)
            summaries.append(summary)

        current_text = "\n\n".join(summaries)

    return current_text
```
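### 3. Attention Sinks

```python
def add_attention_anchors(prompt: str) -> str:
    """Wrap the first and last 500 characters in attention anchors."""

    # With 1,000 characters or fewer the slices below would overlap and duplicate text
    if len(prompt) <= 1000:
        return prompt

    return f"""
[IMPORTANT START]
{prompt[:500]}
[/IMPORTANT]

{prompt[500:-500]}

[IMPORTANT END]
{prompt[-500:]}
[/IMPORTANT]
"""
```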
## Monitoring and Debugging

```python
from datetime import datetime

import numpy as np

class ContextMonitor:
    def __init__(self):
        self.logs = []

    def log_request(self, messages: list, model: str):
        total_tokens = sum(count_tokens(m["content"]) for m in messages)

        self.logs.append({
            "timestamp": datetime.now(),
            "model": model,
            "input_tokens": total_tokens,
            "message_count": len(messages)
        })

        # Alerts
        if total_tokens > 100000:
            print(f"⚠️ High token count: {total_tokens}")

    def get_stats(self):
        # Avoid max() and mean() on an empty list before any request is logged
        if not self.logs:
            return {"avg_tokens": 0, "max_tokens": 0, "total_requests": 0}

        return {
            "avg_tokens": np.mean([l["input_tokens"] for l in self.logs]),
            "max_tokens": max(l["input_tokens"] for l in self.logs),
            "total_requests": len(self.logs)
        }
```
## Conclusion

Context window management is fundamental to the scalability and cost of LLM applications. Chunking, compression, and smart retrieval strategies make it possible to work efficiently with long documents.

At Veni AI, we develop AI solutions for long contexts.
