
Semantic Search and Embedding Models Compared

Semantic search systems, a comparison of popular embedding models, benchmark results, and a guide to enterprise search solutions.

Veni AI Technical Team · 2 January 2025 · 5 min read
A Comparative Analysis of Semantic Search and Embedding Models

Semantic search retrieves results by semantic similarity rather than by keyword matching. This guide examines embedding models and semantic search implementations.

Keyword vs Semantic Search

```
Keyword Search:
Query: "cheap phone"
Result: Only documents containing "cheap" and "phone"

Semantic Search:
Query: "cheap phone"
Result: "budget-friendly smart device", "economical smartphone",
        "affordable mobile" and semantically similar documents
```
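The keyword side of this contrast can be reproduced in a few lines. The sketch below is only an illustration: `keyword_match` and the sample documents are hypothetical, and real keyword engines add stemming and ranking on top of exact matching.

```python
def keyword_match(query: str, document: str) -> bool:
    """Return True only if every query word appears verbatim in the document."""
    doc_words = set(document.lower().split())
    return all(word in doc_words for word in query.lower().split())


docs = [
    "cheap phone with a large screen",
    "budget-friendly smart device",
    "economical smartphone",
]

# Only the literal match survives; the two paraphrases are missed.
matches = [d for d in docs if keyword_match("cheap phone", d)]
print(matches)
```

The two paraphrased documents are exactly the ones an embedding-based search is meant to recover.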

Embedding Model Comparison

Popular Models

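| Model | Dimensions | Max Tokens | Turkish Support | MTEB Score |
|---|---|---|---|---|
| text-embedding-3-large | 3072 | 8191 | Good | 64.6 |
| text-embedding-3-small | 1536 | 8191 | Good | 62.3 |
| Cohere embed-v3 | 1024 | 512 | Medium | 64.5 |
| BGE-M3 | 1024 | 8192 | Very Good | 63.2 |
| E5-mistral-7b | 4096 | 32768 | Good | 66.6 |
| mxbai-embed-large | 1024 | 512 | Good | 64.7 |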
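The dimension column translates directly into storage cost. As a rough back-of-the-envelope sketch, assuming vectors are stored as uncompressed float32 and ignoring any index overhead (both are assumptions, and `index_size_gib` is a hypothetical helper):

```python
BYTES_PER_FLOAT32 = 4


def index_size_gib(num_vectors: int, dimensions: int) -> float:
    """Raw vector storage in GiB, assuming float32 and no index overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32 / 1024**3


for name, dims in [
    ("text-embedding-3-large", 3072),
    ("text-embedding-3-small", 1536),
    ("BGE-M3", 1024),
]:
    size = index_size_gib(1_000_000, dims)
    print(f"{name}: {size:.2f} GiB per million vectors")
```

Under these assumptions, halving the dimension halves the raw storage, which is worth weighing against the score differences in the table.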

OpenAI Embedding

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding(text: str, model: str = "text-embedding-3-large") -> list[float]:
    response = client.embeddings.create(
        input=text,
        model=model,
        dimensions=1024,  # Dimension reduction (optional)
    )
    return response.data[0].embedding

# Batch embedding
def get_embeddings_batch(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-large",
        dimensions=1024,  # Must match get_embedding so query and document vectors are comparable
    )
    return [item.embedding for item in response.data]
```
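The `dimensions` argument asks the API for a shorter vector. If full-length vectors are already stored, a comparable result can be approximated by truncating and re-normalizing them locally. The sketch below assumes the vectors come from the text-embedding-3 family, which is designed to tolerate truncation; vectors from other models may not survive it well. `shorten_embedding` is a hypothetical helper, not part of any SDK.

```python
import math


def shorten_embedding(embedding: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` values and rescale to unit length."""
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and len(embedding)")
    truncated = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0.0:
        return truncated  # nothing to rescale
    return [x / norm for x in truncated]


short = shorten_embedding([3.0, 4.0, 12.0], 2)
print(short)
```

Re-normalizing matters because the truncated vector no longer has unit length, and dot-product based indexes would otherwise rank it incorrectly.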

Cohere Embedding

```python
import cohere

co = cohere.Client("api-key")  # placeholder; load the real key from configuration

def get_cohere_embedding(texts: list[str], input_type: str = "search_document"):
    response = co.embed(
        texts=texts,
        model="embed-multilingual-v3.0",
        input_type=input_type,  # "search_document" when indexing, "search_query" for queries
    )
    return response.embeddings
```

Sentence Transformers (Local)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

# Single text
embedding = model.encode("Hello world")

# Batch
embeddings = model.encode([
    "First text",
    "Second text",
    "Third text"
], batch_size=32, show_progress_bar=True)
```

Implementing Semantic Search

Simple Cosine Similarity

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def search(query: str, documents: list[dict], top_k: int = 5):
    # Each document is expected to carry a precomputed "embedding"
    query_embedding = get_embedding(query)

    results = []
    for doc in documents:
        similarity = cosine_similarity(query_embedding, doc["embedding"])
        results.append({
            "document": doc,
            "score": similarity
        })

    # Highest similarity first
    results.sort(key=lambda x: x["score"], reverse=True)
    return results[:top_k]
```
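The loop above recomputes both norms for every document on every query and sorts the full result list. One possible refinement, sketched here in plain Python with hypothetical helper names, is to normalize document vectors once at indexing time so that cosine similarity reduces to a dot product, and to use `heapq.nlargest` to keep only the top results.

```python
import heapq
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0.0 else [x / norm for x in vec]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k_normalized(query_vec, unit_doc_vecs, k):
    """Return (score, doc_index) pairs for the k most similar documents.

    Assumes every vector in unit_doc_vecs is already unit length.
    """
    q = normalize(query_vec)
    scored = ((dot(q, d), i) for i, d in enumerate(unit_doc_vecs))
    return heapq.nlargest(k, scored)


unit_docs = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
top = top_k_normalized([2.0, 0.0], unit_docs, k=2)
print(top)
```

For larger collections the same idea is usually delegated to a vector database or an approximate nearest-neighbor index rather than a linear scan.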

Vector Database Search

```python
from pinecone import Pinecone

pc = Pinecone(api_key="xxx")  # placeholder key
index = pc.Index("semantic-search")

def semantic_search(query: str, top_k: int = 10, filter: dict | None = None):
    query_embedding = get_embedding(query)

    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        filter=filter,  # optional metadata filter
        include_metadata=True
    )

    return [
        {
            "id": match.id,
            "score": match.score,
            "metadata": match.metadata
        }
        for match in results.matches
    ]
```

Hybrid Search

Combining keyword and semantic search:

```python
import numpy as np
from rank_bm25 import BM25Okapi

class HybridSearch:
    def __init__(self, documents):
        self.documents = documents

        # BM25 index over whitespace-tokenized text
        tokenized = [doc["text"].split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

        # Embeddings (uses get_embeddings_batch from the OpenAI section)
        texts = [doc["text"] for doc in documents]
        self.embeddings = get_embeddings_batch(texts)

    def search(self, query: str, top_k: int = 10, alpha: float = 0.5):
        # BM25 scores, min-max normalized to [0, 1]
        bm25_scores = self.bm25.get_scores(query.split())
        bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-6)

        # Semantic scores
        query_emb = get_embedding(query)
        semantic_scores = [
            cosine_similarity(query_emb, emb)
            for emb in self.embeddings
        ]
        semantic_scores = np.array(semantic_scores)

        # Hybrid score: alpha weights the semantic side, (1 - alpha) the keyword side
        hybrid_scores = alpha * semantic_scores + (1 - alpha) * bm25_scores

        # Sort descending and return the top_k
        top_indices = np.argsort(hybrid_scores)[::-1][:top_k]

        return [
            {
                "document": self.documents[i],
                "score": hybrid_scores[i],
                "bm25_score": bm25_scores[i],
                "semantic_score": semantic_scores[i]
            }
            for i in top_indices
        ]
```
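The effect of `alpha` in the hybrid score is easiest to see with small numbers. The following is a plain-Python sketch of the same blend (min-max normalized BM25 plus weighted semantic score); the helper names and the sample scores are made up for illustration.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def blend(bm25_scores, semantic_scores, alpha):
    """alpha weights the semantic score, (1 - alpha) the normalized BM25 score."""
    bm25_norm = min_max(bm25_scores)
    return [alpha * s + (1 - alpha) * b for s, b in zip(semantic_scores, bm25_norm)]


def best_index(scores: list[float]) -> int:
    return max(range(len(scores)), key=scores.__getitem__)


bm25 = [0.0, 2.0, 4.0]      # document 2 is the best keyword match
semantic = [0.9, 0.7, 0.1]  # document 0 is the best semantic match

for alpha in (0.0, 0.5, 1.0):
    print(alpha, best_index(blend(bm25, semantic, alpha)))
```

With these numbers, `alpha=0.0` picks the keyword winner, `alpha=1.0` picks the semantic winner, and `alpha=0.5` favors the document that does reasonably well on both, which is the behavior a hybrid ranker is after.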

Reranking

Re-score the initial results to obtain a more accurate ordering:

```python
import cohere

co = cohere.Client("api-key")  # placeholder key

def rerank_results(query: str, documents: list[str], top_k: int = 10):
    response = co.rerank(
        query=query,
        documents=documents,
        model="rerank-multilingual-v3.0",
        top_n=top_k
    )

    return [
        {
            "index": result.index,  # position in the input `documents` list
            "text": documents[result.index],
            "score": result.relevance_score
        }
        for result in response.results
    ]

# Pipeline: Retrieve → Rerank
def search_with_rerank(query: str, top_k: int = 5):
    # Step 1: Get more candidates than needed
    candidates = semantic_search(query, top_k=top_k * 3)

    # Step 2: Rerank (assumes each match stores its text under metadata["text"])
    docs = [c["metadata"]["text"] for c in candidates]
    reranked = rerank_results(query, docs, top_k=top_k)

    return reranked
```
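The pipeline above returns only the reranked text and score, so the original document ids are lost. Since each reranked entry carries the `index` of its candidate, the ids can be reattached afterwards. A minimal sketch, assuming the dictionary shapes returned by `semantic_search` and `rerank_results` (the helper `attach_candidate_ids` is hypothetical):

```python
def attach_candidate_ids(candidates: list[dict], reranked: list[dict]) -> list[dict]:
    """Join reranked entries back to the candidates they came from."""
    return [
        {
            "id": candidates[r["index"]]["id"],
            "text": r["text"],
            "rerank_score": r["score"],
            "retrieval_score": candidates[r["index"]]["score"],
        }
        for r in reranked
    ]


# Hand-written stand-ins for the two stages' outputs
candidates = [
    {"id": "doc-a", "score": 0.81, "metadata": {"text": "first"}},
    {"id": "doc-b", "score": 0.79, "metadata": {"text": "second"}},
]
reranked = [
    {"index": 1, "text": "second", "score": 0.97},
    {"index": 0, "text": "first", "score": 0.42},
]

final = attach_candidate_ids(candidates, reranked)
print([hit["id"] for hit in final])
```

Keeping both scores also makes it easier to inspect cases where the reranker disagrees with the first-stage retrieval.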

Query Understanding

Query Expansion

```python
def expand_query(query: str) -> list[str]:
    """Query expansion with an LLM (reuses the OpenAI `client` defined earlier)."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": "Generate 3 different variations of the given search query."
            },
            {"role": "user", "content": query}
        ]
    )

    # One variation per line; drop blank lines the model may emit
    lines = response.choices[0].message.content.split("\n")
    variations = [line.strip() for line in lines if line.strip()]
    return [query] + variations
```
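Expanded queries are only useful once their result lists are combined. One simple option, sketched here under the assumption that each hit is a dict with `id` and `score` as returned by a vector search, is to keep the best score per document. `merge_results` is a hypothetical helper, and other fusion strategies are possible.

```python
def merge_results(result_lists: list[list[dict]], top_k: int) -> list[dict]:
    """Deduplicate hits across query variations, keeping the highest score per id."""
    best: dict[str, dict] = {}
    for results in result_lists:
        for hit in results:
            current = best.get(hit["id"])
            if current is None or hit["score"] > current["score"]:
                best[hit["id"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:top_k]


# Stand-in results for two query variations
per_query = [
    [{"id": "a", "score": 0.70}, {"id": "b", "score": 0.60}],
    [{"id": "b", "score": 0.85}, {"id": "c", "score": 0.50}],
]

merged = merge_results(per_query, top_k=2)
print([(hit["id"], hit["score"]) for hit in merged])
```

Taking the maximum favors documents that match any one phrasing strongly; averaging would instead favor documents that match all phrasings moderately.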

HyDE (Hypothetical Document Embeddings)

```python
def hyde_search(query: str, top_k: int = 5):
    """Generate a hypothetical document and search with its embedding."""

    # Generate hypothetical document
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": "Write a paragraph that answers this question."
            },
            {"role": "user", "content": query}
        ]
    )

    hypothetical_doc = response.choices[0].message.content

    # Embed the hypothetical document
    hyde_embedding = get_embedding(hypothetical_doc)

    # Search with this embedding (uses the Pinecone `index` from Vector Database Search)
    results = index.query(vector=hyde_embedding, top_k=top_k, include_metadata=True)
    return results.matches
```

Evaluation Metrics

Retrieval Metrics

```python
def calculate_metrics(retrieved: list, relevant: list, k: int) -> dict:
    """Calculate Precision@K, Recall@K, and the reciprocal rank for one query."""
    relevant_set = set(relevant)

    # Precision@K: share of the top-k results that are relevant
    retrieved_k = retrieved[:k]
    relevant_in_k = len(set(retrieved_k) & relevant_set)
    precision_k = relevant_in_k / k

    # Recall@K: share of all relevant documents found in the top k
    recall_k = relevant_in_k / len(relevant_set) if relevant_set else 0.0

    # Reciprocal rank of the first relevant hit; averaging it over queries gives MRR
    mrr = 0.0
    for i, doc in enumerate(retrieved):
        if doc in relevant_set:
            mrr = 1 / (i + 1)
            break

    return {
        "precision@k": precision_k,
        "recall@k": recall_k,
        "mrr": mrr
    }
```
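The `mrr` value computed above is the reciprocal rank for a single query. Mean Reciprocal Rank is the average of that value over an evaluation set. A small sketch with made-up document ids (the function names are illustrative):

```python
def reciprocal_rank(retrieved: list[str], relevant: list[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], list[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


runs = [
    (["a", "b", "c"], ["b"]),  # first relevant hit at rank 2 -> 0.5
    (["x", "y"], ["x"]),       # rank 1 -> 1.0
    (["m", "n"], ["z"]),       # no relevant hit -> 0.0
]

print(mean_reciprocal_rank(runs))  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```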

Production Optimizations

Embedding Cache

```python
import hashlib
import json

import redis

redis_client = redis.Redis()

def get_embedding_cached(text: str, model: str = "text-embedding-3-large"):
    # The key includes the model name so vectors from different models never mix
    cache_key = f"emb:{model}:{hashlib.md5(text.encode()).hexdigest()}"

    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    embedding = get_embedding(text, model)
    redis_client.setex(cache_key, 86400, json.dumps(embedding))  # 24h TTL

    return embedding
```
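The caching logic can be exercised without a Redis server by swapping in a dictionary. The sketch below is an in-memory stand-in, not the production cache: `make_cache_key`, `InMemoryEmbeddingCache`, and the fake embedding function are hypothetical, and the whitespace normalization in the key assumes that differences in spacing do not matter for your texts.

```python
import hashlib


def make_cache_key(text: str, model: str) -> str:
    """Build a key from the model name and a hash of whitespace-normalized text."""
    normalized = " ".join(text.split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"emb:{model}:{digest}"


class InMemoryEmbeddingCache:
    """Dictionary-backed cache that counts how often the embed function runs."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.misses = 0

    def get(self, text: str, model: str) -> list[float]:
        key = make_cache_key(text, model)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text, model)
        return self._store[key]


def fake_embed(text: str, model: str) -> list[float]:
    return [float(len(text))]  # placeholder for a real API call


cache = InMemoryEmbeddingCache(fake_embed)
cache.get("hello world", "model-a")
cache.get("hello   world", "model-a")  # same key after normalization
cache.get("hello world", "model-b")    # different model, different key
print(cache.misses)
```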

Batch Processing

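```python
import asyncio

from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def async_get_embeddings(texts: list[str]) -> list[list[float]]:
    # Assumed helper (not defined in the original article): async batch embedding call
    response = await async_client.embeddings.create(
        input=texts, model="text-embedding-3-large", dimensions=1024
    )
    return [item.embedding for item in response.data]

async def process_documents_async(documents: list[dict], batch_size: int = 100):
    """Async batch embedding"""

    async def process_batch(batch):
        texts = [doc["text"] for doc in batch]
        embeddings = await async_get_embeddings(texts)

        # Attach each embedding to its document in place
        for doc, emb in zip(batch, embeddings):
            doc["embedding"] = emb

        return batch

    tasks = []
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        tasks.append(process_batch(batch))

    # Run all batches concurrently; gather preserves batch order
    results = await asyncio.gather(*tasks)
    return [doc for batch in results for doc in batch]
```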
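Launching every batch at once can run into provider rate limits on large collections. One way to bound the number of in-flight requests is an `asyncio.Semaphore`. The sketch below is a variation on the batching idea, not the article's implementation: `embed_in_batches`, `max_concurrency`, and `fake_embed_batch` are hypothetical names, and the fake function stands in for a real API call.

```python
import asyncio


async def embed_in_batches(texts, embed_batch, batch_size=100, max_concurrency=4):
    """Embed texts in batches, with at most max_concurrency batches in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(batch):
        async with semaphore:  # wait for a free slot before calling out
            return await embed_batch(batch)

    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    results = await asyncio.gather(*(run(b) for b in batches))
    # gather returns results in submission order, so outputs align with inputs
    return [emb for batch in results for emb in batch]


async def fake_embed_batch(batch):
    await asyncio.sleep(0)  # placeholder for network latency
    return [[float(len(text))] for text in batch]


embeddings = asyncio.run(
    embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2, max_concurrency=1)
)
print(embeddings)
```

A suitable `max_concurrency` depends on the provider's rate limits and is best found by measurement.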

Conclusion

Semantic search is a powerful technology that significantly improves the user experience. With the right choice of embedding model, hybrid search, and reranking, you can build high-quality search systems.

At Veni AI, we develop enterprise semantic search solutions.