Lessons Learned from Week 3 of the LLM Zoomcamp: Vector Search and Embeddings
In Week 3 of the LLM Zoomcamp, I had the opportunity to dive deep into the fascinating world of vector search and embeddings. These concepts are crucial for various applications in machine learning and natural language processing, such as information retrieval, recommendation systems, and semantic search. Here’s a detailed account of what I learned this week.
Understanding Embeddings
Embeddings are dense vector representations of text data that capture semantic meaning. They are essential for transforming textual data into a format that machine learning models can process. This week, we focused on creating embeddings using the multi-qa-distilbert-cos-v1 model from the Sentence Transformers library.
Key Takeaways:
- What are Embeddings?: Embeddings convert text into numerical vectors that capture the semantic meaning of the text.
- Types of Embeddings: We explored different types of embeddings, including word embeddings and sentence embeddings.
- Creating Embeddings: We learned how to create embeddings using the Sentence Transformers library.
Here’s a simple example of creating embeddings:
from sentence_transformers import SentenceTransformer

# Load the model
model_name = "multi-qa-distilbert-cos-v1"
embedding_model = SentenceTransformer(model_name)

# Define a sample text
text = "I just discovered the course. Can I still join it?"

# Create the embedding
embedding = embedding_model.encode(text)

# Print the embedding
print(embedding)
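The output is a NumPy array; for this model each vector has 768 dimensions, which you can confirm by printing its shape:

print(embedding.shape)  # (768,)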
Vector Search
Vector search involves finding the most similar vectors to a given query vector. This is achieved using cosine similarity, which measures the cosine of the angle between two vectors. Vector search is a powerful technique used in various applications, including search engines and recommendation systems.
Key Takeaways:
- What is Vector Search?: Vector search is the process of finding the most similar vectors to a given query vector.
- Cosine Similarity: We learned about cosine similarity and how it is used to measure the similarity between vectors (see the sketch after this list).
- Practical Implementation: We implemented vector search using the dot product of the query vector with the document embeddings; since this model produces normalized (unit-length) embeddings, the dot product is equivalent to cosine similarity.
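To make that relationship concrete, here's a minimal sketch of cosine similarity in NumPy (the cosine_similarity helper is just for illustration):

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: the dot product scaled by both vector norms
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(cosine_similarity(a, b))  # ~0.707

For unit-length embeddings both norms are 1, so the dot product alone already equals the cosine similarity.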
Here’s an example of performing vector search:
import numpy as np

# Assume X is the matrix of document embeddings and v is the query vector
scores = X.dot(v)

# Get the indices of the top 5 results (argsort of negated scores sorts descending)
top_indices = np.argsort(-scores)[:5]

# Print the top results
for idx in top_indices:
    print(documents[idx])
Implementing a Vector Search Engine
We took our understanding of vector search a step further by implementing a vector search engine. This involved creating a class that encapsulates the functionality of searching through document embeddings.
Key Takeaways:
- Vector Search Engine: We implemented a vector search engine to efficiently search through document embeddings.
- Class Implementation: We encapsulated the search functionality within a class for better organization and reusability.
Here’s the implementation of the VectorSearchEngine class:
import numpy as np

class VectorSearchEngine:
    def __init__(self, documents, embeddings):
        # Keep the raw documents and their embedding matrix side by side
        self.documents = documents
        self.embeddings = embeddings

    def search(self, v_query, num_results=10):
        # Score every document against the query with a dot product
        scores = self.embeddings.dot(v_query)
        # Sort scores in descending order and keep the top num_results
        idx = np.argsort(-scores)[:num_results]
        return [self.documents[i] for i in idx]
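To put the class to work, here's a usage sketch; it assumes each document is a dict with a text field (that field name is an assumption for illustration):

# Embed every document to build the embedding matrix
embeddings = np.array([embedding_model.encode(doc['text']) for doc in documents])

# Instantiate the engine and run a query
search_engine = VectorSearchEngine(documents=documents, embeddings=embeddings)
v_query = embedding_model.encode("I just discovered the course. Can I still join it?")
results = search_engine.search(v_query, num_results=5)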
Evaluating the Search Engine
To evaluate the performance of our search engine, we used the hit-rate metric. Hit-rate measures the proportion of queries for which the correct document appears among the top k results. For example, if the correct document shows up in the top 5 results for 53 out of 100 queries, the hit-rate is 0.53. This metric is crucial for understanding the effectiveness of our search engine.
Key Takeaways:
- Hit-Rate Metric: We learned about the hit-rate metric and its importance in evaluating search engines.
- Practical Evaluation: We implemented a function to calculate the hit-rate of our vector search engine.
Here’s the code to calculate the hit-rate:
def calculate_hitrate(search_engine, ground_truth, num_results=5):
    hits = 0
    for item in ground_truth:
        query = item['question']
        true_id = item['document']
        # Embed the query and retrieve the top results
        v_query = embedding_model.encode(query)
        results = search_engine.search(v_query, num_results=num_results)
        # Count a hit if the correct document ID appears in the results
        result_ids = [res['id'] for res in results]
        if true_id in result_ids:
            hits += 1
    return hits / len(ground_truth)
# Calculate the hit-rate
hitrate = calculate_hitrate(search_engine, ground_truth, num_results=5)
print(f"Hit-rate: {hitrate:.2f}")
Challenges and Solutions
One of the challenges I faced was ensuring that the document IDs were correctly matched between the ground truth data and the search results. By carefully inspecting the data structure, I was able to resolve this issue and achieve a hit-rate of 0.53.
Key Takeaways:
- Data Inspection: It’s crucial to inspect the data structure to ensure correct matching of document IDs.
- Debugging: Effective debugging techniques can help resolve issues and improve the performance of the search engine.
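A quick sanity check along these lines is to print one record from each collection and confirm the ID fields line up (the field names below are the ones used above; adjust to your data):

print(ground_truth[0])  # expect keys like 'question' and 'document'
print(documents[0])     # expect an 'id' key matching the ground truth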
Conclusion
Week 3 of the LLM Zoomcamp provided valuable insights into vector search and embeddings. These concepts are foundational for many advanced applications in machine learning. I look forward to applying these learnings in future projects and exploring these topics further.
I hope you found this article insightful. Feel free to share your thoughts and experiences in the comments below!