Semantic Search Engine
Neural search engine using transformer-based embeddings for understanding query intent and delivering contextually relevant results.
Overview
Led the development of a semantic search engine that uses BERT-based SentenceTransformers to embed movie subtitles and user queries as vectors, with the goal of improving retrieval accuracy over keyword matching on a corpus of 30,000 subtitles. Subtitles are preprocessed and split into chunks before they are embedded. Queries are matched to subtitle chunks by the cosine similarity between the query vector and the stored document embeddings. ChromaDB stores the embeddings and handles the similarity search. Because matching is based on meaning rather than on shared keywords, users can find relevant content even when their wording differs from the subtitle text.
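As a small illustration of the cosine similarity matching described above, the sketch below computes the score for toy three-dimensional vectors. The vectors and the zero-vector convention are assumptions made for the example; real SentenceTransformers embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (illustrative values only).
query = [1.0, 0.0, 1.0]
close_doc = [2.0, 0.0, 2.0]  # same direction as the query, different length
far_doc = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query, close_doc))  # close to 1: same direction
print(cosine_similarity(query, far_doc))    # 0: no overlap in direction
```

The score depends only on direction, not on vector length, which is why a long subtitle chunk and a short query can still score as a close match.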
Key Highlights
BERT-based SentenceTransformers for vectorizing movie subtitles and user queries
Improved retrieval accuracy for 30,000 subtitles using semantic understanding
Comprehensive data preprocessing and document chunking for optimized search performance
Cosine similarity for accurate matches between query vectors and document embeddings
ChromaDB for storing embeddings, with fast retrieval and efficient similarity search
Transforms keyword-based search into semantic understanding for better relevance
Finds relevant content even when exact keywords don't match
Practical application of transformer models and vector databases in search systems
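The preprocessing and chunking steps listed above could look roughly like the sketch below. It assumes the subtitles are in SRT format and uses a fixed word window with overlap; the function names, the window size, and the overlap are hypothetical choices, not details taken from the project.

```python
import re

# Assumed SRT timestamp line, e.g. "00:00:01,000 --> 00:00:03,000".
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")
TAG = re.compile(r"<[^>]+>")  # inline markup such as <i>...</i>

def clean_subtitle(raw):
    """Drop cue numbers, timestamps and markup; return plain dialogue text."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        # Note: a dialogue line made only of digits is dropped as well.
        if not line or line.isdigit() or TIMESTAMP.search(line):
            continue
        kept.append(TAG.sub("", line))
    return re.sub(r"\s+", " ", " ".join(kept)).strip()

def chunk_words(text, size=200, overlap=50):
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

raw = "1\n00:00:01,000 --> 00:00:03,000\n<i>Hello there.</i>\n\n2\n00:00:04,000 --> 00:00:06,000\nGeneral Kenobi!\n"
print(clean_subtitle(raw))
print(chunk_words("a b c d e f g", size=4, overlap=2))
```

Overlapping windows keep a line of dialogue that straddles a chunk boundary searchable from either side, at the cost of storing some text twice.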
🏗️ System Architecture
System Components
Query Input Interface
User interface for search queries
Document Processor
Preprocesses and chunks 30K subtitles
BERT Embedder
SentenceTransformers for vectorization
Vector Database
ChromaDB for storing embeddings
Similarity Calculator
Cosine similarity for matching
Results Ranker
Ranks and returns top results
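The embedder and vector database components above can be wired together roughly as follows. This is a minimal sketch, not the project's actual code: the model name "all-MiniLM-L6-v2", the collection name, the in-memory client, and the helper names build_collection and search are all assumptions. It needs the sentence-transformers and chromadb packages when the functions are called.

```python
def build_collection(chunks, model_name="all-MiniLM-L6-v2", collection_name="subtitles"):
    """Embed subtitle chunks and store them in a ChromaDB collection."""
    # Imported here so the sketch can be loaded without the packages installed.
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # assumed model; the source names none
    client = chromadb.Client()               # in-memory client, assumed for brevity
    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},   # ask the index to use cosine distance
    )
    embeddings = model.encode(chunks).tolist()
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
    )
    return model, collection

def search(model, collection, query, k=5):
    """Return up to k (chunk_text, distance) pairs, closest first."""
    query_embedding = model.encode([query]).tolist()
    result = collection.query(query_embeddings=query_embedding, n_results=k)
    # Results are nested per query; this sketch sends a single query.
    return list(zip(result["documents"][0], result["distances"][0]))
```

With the cosine space setting, ChromaDB reports a distance rather than a similarity, so smaller values mean closer matches.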
Data Flow
Preprocessed subtitle chunks
30K processed documents
Generate and store embeddings
Vector embeddings
User search query
Query text
Query embedding
Query vector
Retrieve stored embeddings
Document vectors
Calculate cosine similarity scores
Similarity scores
Return ranked results
Top matching subtitles
Architecture Flow
Query Input Interface
Document Processor
BERT Embedder
Vector Database
Similarity Calculator
Results Ranker
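The final stage of this flow, the Results Ranker, takes similarity scores and returns the best matches. A minimal sketch of that step is shown below; the function name top_k, the optional min_score cutoff, and the sample scores are hypothetical and only illustrate the idea.

```python
import heapq

def top_k(scores, k=5, min_score=None):
    """Return the k (doc_id, score) pairs with the highest score, best first."""
    items = list(scores.items())
    if min_score is not None:
        # Optional cutoff so weak matches are not shown at all.
        items = [(doc_id, s) for doc_id, s in items if s >= min_score]
    return heapq.nlargest(k, items, key=lambda pair: pair[1])

# Illustrative similarity scores keyed by a made-up chunk id.
scores = {"chunk-a": 0.2, "chunk-b": 0.9, "chunk-c": 0.5}
print(top_k(scores, k=2))
```

Using heapq.nlargest avoids sorting the full score list when only a handful of results are returned.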