
Semantic Search Engine

Neural search engine using transformer-based embeddings for understanding query intent and delivering contextually relevant results.

Semantic Search, NLP, Transformers, Vector Search, BERT, Embeddings

Overview

Led the development of a semantic search engine that uses BERT-based SentenceTransformers to embed movie subtitles and user queries in the same vector space, improving retrieval accuracy over keyword matching across a corpus of 30,000 subtitles. Subtitles are preprocessed and split into chunks before embedding, and each query is matched to chunks by the cosine similarity between the query vector and the stored document embeddings. The embeddings are stored in ChromaDB, which handles the similarity search at retrieval time. Because matching is based on meaning rather than exact terms, users can find relevant content even when their query shares no keywords with it.

Key Highlights

BERT-based SentenceTransformers for vectorizing movie subtitles and user queries

Improved retrieval accuracy for 30,000 subtitles using semantic understanding

Comprehensive data preprocessing and document chunking for optimized search performance

Cosine similarity for accurate matches between query vectors and document embeddings

ChromaDB embeddings management for fast data retrieval and efficient similarity search

Transforms keyword-based search into semantic understanding for better relevance

Finds relevant content even when exact keywords don't match

Practical application of transformer models and vector databases in search systems

Tech Stack

Python, BERT, SentenceTransformers, ChromaDB, Cosine Similarity, NLP

🏗️ System Architecture

System Components

Query Input Interface

User interface for search queries

Python, Web Interface

Document Processor

Preprocesses and chunks 30K subtitles

Python, NLP, Text Processing
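
The preprocessing and chunking step could look roughly like the sketch below. It rests on assumptions not stated in this write-up: that subtitles arrive in SRT format, that cleaning means dropping cue numbers, timestamps, and markup tags, and that chunks are overlapping word windows. The function names `clean_subtitle` and `chunk_words` and the default sizes are hypothetical.

```python
import re

# Assumption: SRT-style input. The project's actual cleaning rules are not documented.
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}")
TAG = re.compile(r"<[^>]+>")


def clean_subtitle(raw: str) -> str:
    """Drop cue numbers, timestamp lines, and markup tags; return plain dialogue text."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers (digit-only lines), and timestamp lines.
        if not line or line.isdigit() or TIMESTAMP.search(line):
            continue
        kept.append(TAG.sub("", line))
    # Collapse any repeated whitespace into single spaces.
    return " ".join(" ".join(kept).split())


def chunk_words(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Overlapping windows are one common way to keep a line of dialogue from being cut off from its context at a chunk boundary; the right window size depends on the embedding model's input limit.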

BERT Embedder

SentenceTransformers for vectorization

BERT, SentenceTransformers, PyTorch

Vector Database

ChromaDB for storing embeddings

ChromaDB, Vector Storage
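
As a minimal sketch of how the embedder and the vector database might fit together, the code below indexes chunks and runs a query. Everything specific here is an assumption: the model name `all-MiniLM-L6-v2`, the collection name `subtitles`, the in-memory client, the batch size, and the helper names `make_components`, `index_chunks`, and `search` do not come from the project itself.

```python
def make_components(model_name: str = "all-MiniLM-L6-v2", collection_name: str = "subtitles"):
    """Build a ChromaDB collection and an embedding function (both names are assumptions)."""
    # Imported lazily so the helpers below can be used without these packages installed.
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    client = chromadb.Client()  # in-memory client; a persistent one may suit a real deployment
    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},  # ask Chroma to use cosine distance
    )

    def embed_fn(texts):
        # encode() returns a NumPy array; convert to plain lists for Chroma.
        return model.encode(texts, normalize_embeddings=True).tolist()

    return collection, embed_fn


def index_chunks(collection, embed_fn, chunks, batch_size: int = 256) -> None:
    """Embed chunks in batches and add them to the collection with sequential ids."""
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        collection.add(
            ids=[f"chunk-{start + i}" for i in range(len(batch))],
            documents=batch,
            embeddings=embed_fn(batch),
        )


def search(collection, embed_fn, query: str, k: int = 5):
    """Return up to k (document, similarity) pairs for the query, best first."""
    result = collection.query(query_embeddings=embed_fn([query]), n_results=k)
    # In cosine space Chroma reports distance = 1 - similarity.
    return [(doc, 1.0 - dist) for doc, dist in zip(result["documents"][0], result["distances"][0])]
```

A typical call would be `collection, embed_fn = make_components()`, then `index_chunks(collection, embed_fn, chunks)` once, and `search(collection, embed_fn, "some query")` per request. Passing the collection and embedding function in as arguments keeps the indexing and search logic easy to test with stand-ins.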

Similarity Calculator

Cosine similarity for matching

NumPy, Cosine Similarity
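
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below uses plain Python rather than NumPy so it runs without dependencies; the function name and the choice to return 0.0 for a zero vector are assumptions made for this illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)
```

Because the score depends only on direction, a long chunk and a short query can still score highly when they point the same way in embedding space.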

Results Ranker

Ranks and returns top results

Python, Ranking Algorithm
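
The ranking algorithm is not described in detail here, so the following is only one plausible shape for it: take (document, score) pairs, drop anything below a threshold, and return the best k. The name `top_k` and the `min_score` parameter are hypothetical.

```python
import heapq


def top_k(scored: list[tuple[str, float]], k: int = 5, min_score: float = 0.0) -> list[tuple[str, float]]:
    """Return up to k (document, score) pairs with score >= min_score, highest score first."""
    candidates = [pair for pair in scored if pair[1] >= min_score]
    # nlargest returns its results sorted from largest to smallest.
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])
```

Using `heapq.nlargest` avoids sorting the full candidate list when only a handful of results are shown.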

Data Flow

Document Processor → BERT Embedder

Preprocessed subtitle chunks

30K processed documents

BERT Embedder → Vector Database

Generate and store embeddings

Vector embeddings

Query Input Interface → BERT Embedder

User search query

Query text

BERT Embedder → Similarity Calculator

Query embedding

Query vector

Vector Database → Similarity Calculator

Retrieve stored embeddings

Document vectors

Similarity Calculator → Results Ranker

Calculate cosine similarity scores

Similarity scores

Results Ranker → Query Input Interface

Return ranked results

Top matching subtitles

Architecture Flow

Query Input Interface

Document Processor

BERT Embedder

Vector Database

Similarity Calculator

Results Ranker

Kushal Adhyaru - AI/ML Engineer & Full-Stack Builder