Skip to main content

Build Your Own Search Engine

"A search engine is just an inverted index with a ranking function." — every IR class, day 1

Building a search engine is the cleanest way to learn the algorithms that quietly run a meaningful portion of the modern web: tokenization, stemming, inverted indexes, TF-IDF scoring, and (optionally) vector search and learning-to-rank.


1. Overview & motivation

A search pipeline:

documents → [tokenize] → terms
→ [normalize: lowercase, stem, stop-words]
→ [invert] → posting lists (term → list of doc IDs)
→ [score] → ranked results (TF-IDF or BM25)

query → tokenize → look up posting lists → merge → rank → top-K

What you can only learn by building one:

  • Why the inverted index is the central data structure of every text search system, from Lucene to Google.
  • Why TF-IDF is so durable — it captures something true about how language works.
  • Why BM25 beats TF-IDF in most modern systems (it's TF-IDF with two principled adjustments).
  • Why vector search (semantic similarity via embeddings) is replacing or augmenting traditional keyword search.
  • Why PageRank belongs to a different family (link-based) and complements text-based ranking.

2. Where this fits in the degree

  • Phase: Architecture
  • Semester: 6 (Databases and Distributed Systems) or 7 (Architecture and DDD)
  • Modules deepened: Sem 6 Module 2 (storage & indexing — inverted index is the textbook indexing problem). Sem 7 Module 2 (modular architecture — the search pipeline is a textbook layered design).

Cross-phase relevance:


3. Prerequisites

  • A general-purpose language: Python (easiest), Go, Java.
  • Basic understanding of TF-IDF (we'll re-derive it here).
  • Comfort with sorting and merging sorted lists.

4. Theory & research

Required reading

  • Manning, Raghavan, Schütze, Introduction to Information Retrieval — free online (nlp.stanford.edu/IR-book). The single canonical IR textbook. Chapters 1–6 cover everything in this tutorial. ⭐ start here.
  • Croft, Metzler, Strohman, Search Engines: Information Retrieval in Practice — alternative textbook. More implementation-focused.
  • Doug Turnbull & John Berryman, Relevant Search — practical book on Elasticsearch / Solr / Lucene-style search relevance.
  • Robertson & Zaragoza, "The Probabilistic Relevance Framework: BM25 and Beyond" (2009) — the canonical BM25 reference.

For modern (neural / vector) IR

  • Yates, Nogueira, Lin, "Pretrained Transformers for Text Ranking: BERT and Beyond" (2021) — comprehensive survey.
  • Faiss / HNSW — algorithms for approximate-nearest-neighbour vector search.

Implementation references

  • Apache Lucene source code — the canonical inverted index implementation. Java. The data structures (FST, skip lists, postings codec) are textbook examples.
  • Tantivy — Rust port of Lucene's design. More readable than Lucene.

5. Curated tutorial list (from BYO-X)

  • CSS: A search engine in CSS — gimmick, not really educational
  • Python: Building a search engine using Redis and redis-py — uses Redis sorted sets as posting lists
  • Python: Building a Vector Space Indexing Engine in Python — covers TF-IDF and cosine similarity ⭐ recommended primary
  • Python: Building A Python-Based Search Engine [video]
  • Python: Making text search learn from feedback
  • Python: Finding Important Words in Text Using TF-IDF

Additional canonical references (outside BYO-X)


Build it from scratch in Python, following Manning et al. as your textbook and Bart de Goede's repo as your reference implementation.

Phases:

  1. Tokenizer + simple inverted index over a small corpus.
  2. TF-IDF scoring.
  3. BM25 scoring.
  4. Optional: stemming, stop words, phrase queries, vector search.

This is the recommended path because IR techniques are all about understanding why. Following someone else's code without understanding the math leaves you unable to debug ranking issues.

For a more ready-made path: start from Bart de Goede's repo and modify.


7. Implementation milestones

Milestone 1: Corpus and tokenizer

Choose a corpus: Wikipedia dump, Project Gutenberg subset, your own documents. Read each document; split into terms.

import re

def tokenize(text):
return [t.lower() for t in re.findall(r"\w+", text)]

This naive tokenizer is enough for milestone 1. Real tokenizers handle Unicode, contractions, URLs, numbers.

Evidence: Tokenize a paragraph; print the term list.

Milestone 2: Inverted index

from collections import defaultdict

class Index:
def __init__(self):
self.postings = defaultdict(list) # term -> list of doc_ids
self.docs = [] # doc_id -> doc text

def add(self, doc):
doc_id = len(self.docs)
self.docs.append(doc)
for term in tokenize(doc):
if not self.postings[term] or self.postings[term][-1] != doc_id:
self.postings[term].append(doc_id)

def search(self, query):
terms = tokenize(query)
if not terms: return []
results = set(self.postings.get(terms[0], []))
for term in terms[1:]:
results &= set(self.postings.get(term, []))
return list(results)

Evidence: Boolean query "alice rabbit" returns documents containing both terms.

Milestone 3: TF-IDF scoring

import math

def tf_idf(term, doc_id, index):
tf = sum(1 for t in tokenize(index.docs[doc_id]) if t == term)
df = len(index.postings.get(term, []))
if df == 0: return 0
n = len(index.docs)
return (1 + math.log(tf, 10)) * math.log(n / df, 10) if tf > 0 else 0

Sort results by sum of TF-IDF over query terms.

Evidence: Query "alice" against a Wikipedia corpus returns the Alice in Wonderland article above random alice-mentioning articles.

Milestone 4: BM25

The Robertson-Spärck Jones formula:

score(q, d) = Σ IDF(t) * ( (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * |d|/avgdl)) )

k1 ≈ 1.2, b ≈ 0.75 (default Elasticsearch). |d| = doc length, avgdl = average doc length.

BM25 differs from TF-IDF in two ways:

  1. Term frequency saturation — diminishing returns from repeating a term.
  2. Document length normalization — long documents shouldn't dominate just by having more words.
def bm25_score(query_terms, doc_id, index, k1=1.2, b=0.75):
score = 0
n = len(index.docs)
avgdl = sum(len(tokenize(d)) for d in index.docs) / n
doc = tokenize(index.docs[doc_id])
dl = len(doc)
for t in query_terms:
df = len(index.postings.get(t, []))
if df == 0: continue
idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
tf = doc.count(t)
score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))
return score

Evidence: BM25 results visibly better than TF-IDF on a real corpus, especially for long documents.

Milestone 5: Normalization (stemming, stop words)

  • Stop words — remove "the", "and", "is", etc. Free precision; sometimes free recall.
  • Stemming — Porter stemmer reduces "running"/"runs"/"ran" to a common form.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

def normalize(text):
return [stemmer.stem(t) for t in tokenize(text) if t not in STOPWORDS]

Evidence: Query "running" matches documents with "ran" or "runs".

Milestone 6: Phrase queries

"alice in wonderland" should match consecutive terms, not just co-occurring terms. Store term positions in postings.

self.postings[term].append((doc_id, position))

Phrase match: for each candidate doc, check that positions of consecutive query terms differ by 1.

Evidence: Phrase query distinguishes "in alice wonderland" (no match) from "alice in wonderland" (match).

Milestone 7: Persistent on-disk index

Move the index to disk. Use sorted posting lists with binary-search lookup, or a B+ tree (or LevelDB) as the index store.

This is where you naturally use the KV store from earlier.

Evidence: Index 1 GB of text in N seconds. Query returns in milliseconds.

Embed each document into a ~768-dim vector (use a pre-trained model like all-MiniLM-L6-v2). Embed the query the same way. Compute cosine similarity.

For exact search: scan all documents. For approximate: HNSW or FAISS.

Evidence: Semantic query "transportation" matches documents about cars, trains, etc., even when the word "transportation" isn't in them.

Milestone 9 (optional): Hybrid ranking

Combine BM25 and vector scores. Either weighted sum or learning-to-rank.


8. Tests & evidence

TestHow
TokenizerHand-checked outputs for tricky inputs (URLs, contractions, numbers)
Inverted indexAfter indexing N docs, terms × avg_postings_size looks right
Boolean queryHand-traced examples
TF-IDF correctnessCompare against a hand-computed score for a tiny corpus
BM25 correctnessSame
Phrase queriesHand-traced examples
Quality on a real corpusPick 10 queries, judge top-10 results subjectively. Track precision@10.
PerformanceIndex a 1 GB corpus; query latency < 50 ms

The strongest evidence: a live search demo over a real corpus (a Wikipedia subset is fantastic), where reviewers can type queries.


9. Common pitfalls

  • Tokenization edge cases. "C++", "U.S.A.", "$100", "iPhone 14": each is a tokenizer decision. Document yours.
  • Stop words removal too aggressive. Removing "to" breaks "to be or not to be." Choose a small list, or use a stop-word list adapted to your corpus.
  • Stemming overaggressive. Porter stemming makes "university" and "universe" the same stem. Sometimes that's bad.
  • TF-IDF without log. Linear TF makes one frequent term dominate. Always use log-scaled TF.
  • BM25 parameter tuning. Defaults k1=1.2, b=0.75 are good. Don't tune them on a tiny dataset.
  • Phrase queries via post-filter. Filtering candidate docs is O(matches × phrase length). Use positional index for cleaner code.
  • Naive Python posting-list merge. Building Python sets and intersecting is correct but slow. Use sorted-list merge: O(n + m), not O(n × m).
  • Loading the entire index into RAM. Fine for a tutorial. Production engines memory-map files and access them as needed.

10. Extensions

  • Faceted search — filter by category, date range, tags.
  • Spelling correction — Levenshtein distance, candidate generation.
  • Autocomplete — prefix trie or finite-state transducer.
  • Highlighting — return matched snippets with <mark> tags around terms.
  • Distributed indexing — shard by hash(doc_id); query in parallel.
  • Learning to rank — train a model on click data; features include BM25, vector similarity, document age, page rank.
  • Click model logging — log every query and click, replay for relevance evaluation.
  • PageRank — for a corpus with links, add link-based ranking.

The path from "BM25 over docs" to "Elasticsearch" is long and well-documented; every step is interesting.


11. Module integration

ModuleWhat the search engine deepens
Sem 6 Module 2 — Storage & indexingInverted index is the textbook indexing problem.
Sem 7 Module 1 — ArchitectureThe pipeline (ingest → index → query → rank) is a classic layered architecture.
Sem 7 Module 2 — ModularityEach stage (tokenize, normalize, index, score) has a clean boundary.
KV store tutorialPosting lists stored in a KV.
Neural Network tutorialEmbedding-based search uses neural networks.
Regex Engine tutorialText-processing techniques overlap.

12. Portfolio framing

What to publish:

  • Source organized as tokenizer/, index/, score/, query/.
  • A live search demo over a published corpus (Wikipedia dump, Project Gutenberg, your blog archive).
  • A README with:
    • Index size, indexing time, query latency.
    • A relevance comparison: top-10 results for 10 queries.
    • Honest comparison vs Lucene/Elasticsearch on the same corpus.

Reviewer entry points:

  • index/inverted.py — the index implementation.
  • score/bm25.py — the ranking function.
  • query/parse.py — query parsing.
  • A web demo (Flask or FastAPI) where reviewers type queries.

A working search engine over a real corpus is one of the most satisfying portfolio pieces — it's interactive, the results are visible, and the underlying algorithms are deeply educational.


Source

This tutorial draws from the BYO-X catalog "Search Engine" section. Manning, Raghavan, Schütze's Introduction to Information Retrieval and Bart de Goede's repo are the canonical primary references.