How it works

Dead Simple Search is intentionally simple. This page explains what happens behind the scenes when you crawl a site and search it, with enough detail to help you understand the codebase if you want to contribute or customize it.

The big picture

The system has four main parts:

┌─────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────┐
│  Flask API  │────▶│   Crawler    │────▶│   Indexer    │────▶│  MySQL   │
│  (app.py)   │     │ (crawler.py) │     │ (indexer.py) │     │ Database │
└─────────────┘     └──────────────┘     └──────────────┘     └──────────┘
       │                                                           ▲
       │              ┌──────────────┐                             │
       └─────────────▶│   Search     │─────────────────────────────┘
                      │ (search.py)  │
                      └──────────────┘

The API is the front door — it receives your requests and coordinates everything. The Crawler goes out and fetches web pages. The Indexer extracts useful content from those pages and stores it. The Search module queries the database and returns ranked results.

The crawl process

When you trigger a crawl, here's what happens step by step:

1. Robots.txt check

The crawler first fetches robots.txt from the target domain. This is a text file that website owners use to tell bots which parts of their site are off-limits. Dead Simple Search respects these rules — if a page is marked as "don't crawl," it won't be crawled.
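
This check maps directly onto Python's standard library. A minimal sketch (the user-agent string here is illustrative, not necessarily what the project uses):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, agent: str = "DeadSimpleSearch") -> bool:
    """Check a URL against already-fetched robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

rules = """User-agent: *
Disallow: /private/
"""
print(is_allowed(rules, "https://example.com/docs/page"))  # True — not disallowed
print(is_allowed(rules, "https://example.com/private/x"))  # False — under /private/
```

The crawler only needs to fetch and parse robots.txt once per domain, then consult the parsed rules for every URL it considers.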

2. Sitemap discovery

Next, the crawler looks for sitemaps. A sitemap is an XML file that lists all the pages on a website — think of it as a table of contents. The crawler checks two places:

  • The robots.txt file itself, which can reference sitemaps
  • Common locations like /sitemap.xml and /sitemap_index.xml

If sitemaps are found, their URLs are used to seed the crawl queue. This is much more efficient than discovering pages only by following links.
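
Sitemap files follow a simple XML schema, so extracting URLs from one is a few lines with the standard library. A sketch (the same `<loc>` lookup works for both `<urlset>` page lists and `<sitemapindex>` files that point to child sitemaps):

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Pull every <loc> entry out of a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{NS}loc") if loc.text]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""
print(extract_sitemap_urls(sitemap))  # ['https://example.com/', 'https://example.com/about']
```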

3. Page-by-page crawling

The crawler works through a queue of URLs:

  • It fetches each page using an asynchronous HTTP client (this means it can handle network operations efficiently without blocking)
  • It only processes HTML pages — PDFs, images, and other files are skipped
  • It waits between requests (the "crawl delay") to avoid overwhelming the target server
  • It extracts links from each page and adds new, unvisited ones to the queue
  • It stays within the original domain — it won't follow links to other websites
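
The link-handling rules in the list above (same domain only, no duplicates, HTTP(S) only) can be sketched as a single queueing step; the function name and signature here are illustrative, not the project's actual API:

```python
from urllib.parse import urljoin, urlparse

def enqueue_links(base_url: str, hrefs: list[str], seen: set[str], queue: list[str]) -> None:
    """Resolve links relative to the current page and queue only new, same-domain URLs."""
    domain = urlparse(base_url).netloc
    for href in hrefs:
        url = urljoin(base_url, href).split("#", 1)[0]  # resolve, then drop fragments
        parts = urlparse(url)
        if parts.scheme in ("http", "https") and parts.netloc == domain and url not in seen:
            seen.add(url)
            queue.append(url)

seen, queue = {"https://example.com/"}, []
enqueue_links("https://example.com/",
              ["/about", "https://other.org/x", "/about", "mailto:hi@example.com"],
              seen, queue)
print(queue)  # ['https://example.com/about'] — other-domain, duplicate, and mailto links dropped
```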

4. Indexing

For each page, the indexer extracts:

  • Title — from the <title> HTML tag
  • Meta description — the short summary that often appears in search results
  • H1 and H2 headings — the main headings on the page
  • Body text — all the visible text, with scripts, navigation, and other non-content elements removed
  • Language — detected from the HTML lang attribute or guessed from the text itself
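
A simplified sketch of the extraction step, using only the standard library's HTMLParser (the real indexer also handles meta descriptions and void tags like `<br>`, which this toy version ignores):

```python
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Collect the title, H1/H2 headings, and visible body text from HTML."""
    SKIP = {"script", "style", "nav"}  # non-content elements whose text is dropped

    def __init__(self):
        super().__init__()
        self.title, self.headings, self.body = "", [], []
        self._stack = []  # open tags enclosing the current text node

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in self.SKIP for t in self._stack):
            return
        if "title" in self._stack:
            self.title = text
        elif self._stack and self._stack[-1] in ("h1", "h2"):
            self.headings.append(text)
        else:
            self.body.append(text)

p = PageExtractor()
p.feed("<html><head><title>Hi</title><script>var x;</script></head>"
       "<body><h1>Welcome</h1><p>Some text.</p></body></html>")
print(p.title, p.headings, p.body)  # script contents never reach the body text
```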

This data is stored in MySQL using an "upsert" pattern: if the page already exists in the database (based on its URL), the record is updated; otherwise, a new record is created.
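
In MySQL the upsert is spelled `INSERT ... ON DUPLICATE KEY UPDATE`. The demo below uses SQLite's equivalent `ON CONFLICT` clause so it runs anywhere without a database server; the column names are illustrative, not the project's exact schema:

```python
import sqlite3

# MySQL spelling of the same pattern:
#   INSERT INTO pages (url, title, body_text) VALUES (%s, %s, %s)
#   ON DUPLICATE KEY UPDATE title = VALUES(title), body_text = VALUES(body_text);
UPSERT = """
INSERT INTO pages (url, title, body_text) VALUES (?, ?, ?)
ON CONFLICT(url) DO UPDATE SET title = excluded.title, body_text = excluded.body_text
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, body_text TEXT)")
conn.execute(UPSERT, ("https://example.com/", "Old title", "v1"))
conn.execute(UPSERT, ("https://example.com/", "New title", "v2"))  # same URL: updates in place
rows = conn.execute("SELECT url, title FROM pages").fetchall()
print(rows)  # [('https://example.com/', 'New title')] — one row, with the updated title
```

The upshot: re-crawling a site never creates duplicate rows, and stale content is refreshed in a single statement.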

The search process

When you send a search query:

  1. Mode detection — the search module checks whether your query contains special operators such as +, -, or quotation marks. If it does, it uses MySQL's "boolean mode," which gives you more control. Otherwise, it uses "natural language mode," which is simpler and works well for everyday searches.

  2. Full-text matching — MySQL's FULLTEXT index is the engine behind the search. It searches across the title, meta description, headings, and body text simultaneously. MySQL calculates a relevance score for each matching page.

  3. Ranking — results are sorted by relevance, best matches first. The relevance score takes into account things like how often the search terms appear and where they appear (a match in the title is worth more than a match buried in the body text).

  4. Pagination — results are returned in pages (20 results at a time by default, configurable up to 100).
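
The mode-detection step reduces to a small check, and the chosen modifier then slots into a `MATCH ... AGAINST` query. A sketch (column names in the commented SQL are illustrative):

```python
def choose_fulltext_mode(query: str) -> str:
    """Pick the MySQL MATCH ... AGAINST modifier based on operators in the query."""
    boolean_operators = set('+-"')
    if any(ch in boolean_operators for ch in query):
        return "IN BOOLEAN MODE"
    return "IN NATURAL LANGUAGE MODE"

# The modifier is then interpolated into a query shaped roughly like:
#   SELECT url, title,
#          MATCH(title, description, headings, body_text) AGAINST (%s <mode>) AS score
#   FROM pages
#   WHERE MATCH(title, description, headings, body_text) AGAINST (%s <mode>)
#   ORDER BY score DESC
#   LIMIT 20 OFFSET 0;

print(choose_fulltext_mode("simple flask search"))        # IN NATURAL LANGUAGE MODE
print(choose_fulltext_mode('+flask -django "full-text"')) # IN BOOLEAN MODE
```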

The database

Dead Simple Search uses three tables:

sites — one row per registered website. Stores the domain, start URL, and whether automatic crawling is enabled.

pages — one row per indexed page. This is where all the extracted content lives. It has a FULLTEXT index across the title, description, headings, and body text, which is what makes search fast.

crawl_log — a history of all crawl runs. Each entry records when the crawl started and finished, how many pages were crawled, and whether it succeeded.
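
As a rough sketch, the pages table might be declared along these lines. The column names and sizes here are assumptions for illustration, not the project's exact schema; the important part is the FULLTEXT index spanning the searchable columns:

```sql
-- Illustrative DDL, not the project's exact schema
CREATE TABLE pages (
    id          BIGINT AUTO_INCREMENT PRIMARY KEY,
    site_id     BIGINT NOT NULL,
    url         VARCHAR(768) NOT NULL UNIQUE,   -- the upsert key
    title       TEXT,
    description TEXT,
    headings    TEXT,
    body_text   MEDIUMTEXT,
    language    VARCHAR(8),
    FULLTEXT KEY ft_content (title, description, headings, body_text)
) ENGINE=InnoDB;
```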

File structure

The codebase is small — about 800 lines of Python across 8 files:

File          Purpose
app.py        The Flask web application and all API endpoints
config.py     Configuration via environment variables
database.py   MySQL connection pool and table creation
crawler.py    The async web crawler
indexer.py    HTML parsing and database storage
sitemap.py    Sitemap discovery and XML parsing
search.py     Full-text search logic
scheduler.py  Optional scheduled re-crawling

Design principles

A few principles guide the project:

Boring technology. Python, Flask, and MySQL are mature, well-documented, and widely supported. You can find help on any search engine, forum, or chat room.

No magic. All SQL is hand-written. There's no ORM (object-relational mapper) hiding what's happening. When you read the code, you see exactly what queries are running.

Small surface area. The codebase is deliberately small. Every module fits on a screen or two. There are no deep abstraction layers to navigate.

Pragmatic trade-offs. The snippet in search results is the first 300 characters of body text — not a contextual window around the matched terms. Is that ideal? No. Is it simple, fast, and good enough for most cases? Yes.