sFetch

sFetch is a full-stack search engine prototype with a lightweight Google/DDG-inspired frontend, a FastAPI search API, and an async crawler that indexes pages into a local SQLite FTS5 database.

On first backend launch, sFetch downloads the latest Tranco top-site list, filters out pornographic/adult domains, and seeds up to 1,000 of the remaining sites, skipping the step if a previous seeding run is already recorded in the database.
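
A minimal sketch of that first-launch check, assuming an app_meta key/value table and using stubbed helpers in place of the real Tranco download and adult-domain filter (the key name and helper names here are illustrative, not the project's actual identifiers; the real logic lives in backend/top_sites.py and backend/main.py):

import asyncio

import aiosqlite

SEED_KEY = "top_sites_seeded"  # assumed app_meta key name

async def fetch_top_sites() -> list[str]:
    # Stub: the real code downloads and unzips the Tranco list.
    return ["example.com", "example.org"]

def is_adult(domain: str) -> bool:
    # Stub: the real check lives in content_filter.py.
    return False

async def seed_top_sites_once(db_path: str) -> None:
    async with aiosqlite.connect(db_path) as db:
        await db.execute(
            "CREATE TABLE IF NOT EXISTS app_meta (key TEXT PRIMARY KEY, value TEXT)"
        )
        cur = await db.execute(
            "SELECT value FROM app_meta WHERE key = ?", (SEED_KEY,)
        )
        if await cur.fetchone():
            return  # seeding already recorded on a previous launch
        seeds = [d for d in await fetch_top_sites() if not is_adult(d)][:1000]
        # ...queue a crawl of `seeds` here...
        await db.execute(
            "INSERT INTO app_meta (key, value) VALUES (?, ?)", (SEED_KEY, "done")
        )
        await db.commit()

asyncio.run(seed_top_sites_once("sfetch.db"))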

Project Structure

sFetch/
├── backend/
│   ├── main.py
│   ├── crawler.py
│   ├── top_sites.py
│   ├── content_filter.py
│   ├── indexer.py
│   ├── searcher.py
│   ├── models.py
│   ├── database.py
│   ├── config.py
│   └── requirements.txt
├── frontend/
│   ├── index.html
│   └── results.html
└── README.md

Setup

  1. Create a virtual environment and install the backend dependencies:

    cd backend
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    
  2. Start the API:

    uvicorn main:app --reload
    
  3. Open frontend/index.html in your browser.

Each frontend page script declares const API_BASE = "http://localhost:8000"; at the top; change this value if the API is served elsewhere.

Crawling

The home page has index controls for:

  • seeding the top 1,000 non-adult sites
  • launching a custom crawl with seed URLs, depth, per-domain page limits, and same-domain filtering
  • viewing current index and seed status

You can also call the API directly:

curl -X POST "http://localhost:8000/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "seed_urls": ["https://example.com"],
    "max_depth": 2,
    "max_pages_per_domain": 50,
    "same_domain_only": true
  }'

Seed the top-site list manually:

curl -X POST "http://localhost:8000/crawl/top-sites"

The crawler (see the sketch after this list):

  • respects robots.txt
  • filters adult URLs and adult-heavy page text
  • stays on the same domain by default
  • avoids revisiting URLs
  • indexes HTML pages, images, and videos into SQLite
  • records top-site seeding completion in app_meta
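
The list above boils down to a handful of guard clauses in the crawl loop. A simplified sketch, assuming httpx plus the standard-library robots.txt parser; the user agent string and function names are illustrative, and the actual fetching, content filtering, and indexing steps are elided:

import asyncio
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import httpx

USER_AGENT = "sFetchBot"  # assumed; see USER_AGENT in backend/config.py

async def allowed_by_robots(client: httpx.AsyncClient, url: str) -> bool:
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    try:
        resp = await client.get(robots_url)
    except httpx.HTTPError:
        return True  # unreachable robots.txt: assume allowed
    if resp.status_code != 200:
        return True  # no robots.txt published: assume allowed
    rp = RobotFileParser()
    rp.parse(resp.text.splitlines())
    return rp.can_fetch(USER_AGENT, url)

async def crawl(seed: str, same_domain_only: bool = True) -> None:
    visited: set[str] = set()
    queue = [seed]
    seed_domain = urlparse(seed).netloc
    async with httpx.AsyncClient(headers={"User-Agent": USER_AGENT}) as client:
        while queue:
            url = queue.pop()
            if url in visited:
                continue  # avoid revisiting URLs
            if same_domain_only and urlparse(url).netloc != seed_domain:
                continue  # stay on the seed's domain by default
            if not await allowed_by_robots(client, url):
                continue  # respect robots.txt
            visited.add(url)
            # ...fetch the page, filter adult content, index, enqueue links...

asyncio.run(crawl("https://example.com"))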

API Endpoints

Method  Path                     Purpose
GET     /                        Health check
GET     /search                  Full-text search endpoint
POST    /crawl                   Start a custom background crawl job
POST    /crawl/top-sites         Queue the top-site seed crawl
GET     /crawl/top-sites/status  Check top-site seed state
GET     /stats                   Total indexed pages and latest index time
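
For quick testing from Python, the search endpoint can be queried with httpx (already a backend dependency). The q parameter name is an assumption here; check the GET /search handler in backend/main.py for the actual signature:

import httpx

resp = httpx.get("http://localhost:8000/search", params={"q": "example"})
resp.raise_for_status()
print(resp.json())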

Configuration

sFetch's crawl and storage behavior lives in backend/config.py:

Setting                    Description
MAX_CRAWL_DEPTH            Default link depth followed from each seed URL
MAX_PAGES_PER_DOMAIN       Default per-domain crawl cap
CRAWL_DELAY_SECONDS        Delay before each request
DEFAULT_CRAWL_CONCURRENCY  Concurrent fetch limit
DB_PATH                    SQLite database path
TOP_SITE_SOURCE_URL        Top-site list source
TOP_SITE_SEED_LIMIT        Number of safe top sites to seed
USER_AGENT                 User-Agent string the crawler identifies itself with (sFetchBot)
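
As a rough picture, config.py is a flat module of constants along these lines; the values below are illustrative defaults, not the project's actual settings:

# Illustrative sketch of backend/config.py; values are assumptions.
MAX_CRAWL_DEPTH = 2
MAX_PAGES_PER_DOMAIN = 50
CRAWL_DELAY_SECONDS = 1.0
DEFAULT_CRAWL_CONCURRENCY = 5
DB_PATH = "sfetch.db"
TOP_SITE_SOURCE_URL = "https://tranco-list.eu/top-1m.csv.zip"
TOP_SITE_SEED_LIMIT = 1000
USER_AGENT = "sFetchBot"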

Tech Stack

Layer         Technology
Frontend      HTML, Tailwind CSS (CDN), vanilla JavaScript
Backend       Python, FastAPI
Crawler       Python, httpx, BeautifulSoup4, asyncio
Search Index  SQLite FTS5 via aiosqlite
Top Sites     Tranco daily top-site ZIP with bundled fallback