# sFetch
sFetch is a full-stack search engine prototype with a lightweight Google/DDG-inspired frontend, a FastAPI search API, and an async crawler that indexes pages into a local SQLite FTS5 database.
On first backend launch, sFetch downloads the latest Tranco top-site list, filters pornographic/adult domains, and seeds up to 1,000 non-adult sites if that seed has not already been recorded in the database.
## Project Structure

```
sFetch/
├── backend/
│   ├── main.py
│   ├── crawler.py
│   ├── top_sites.py
│   ├── content_filter.py
│   ├── indexer.py
│   ├── searcher.py
│   ├── models.py
│   ├── database.py
│   ├── config.py
│   └── requirements.txt
├── frontend/
│   ├── index.html
│   └── results.html
└── README.md
```
## Setup

1. Create a virtual environment and install the backend dependencies:

   ```bash
   cd backend
   python3 -m venv venv
   source venv/bin/activate
   pip install -r requirements.txt
   ```

2. Start the API:

   ```bash
   uvicorn main:app --reload
   ```

3. Open `frontend/index.html` in your browser.
The frontend uses `const API_BASE = "http://localhost:8000";` at the top of each page script.
## Crawling

The home page provides indexing controls for:
- seeding the top 1,000 non-adult sites
- launching a custom crawl with seed URLs, depth, per-domain page limits, and same-domain filtering
- viewing current index and seed status
You can also call the API directly:
```bash
curl -X POST "http://localhost:8000/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "seed_urls": ["https://example.com"],
    "max_depth": 2,
    "max_pages_per_domain": 50,
    "same_domain_only": true
  }'
```
Seed the top-site list manually:

```bash
curl -X POST "http://localhost:8000/crawl/top-sites"
```
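The seed crawl runs only once per database: sFetch records completion in `app_meta` and skips re-seeding on later launches. A minimal sketch of that guard, using an in-memory database and an assumed key/value schema (the actual table layout in `database.py` may differ):

```python
import sqlite3

# Illustrative app_meta table; sFetch's real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS app_meta (key TEXT PRIMARY KEY, value TEXT)"
)

def top_sites_seeded(conn: sqlite3.Connection) -> bool:
    """Return True if the top-site seed crawl has already been recorded."""
    row = conn.execute(
        "SELECT value FROM app_meta WHERE key = 'top_sites_seeded'"
    ).fetchone()
    return row is not None and row[0] == "1"

def mark_top_sites_seeded(conn: sqlite3.Connection) -> None:
    """Record that the top-site seed crawl completed."""
    conn.execute(
        "INSERT OR REPLACE INTO app_meta (key, value) "
        "VALUES ('top_sites_seeded', '1')"
    )
    conn.commit()
```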
The crawler:
- respects `robots.txt`
- filters adult URLs and adult-heavy page text
- stays on the same domain by default
- avoids revisiting URLs
- indexes HTML pages, images, and videos into SQLite
- records top-site seeding completion in `app_meta`
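The `robots.txt` check can be done with the standard library's `urllib.robotparser`; this sketch parses an inline rule set rather than fetching one over the network, and the `sFetchBot` agent string is taken from the project's `USER_AGENT` setting:

```python
from urllib import robotparser

# Sample rules standing in for a fetched robots.txt file.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

def allowed(url: str, agent: str = "sFetchBot") -> bool:
    """Return True if the agent may fetch the URL under these rules."""
    return rp.can_fetch(agent, url)
```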
## API Endpoints

| Method | Path | Purpose |
|---|---|---|
| GET | `/` | Health check |
| GET | `/search` | Full-text search endpoint |
| POST | `/crawl` | Start a custom background crawl job |
| POST | `/crawl/top-sites` | Queue the top-site seed crawl |
| GET | `/crawl/top-sites/status` | Check top-site seed state |
| GET | `/stats` | Total indexed pages and latest index time |
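The `/search` endpoint is backed by SQLite FTS5. A self-contained sketch of how such an index can be queried, using an in-memory database and illustrative table and column names (the project's `indexer.py`/`searcher.py` may use different ones):

```python
import sqlite3

# In-memory FTS5 index; `url` is stored but excluded from full-text matching.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE pages USING fts5(url UNINDEXED, title, body)"
)
conn.executemany(
    "INSERT INTO pages (url, title, body) VALUES (?, ?, ?)",
    [
        ("https://example.com", "Example Domain", "This domain is for examples."),
        ("https://example.org", "Another Page", "Nothing to see here."),
    ],
)

def search(query: str) -> list[tuple[str, str]]:
    """Return (url, title) hits ordered by FTS5's built-in relevance rank."""
    return conn.execute(
        "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
```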
## Configuration

sFetch's crawl and storage behavior lives in `backend/config.py`:

| Setting | Description |
|---|---|
| `MAX_CRAWL_DEPTH` | Default link depth followed from each seed URL |
| `MAX_PAGES_PER_DOMAIN` | Default per-domain crawl cap |
| `CRAWL_DELAY_SECONDS` | Delay before each request |
| `DEFAULT_CRAWL_CONCURRENCY` | Concurrent fetch limit |
| `DB_PATH` | SQLite database path |
| `TOP_SITE_SOURCE_URL` | Top-site list source |
| `TOP_SITE_SEED_LIMIT` | Number of safe top sites to seed |
| `USER_AGENT` | User agent sent by sFetchBot |
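For orientation, a `config.py` along these lines would satisfy the table above; every value shown here is illustrative, not the project's actual default:

```python
# Illustrative defaults for backend/config.py; actual values may differ.
MAX_CRAWL_DEPTH = 2               # default link depth followed from each seed URL
MAX_PAGES_PER_DOMAIN = 50         # default per-domain crawl cap
CRAWL_DELAY_SECONDS = 1.0         # delay before each request
DEFAULT_CRAWL_CONCURRENCY = 10    # concurrent fetch limit
DB_PATH = "sfetch.db"             # SQLite database path
TOP_SITE_SOURCE_URL = "https://tranco-list.eu/top-1m.csv.zip"  # assumed list source
TOP_SITE_SEED_LIMIT = 1000        # number of safe top sites to seed
USER_AGENT = "sFetchBot/1.0"      # user agent string (assumed format)
```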
## Tech Stack
| Layer | Technology |
|---|---|
| Frontend | HTML, TailwindCSS CDN, Vanilla JavaScript |
| Backend | Python, FastAPI |
| Crawler | Python, httpx, BeautifulSoup4, asyncio |
| Search Index | SQLite FTS5 via aiosqlite |
| Top Sites | Tranco daily top-site ZIP with bundled fallback |