sFetch

sFetch is a full-stack search engine prototype with a lightweight Google/DDG-inspired frontend, a FastAPI search API, and an async crawler that indexes pages into a local SQLite FTS5 database.

On first backend launch, sFetch downloads the latest Tranco top-site list, filters out pornographic/adult domains, and seeds up to 1,000 of the remaining sites, skipping the step if a previous seeding run is already recorded in the database.
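
A minimal sketch of that first-launch check, assuming an app_meta key/value table and using stubbed helpers in place of the real Tranco download and adult-domain filter (the key name and helper names here are illustrative, not the project's actual identifiers; the real logic lives in backend/top_sites.py and backend/main.py):

import asyncio

import aiosqlite

SEED_KEY = "top_sites_seeded"  # assumed app_meta key name

async def fetch_top_sites() -> list[str]:
    # Stub: the real code downloads and unzips the Tranco list.
    return ["example.com", "example.org"]

def is_adult(domain: str) -> bool:
    # Stub: the real check lives in content_filter.py.
    return False

async def seed_top_sites_once(db_path: str) -> None:
    async with aiosqlite.connect(db_path) as db:
        await db.execute(
            "CREATE TABLE IF NOT EXISTS app_meta (key TEXT PRIMARY KEY, value TEXT)"
        )
        cur = await db.execute(
            "SELECT value FROM app_meta WHERE key = ?", (SEED_KEY,)
        )
        if await cur.fetchone():
            return  # seeding already recorded on a previous launch
        seeds = [d for d in await fetch_top_sites() if not is_adult(d)][:1000]
        # ...queue a crawl of `seeds` here...
        await db.execute(
            "INSERT INTO app_meta (key, value) VALUES (?, ?)", (SEED_KEY, "done")
        )
        await db.commit()

asyncio.run(seed_top_sites_once("sfetch.db"))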

Project Structure

sFetch/
├── backend/
│   ├── main.py
│   ├── crawler.py
│   ├── top_sites.py
│   ├── content_filter.py
│   ├── indexer.py
│   ├── searcher.py
│   ├── models.py
│   ├── database.py
│   ├── config.py
│   └── requirements.txt
├── frontend/
│   ├── index.html
│   └── results.html
└── README.md

Setup

  1. Create a virtual environment and install the backend dependencies:

    cd backend
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    
  2. Start the API:

    uvicorn main:app --reload
    
  3. Open frontend/index.html in your browser.

Each frontend page script declares const API_BASE = "http://localhost:8000"; at the top; change this value if the API is served elsewhere.

Crawling

The home page has index controls for:

  • seeding the top 1,000 non-adult sites
  • launching a custom crawl with seed URLs, depth, per-domain page limits, and same-domain filtering
  • viewing current index and seed status

You can also call the API directly:

curl -X POST "http://localhost:8000/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "seed_urls": ["https://example.com"],
    "max_depth": 2,
    "max_pages_per_domain": 50,
    "same_domain_only": true
  }'

Seed the top-site list manually:

curl -X POST "http://localhost:8000/crawl/top-sites"

The crawler (see the sketch after this list):

  • respects robots.txt
  • filters adult URLs and adult-heavy page text
  • stays on the same domain by default
  • avoids revisiting URLs
  • indexes HTML pages, images, and videos into SQLite
  • records top-site seeding completion in app_meta
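
The list above boils down to a handful of guard clauses in the crawl loop. A simplified sketch, assuming httpx plus the standard-library robots.txt parser; the user agent string and function names are illustrative, and the actual fetching, content filtering, and indexing steps are elided:

import asyncio
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import httpx

USER_AGENT = "sFetchBot"  # assumed; see USER_AGENT in backend/config.py

async def allowed_by_robots(client: httpx.AsyncClient, url: str) -> bool:
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    try:
        resp = await client.get(robots_url)
    except httpx.HTTPError:
        return True  # unreachable robots.txt: assume allowed
    if resp.status_code != 200:
        return True  # no robots.txt published: assume allowed
    rp = RobotFileParser()
    rp.parse(resp.text.splitlines())
    return rp.can_fetch(USER_AGENT, url)

async def crawl(seed: str, same_domain_only: bool = True) -> None:
    visited: set[str] = set()
    queue = [seed]
    seed_domain = urlparse(seed).netloc
    async with httpx.AsyncClient(headers={"User-Agent": USER_AGENT}) as client:
        while queue:
            url = queue.pop()
            if url in visited:
                continue  # avoid revisiting URLs
            if same_domain_only and urlparse(url).netloc != seed_domain:
                continue  # stay on the seed's domain by default
            if not await allowed_by_robots(client, url):
                continue  # respect robots.txt
            visited.add(url)
            # ...fetch the page, filter adult content, index, enqueue links...

asyncio.run(crawl("https://example.com"))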

API Endpoints

Method  Path                     Purpose
GET     /                        Health check
GET     /search                  Full-text search endpoint
POST    /crawl                   Start a custom background crawl job
POST    /crawl/top-sites         Queue the top-site seed crawl
GET     /crawl/top-sites/status  Check top-site seed state
GET     /stats                   Total indexed pages and latest index time
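
For quick testing from Python, the search endpoint can be queried with httpx (already a backend dependency). The q parameter name is an assumption here; check the GET /search handler in backend/main.py for the actual signature:

import httpx

resp = httpx.get("http://localhost:8000/search", params={"q": "example"})
resp.raise_for_status()
print(resp.json())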

Configuration

sFetch's crawl and storage behavior lives in backend/config.py:

Setting                    Description
MAX_CRAWL_DEPTH            Default link depth followed from each seed URL
MAX_PAGES_PER_DOMAIN       Default per-domain crawl cap
CRAWL_DELAY_SECONDS        Delay before each request
DEFAULT_CRAWL_CONCURRENCY  Concurrent fetch limit
DB_PATH                    SQLite database path
TOP_SITE_SOURCE_URL        Top-site list source
TOP_SITE_SEED_LIMIT        Number of safe top sites to seed
USER_AGENT                 User-Agent string the crawler identifies itself with (sFetchBot)
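
As a rough picture, config.py is a flat module of constants along these lines; the values below are illustrative defaults, not the project's actual settings:

# Illustrative sketch of backend/config.py; values are assumptions.
MAX_CRAWL_DEPTH = 2
MAX_PAGES_PER_DOMAIN = 50
CRAWL_DELAY_SECONDS = 1.0
DEFAULT_CRAWL_CONCURRENCY = 5
DB_PATH = "sfetch.db"
TOP_SITE_SOURCE_URL = "https://tranco-list.eu/top-1m.csv.zip"
TOP_SITE_SEED_LIMIT = 1000
USER_AGENT = "sFetchBot"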

Tech Stack

Layer         Technology
Frontend      HTML, Tailwind CSS (CDN), vanilla JavaScript
Backend       Python, FastAPI
Crawler       Python, httpx, BeautifulSoup4, asyncio
Search Index  SQLite FTS5 via aiosqlite
Top Sites     Tranco daily top-site ZIP with bundled fallback