# sFetch

sFetch is a full-stack search engine prototype with a lightweight Google/DDG-inspired frontend, a FastAPI search API, and an async crawler that indexes pages into a local SQLite FTS5 database.

On first backend launch, sFetch downloads the latest Tranco top-site list, filters out pornographic/adult domains, and seeds up to 1,000 non-adult sites if that seed has not already been recorded in the database.

## Project Structure

```text
sFetch/
├── backend/
│   ├── main.py
│   ├── crawler.py
│   ├── top_sites.py
│   ├── content_filter.py
│   ├── indexer.py
│   ├── searcher.py
│   ├── models.py
│   ├── database.py
│   ├── config.py
│   └── requirements.txt
├── frontend/
│   ├── index.html
│   └── results.html
└── README.md
```

## Setup

1. Create a virtual environment and install the backend dependencies:

   ```bash
   cd backend
   python3 -m venv venv
   source venv/bin/activate
   pip install -r requirements.txt
   ```

2. Start the API:

   ```bash
   uvicorn main:app --reload
   ```

3. Open `frontend/index.html` in your browser. The frontend uses `const API_BASE = "http://localhost:8000";` at the top of each page script.
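The adult-domain filtering step described above can be sketched as follows. This is a minimal illustration only: `ADULT_KEYWORDS` and `filter_top_sites` are hypothetical names, not necessarily those used in `backend/top_sites.py` or `backend/content_filter.py`.

```python
# Illustrative sketch of the top-site seeding filter; ADULT_KEYWORDS and
# filter_top_sites are assumed names, not the project's actual identifiers.
ADULT_KEYWORDS = ("porn", "xxx", "sex", "adult")


def filter_top_sites(domains, limit=1000):
    """Drop domains containing adult keywords, then cap the list at `limit`."""
    safe = [d for d in domains if not any(k in d.lower() for k in ADULT_KEYWORDS)]
    return safe[:limit]


ranked = ["example.com", "xxxmovies.test", "wikipedia.org"]
print(filter_top_sites(ranked))  # ['example.com', 'wikipedia.org']
```

A simple substring check like this over-blocks (e.g. domains containing "sex" as part of an unrelated word), so a real filter would likely use a curated blocklist as well.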
## Crawling

The home page has index controls for:

- seeding the top 1,000 non-adult sites
- launching a custom crawl with seed URLs, depth, per-domain page limits, and same-domain filtering
- viewing current index and seed status

You can also call the API directly:

```bash
curl -X POST "http://localhost:8000/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "seed_urls": ["https://example.com"],
    "max_depth": 2,
    "max_pages_per_domain": 50,
    "same_domain_only": true
  }'
```

Seed the top-site list manually:

```bash
curl -X POST "http://localhost:8000/crawl/top-sites"
```

The crawler:

- respects `robots.txt`
- filters adult URLs and adult-heavy page text
- stays on the same domain by default
- avoids revisiting URLs
- indexes HTML pages, images, and videos into SQLite
- records top-site seeding completion in `app_meta`

## API Endpoints

| Method | Path | Purpose |
| --- | --- | --- |
| `GET` | `/` | Health check |
| `GET` | `/search` | Full-text search endpoint |
| `POST` | `/crawl` | Start a custom background crawl job |
| `POST` | `/crawl/top-sites` | Queue the top-site seed crawl |
| `GET` | `/crawl/top-sites/status` | Check top-site seed state |
| `GET` | `/stats` | Total indexed pages and latest index time |

## Configuration

sFetch's crawl and storage behavior lives in `backend/config.py`:

| Setting | Description |
| --- | --- |
| `MAX_CRAWL_DEPTH` | Default link depth followed from each seed URL |
| `MAX_PAGES_PER_DOMAIN` | Default per-domain crawl cap |
| `CRAWL_DELAY_SECONDS` | Delay between requests |
| `DEFAULT_CRAWL_CONCURRENCY` | Concurrent fetch limit |
| `DB_PATH` | SQLite database path |
| `TOP_SITE_SOURCE_URL` | Top-site list source |
| `TOP_SITE_SEED_LIMIT` | Number of safe top sites to seed |
| `USER_AGENT` | User-agent string identifying the crawler (`sFetchBot`) |

## Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | HTML, TailwindCSS CDN, Vanilla JavaScript |
| Backend | Python, FastAPI |
| Crawler | Python, `httpx`, `BeautifulSoup4`, `asyncio` |
| Search Index | SQLite FTS5 via `aiosqlite` |
| Top Sites | Tranco daily top-site ZIP with bundled fallback |
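The `robots.txt` checks the crawler performs can be illustrated with the standard library's `urllib.robotparser`. This is a minimal sketch, not the project's actual code: `backend/crawler.py` fetches pages asynchronously with `httpx`, so its real robots handling is presumably async as well.

```python
from urllib.robotparser import RobotFileParser

# Minimal illustration of robots.txt enforcement; the real crawler
# fetches robots.txt itself rather than parsing a hard-coded string.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("sFetchBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("sFetchBot", "https://example.com/public"))        # True
```

A polite crawler checks `can_fetch()` before every request and honors any `Crawl-delay` directive on top of its own `CRAWL_DELAY_SECONDS` setting.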
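The FTS5 index at the heart of the search stack can be sketched as below. For brevity this uses the synchronous `sqlite3` module rather than `aiosqlite`, and the table and column names (`pages`, `url`, `title`, `body`) are illustrative assumptions, not the project's actual schema.

```python
import sqlite3

# Sketch of an FTS5 full-text index like the one sFetch builds.
# Table/column names here are assumed, not taken from backend/indexer.py.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url, title, body)")
conn.execute(
    "INSERT INTO pages VALUES (?, ?, ?)",
    ("https://example.com", "Example Domain", "This domain is for illustrative examples."),
)

# MATCH runs a full-text query; bm25() orders results by relevance
# (lower bm25 score = better match).
rows = conn.execute(
    "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY bm25(pages)",
    ("illustrative",),
).fetchall()
print(rows)  # [('https://example.com', 'Example Domain')]
```

Because FTS5 handles tokenization and ranking inside SQLite itself, the backend needs no external search server, which fits the single-file `DB_PATH` storage model.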