# sFetch
sFetch is a full-stack search engine prototype with a polished search interface, a FastAPI search API, Ollama Cloud-powered AI answers, and an async crawler that indexes pages into a local SQLite FTS5 database.

On first backend launch, sFetch downloads the latest Tranco top-site list, filters out pornographic/adult domains, and seeds up to 1,000 non-adult sites, unless that seed run has already been recorded in the database.
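The seed-once behavior can be sketched with a flag in the `app_meta` table (the table name appears in the crawler notes below; the key name and schema here are illustrative assumptions, not the actual sFetch schema):

```python
import sqlite3

def seeded_before(conn: sqlite3.Connection, key: str = "top_sites_seeded") -> bool:
    """Return True if a completed top-site seed run is recorded in app_meta."""
    conn.execute("CREATE TABLE IF NOT EXISTS app_meta (key TEXT PRIMARY KEY, value TEXT)")
    row = conn.execute("SELECT value FROM app_meta WHERE key = ?", (key,)).fetchone()
    return row is not None and row[0] == "1"

def record_seeded(conn: sqlite3.Connection, key: str = "top_sites_seeded") -> None:
    """Mark the seed run complete so later launches skip it."""
    conn.execute("INSERT OR REPLACE INTO app_meta (key, value) VALUES (?, '1')", (key,))
    conn.commit()

conn = sqlite3.connect(":memory:")
print(seeded_before(conn))  # False on first launch
record_seeded(conn)
print(seeded_before(conn))  # True on every launch afterwards
```

Keeping the flag in the same SQLite file as the index means deleting the database also resets the seed state.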
## Project Structure
```
sFetch/
├── backend/
│   ├── main.py
│   ├── crawler.py
│   ├── ollama_cloud.py
│   ├── top_sites.py
│   ├── content_filter.py
│   ├── indexer.py
│   ├── searcher.py
│   ├── models.py
│   ├── database.py
│   ├── config.py
│   └── requirements.txt
├── frontend/
│   ├── index.html
│   ├── ai.html
│   └── results.html
└── README.md
```
## Setup
1. Create a virtual environment and install the backend dependencies:

   ```bash
   cd backend
   python3 -m venv venv
   source venv/bin/activate
   pip install -r requirements.txt
   ```

2. Start the API:

   ```bash
   uvicorn main:app --reload
   ```

3. Open `frontend/index.html` in your browser.
The frontend uses `const API_BASE = "http://localhost:8000";` at the top of each page script.
## Ollama Cloud AI
sFetch reads Ollama Cloud credentials from environment variables. Do not hardcode API keys into source files.
```bash
export OLLAMA_API_KEY=your_api_key
export OLLAMA_DEFAULT_MODEL=gpt-oss:120b
```
AI features:

- `GET /ai/models` loads all models currently returned by Ollama Cloud's `/api/tags`.
- `POST /ai/search` generates an AI answer for search results using local indexed results and optional Ollama web search context.
- `POST /ai/search/stream` streams a search-grounded answer as server-sent events.
- `POST /ai/chat` powers the dedicated AI chat page at `frontend/ai.html`, with model selection and optional web search context.
- `POST /ai/chat/stream` streams chat responses as server-sent events.
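The streaming endpoints emit server-sent events, so a client accumulates the answer by reading `data:` lines. A minimal parser sketch — the payload shape (a JSON object with a `delta` text field, terminated by `data: [DONE]`) is an illustrative assumption, not the documented sFetch wire format:

```python
import json

def parse_sse(raw: str) -> str:
    """Join text deltas from a raw SSE response body into one answer string."""
    answer = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank separators and comment lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel
        answer.append(json.loads(payload)["delta"])
    return "".join(answer)

raw = 'data: {"delta": "Hello"}\n\ndata: {"delta": ", world"}\n\ndata: [DONE]\n'
print(parse_sse(raw))  # Hello, world
```

In the browser, the same accumulation would typically be done with `EventSource` or a streamed `fetch` reader.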
## Crawling
The home page has index controls for:
- seeding the top 1,000 non-adult sites
- launching a custom crawl with seed URLs, depth, per-domain page limits, and same-domain filtering
- viewing current index and seed status
You can also call the API directly:
```bash
curl -X POST "http://localhost:8000/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "seed_urls": ["https://example.com"],
    "max_depth": 2,
    "max_pages_per_domain": 50,
    "same_domain_only": true
  }'
```
Seed the top-site list manually:
```bash
curl -X POST "http://localhost:8000/crawl/top-sites"
```
The crawler:

- respects `robots.txt`
- filters adult URLs and adult-heavy page text
- stays on the same domain by default
- avoids revisiting URLs
- indexes HTML pages, images, and videos into SQLite
- records top-site seeding completion in `app_meta`
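The `robots.txt` check can be handled entirely by the standard library. A sketch using `urllib.robotparser` — the rules here are sample input, and `sFetchBot` is the agent name implied by the config table below:

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content; in the crawler this would be fetched per domain.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Public pages are allowed; disallowed paths must be skipped before fetching.
print(parser.can_fetch("sFetchBot", "https://example.com/index.html"))  # True
print(parser.can_fetch("sFetchBot", "https://example.com/private/x"))   # False
```

Caching one parsed `RobotFileParser` per domain avoids re-downloading `robots.txt` for every URL.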
## API Endpoints
| Method | Path | Purpose |
|---|---|---|
| GET | `/` | Health check |
| GET | `/search` | Full-text search endpoint |
| POST | `/crawl` | Start a custom background crawl job |
| POST | `/crawl/top-sites` | Queue the top-site seed crawl |
| GET | `/crawl/top-sites/status` | Check top-site seed state |
| GET | `/stats` | Total indexed pages and latest index time |
| GET | `/ai/config` | Check Ollama Cloud configuration |
| GET | `/ai/models` | List available Ollama Cloud models |
| POST | `/ai/search` | Generate an AI answer for a search query |
| POST | `/ai/search/stream` | Stream an AI answer for a search query |
| POST | `/ai/chat` | Generate an AI chat response |
| POST | `/ai/chat/stream` | Stream an AI chat response |
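The `/search` endpoint is backed by SQLite FTS5 (via `aiosqlite`). The core query can be sketched synchronously with the standard `sqlite3` module — the table and column names below are illustrative assumptions, not sFetch's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url, title, body)")
conn.execute(
    "INSERT INTO pages VALUES (?, ?, ?)",
    ("https://example.com", "Example Domain", "This domain is for use in examples."),
)

# bm25() ranks matches by relevance; snippet() returns a highlighted excerpt
# from column 2 (body), wrapped in <b>…</b> and capped at 8 tokens.
rows = conn.execute(
    "SELECT url, title, snippet(pages, 2, '<b>', '</b>', '…', 8) "
    "FROM pages WHERE pages MATCH ? ORDER BY bm25(pages)",
    ("examples",),
).fetchall()
print(rows[0][0])  # https://example.com
```

The async version is the same SQL with `aiosqlite`'s `await conn.execute(...)` in place of the blocking calls.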
## Configuration
sFetch's crawl and storage behavior lives in `backend/config.py`:
| Setting | Description |
|---|---|
| `MAX_CRAWL_DEPTH` | Default link depth followed from each seed URL |
| `MAX_PAGES_PER_DOMAIN` | Default per-domain crawl cap |
| `CRAWL_DELAY_SECONDS` | Delay between requests |
| `DEFAULT_CRAWL_CONCURRENCY` | Concurrent fetch limit |
| `DB_PATH` | SQLite database path |
| `TOP_SITE_SOURCE_URL` | Top-site list source |
| `TOP_SITE_SEED_LIMIT` | Number of safe top sites to seed |
| `USER_AGENT` | User agent sent by sFetchBot |
| `OLLAMA_API_BASE` | Ollama Cloud API base URL |
| `OLLAMA_API_KEY` | API key used for authenticated Ollama Cloud calls |
| `OLLAMA_DEFAULT_MODEL` | Default model selected in AI features |
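Since the Ollama credentials come from the environment, `backend/config.py` presumably reads them with `os.getenv` and falls back to defaults. A minimal sketch — the setting names match the table above, but every fallback value here is an assumption:

```python
import os

# Ollama Cloud settings, environment-driven; fallback values are illustrative.
OLLAMA_API_BASE = os.getenv("OLLAMA_API_BASE", "https://ollama.com")
OLLAMA_API_KEY = os.getenv("OLLAMA_API_KEY", "")
OLLAMA_DEFAULT_MODEL = os.getenv("OLLAMA_DEFAULT_MODEL", "gpt-oss:120b")

# Crawl behavior defaults (illustrative values, overridable per environment).
MAX_CRAWL_DEPTH = int(os.getenv("MAX_CRAWL_DEPTH", "2"))
MAX_PAGES_PER_DOMAIN = int(os.getenv("MAX_PAGES_PER_DOMAIN", "50"))
CRAWL_DELAY_SECONDS = float(os.getenv("CRAWL_DELAY_SECONDS", "1.0"))
```

This pattern keeps secrets out of source control while still allowing local overrides via `export`.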
## Tech Stack
| Layer | Technology |
|---|---|
| Frontend | HTML, TailwindCSS CDN, Vanilla JavaScript |
| Backend | Python, FastAPI |
| AI | Ollama Cloud API |
| Crawler | Python, httpx, BeautifulSoup4, asyncio |
| Search Index | SQLite FTS5 via aiosqlite |
| Top Sites | Tranco daily top-site ZIP with bundled fallback |