sFetch

sFetch is a full-stack search engine prototype: a browser-based search interface, a FastAPI search API, Ollama Cloud-powered AI answers, and an async crawler that indexes pages into a local SQLite FTS5 database.
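The FTS5 index at the heart of sFetch can be illustrated with a minimal sketch. The table and column names below are illustrative only, not sFetch's actual schema; it requires an SQLite build with FTS5 enabled (standard in most Python distributions):

```python
import sqlite3

# In-memory database; sFetch itself persists to a file (DB_PATH in config.py).
conn = sqlite3.connect(":memory:")

# An FTS5 virtual table over page URL, title, and body text.
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url, title, body)")
conn.executemany(
    "INSERT INTO pages (url, title, body) VALUES (?, ?, ?)",
    [
        ("https://example.com", "Example Domain", "This domain is for use in examples."),
        ("https://example.org/search", "Search tips", "How to phrase full-text queries."),
    ],
)

# MATCH runs a full-text query; bm25() ranks matches by relevance.
rows = conn.execute(
    "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY bm25(pages)",
    ("queries",),
).fetchall()
print(rows)  # [('https://example.org/search', 'Search tips')]
```
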

On first backend launch, sFetch downloads the latest Tranco top-site list, filters out pornographic/adult domains, and seeds up to 1,000 of the remaining sites; the step is skipped if seeding has already been recorded in the database.
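The "seed only once" behavior amounts to checking a completion marker before crawling. A minimal sketch, assuming a key/value app_meta table (the key name 'top_sites_seeded' is a hypothetical choice, not necessarily sFetch's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS app_meta (key TEXT PRIMARY KEY, value TEXT)")

def seed_needed(conn) -> bool:
    # Seeding runs only if no completion marker has been recorded yet.
    row = conn.execute(
        "SELECT value FROM app_meta WHERE key = 'top_sites_seeded'"
    ).fetchone()
    return row is None

def record_seeded(conn) -> None:
    # Record completion so later launches skip the seed step.
    conn.execute(
        "INSERT OR REPLACE INTO app_meta (key, value) VALUES ('top_sites_seeded', 'done')"
    )

first = seed_needed(conn)   # True: no marker yet, seeding would run
record_seeded(conn)
second = seed_needed(conn)  # False: marker present, seeding is skipped
```
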

Project Structure

sFetch/
├── backend/
│   ├── main.py
│   ├── crawler.py
│   ├── ollama_cloud.py
│   ├── top_sites.py
│   ├── content_filter.py
│   ├── indexer.py
│   ├── searcher.py
│   ├── models.py
│   ├── database.py
│   ├── config.py
│   └── requirements.txt
├── frontend/
│   ├── index.html
│   ├── ai.html
│   └── results.html
└── README.md

Setup

  1. Create a virtual environment and install the backend dependencies:

    cd backend
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    
  2. Start the API:

    uvicorn main:app --reload
    
  3. Open frontend/index.html in your browser.

Each page script declares const API_BASE = "http://localhost:8000"; at the top; change this value if the API runs on a different host or port.

Ollama Cloud AI

sFetch reads Ollama Cloud credentials from environment variables. Do not hardcode API keys into source files.

export OLLAMA_API_KEY=your_api_key
export OLLAMA_DEFAULT_MODEL=gpt-oss:120b

AI features:

  • GET /ai/models loads all models currently returned by Ollama Cloud's /api/tags.
  • POST /ai/search generates an AI answer for search results using local indexed results and optional Ollama web search context.
  • POST /ai/search/stream streams a search-grounded answer as server-sent events.
  • POST /ai/chat powers the dedicated AI chat page at frontend/ai.html, with model selection and optional web search context.
  • POST /ai/chat/stream streams chat responses as server-sent events.
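The /stream endpoints deliver answers as server-sent events, which a client reassembles from the data: lines of each event. A minimal parser sketch (the payload text and the [DONE] sentinel below are illustrative, not sFetch's exact wire format):

```python
def parse_sse(raw: str):
    """Yield the data payload of each event in a server-sent-event stream."""
    for block in raw.split("\n\n"):
        data_lines = [
            line[len("data:"):].lstrip()
            for line in block.splitlines()
            if line.startswith("data:")
        ]
        if data_lines:
            # Multi-line data fields within one event are joined with newlines.
            yield "\n".join(data_lines)

stream = "data: Hello\n\ndata: world\n\ndata: [DONE]\n\n"
tokens = list(parse_sse(stream))
print(tokens)  # ['Hello', 'world', '[DONE]']
```
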

Crawling

The home page has index controls for:

  • seeding the top 1,000 non-adult sites
  • launching a custom crawl with seed URLs, depth, per-domain page limits, and same-domain filtering
  • viewing current index and seed status

You can also call the API directly:

curl -X POST "http://localhost:8000/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "seed_urls": ["https://example.com"],
    "max_depth": 2,
    "max_pages_per_domain": 50,
    "same_domain_only": true
  }'

Seed the top-site list manually:

curl -X POST "http://localhost:8000/crawl/top-sites"

The crawler:

  • respects robots.txt
  • filters adult URLs and adult-heavy page text
  • stays on the same domain by default
  • avoids revisiting URLs
  • indexes HTML pages, images, and videos into SQLite
  • records top-site seeding completion in app_meta
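Two of these checks, robots.txt compliance and same-domain filtering, can be sketched with the standard library. This is a simplified illustration, not sFetch's crawler code:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# robots.txt check: parse rules, then ask whether a URL may be fetched.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
allowed = rp.can_fetch("sFetchBot", "https://example.com/page")          # True
blocked = rp.can_fetch("sFetchBot", "https://example.com/private/page")  # False

# Same-domain check: compare the host of a discovered link against the seed.
def same_domain(seed_url: str, link: str) -> bool:
    return urlparse(seed_url).netloc == urlparse(link).netloc

print(same_domain("https://example.com", "https://example.com/a"))  # True
print(same_domain("https://example.com", "https://other.org/a"))    # False
```
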

API Endpoints

Method  Path                     Purpose
GET     /                        Health check
GET     /search                  Full-text search endpoint
POST    /crawl                   Start a custom background crawl job
POST    /crawl/top-sites         Queue the top-site seed crawl
GET     /crawl/top-sites/status  Check top-site seed state
GET     /stats                   Total indexed pages and latest index time
GET     /ai/config               Check Ollama Cloud configuration
GET     /ai/models               List available Ollama Cloud models
POST    /ai/search               Generate an AI answer for a search query
POST    /ai/search/stream        Stream an AI answer for a search query
POST    /ai/chat                 Generate an AI chat response
POST    /ai/chat/stream          Stream an AI chat response

Configuration

sFetch's crawl, storage, and AI settings live in backend/config.py:

Setting                    Description
MAX_CRAWL_DEPTH            Default link depth followed from each seed URL
MAX_PAGES_PER_DOMAIN       Default per-domain crawl cap
CRAWL_DELAY_SECONDS        Delay applied before each request
DEFAULT_CRAWL_CONCURRENCY  Concurrent fetch limit
DB_PATH                    SQLite database path
TOP_SITE_SOURCE_URL        Top-site list source URL
TOP_SITE_SEED_LIMIT        Number of safe top sites to seed
USER_AGENT                 User-agent string the crawler sends (identifies sFetchBot)
OLLAMA_API_BASE            Ollama Cloud API base URL
OLLAMA_API_KEY             API key used for authenticated Ollama Cloud calls
OLLAMA_DEFAULT_MODEL       Default model selected in AI features
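Settings like these are typically read from the environment with fallbacks. A hedged sketch of how such a config module might look; the names mirror the table above, but the default values here are illustrative assumptions, not sFetch's real defaults:

```python
import os

# Illustrative defaults; sFetch's actual values live in backend/config.py.
MAX_CRAWL_DEPTH = int(os.environ.get("MAX_CRAWL_DEPTH", "2"))
MAX_PAGES_PER_DOMAIN = int(os.environ.get("MAX_PAGES_PER_DOMAIN", "50"))
DB_PATH = os.environ.get("DB_PATH", "sfetch.db")
OLLAMA_API_KEY = os.environ.get("OLLAMA_API_KEY", "")  # never hardcoded in source
OLLAMA_DEFAULT_MODEL = os.environ.get("OLLAMA_DEFAULT_MODEL", "gpt-oss:120b")
```
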

Tech Stack

Layer         Technology
Frontend      HTML, Tailwind CSS (CDN), Vanilla JavaScript
Backend       Python, FastAPI
AI            Ollama Cloud API
Crawler       Python, httpx, BeautifulSoup4, asyncio
Search Index  SQLite FTS5 via aiosqlite
Top Sites     Tranco daily top-site ZIP with bundled fallback