sFetch

sFetch is a full-stack search engine prototype: a browser-based search interface, a FastAPI search API, Ollama Cloud-powered AI answers, and an async crawler that indexes pages into a local SQLite FTS5 database.
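The FTS5 index at the heart of sFetch can be illustrated with a minimal sketch. The table and column names below are illustrative only, not sFetch's actual schema; it requires an SQLite build with FTS5 enabled (standard in most Python distributions):

```python
import sqlite3

# In-memory database; sFetch itself persists to a file (DB_PATH in config.py).
conn = sqlite3.connect(":memory:")

# An FTS5 virtual table over page URL, title, and body text.
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url, title, body)")
conn.executemany(
    "INSERT INTO pages (url, title, body) VALUES (?, ?, ?)",
    [
        ("https://example.com", "Example Domain", "This domain is for use in examples."),
        ("https://example.org/search", "Search tips", "How to phrase full-text queries."),
    ],
)

# MATCH runs a full-text query; bm25() ranks matches by relevance.
rows = conn.execute(
    "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY bm25(pages)",
    ("queries",),
).fetchall()
print(rows)  # [('https://example.org/search', 'Search tips')]
```
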

On first backend launch, sFetch downloads the latest Tranco top-site list, filters out pornographic/adult domains, and seeds up to 1,000 of the remaining sites; the step is skipped if seeding has already been recorded in the database.
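The "seed only once" behavior amounts to checking a completion marker before crawling. A minimal sketch, assuming a key/value app_meta table (the key name 'top_sites_seeded' is a hypothetical choice, not necessarily sFetch's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS app_meta (key TEXT PRIMARY KEY, value TEXT)")

def seed_needed(conn) -> bool:
    # Seeding runs only if no completion marker has been recorded yet.
    row = conn.execute(
        "SELECT value FROM app_meta WHERE key = 'top_sites_seeded'"
    ).fetchone()
    return row is None

def record_seeded(conn) -> None:
    # Record completion so later launches skip the seed step.
    conn.execute(
        "INSERT OR REPLACE INTO app_meta (key, value) VALUES ('top_sites_seeded', 'done')"
    )

first = seed_needed(conn)   # True: no marker yet, seeding would run
record_seeded(conn)
second = seed_needed(conn)  # False: marker present, seeding is skipped
```
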

Project Structure

sFetch/
├── backend/
│   ├── main.py
│   ├── crawler.py
│   ├── ollama_cloud.py
│   ├── top_sites.py
│   ├── content_filter.py
│   ├── indexer.py
│   ├── searcher.py
│   ├── models.py
│   ├── database.py
│   ├── config.py
│   └── requirements.txt
├── frontend/
│   ├── index.html
│   ├── ai.html
│   └── results.html
└── README.md

Setup

  1. Create a virtual environment and install the backend dependencies:

    cd backend
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    
  2. Start the API:

    uvicorn main:app --reload
    
  3. Open frontend/index.html in your browser.

Each page script declares const API_BASE = "http://localhost:8000"; at the top; change this value if the API runs on a different host or port.

Ollama Cloud AI

sFetch reads Ollama Cloud credentials from environment variables. Do not hardcode API keys into source files.

export OLLAMA_API_KEY=your_api_key
export OLLAMA_DEFAULT_MODEL=gpt-oss:120b

AI features:

  • GET /ai/models loads all models currently returned by Ollama Cloud's /api/tags.
  • POST /ai/search generates an AI answer for search results using local indexed results and optional Ollama web search context.
  • POST /ai/search/stream streams a search-grounded answer as server-sent events.
  • POST /ai/chat powers the dedicated AI chat page at frontend/ai.html, with model selection and optional web search context.
  • POST /ai/chat/stream streams chat responses as server-sent events.
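The /stream endpoints deliver answers as server-sent events, which a client reassembles from the data: lines of each event. A minimal parser sketch (the payload text and the [DONE] sentinel below are illustrative, not sFetch's exact wire format):

```python
def parse_sse(raw: str):
    """Yield the data payload of each event in a server-sent-event stream."""
    for block in raw.split("\n\n"):
        data_lines = [
            line[len("data:"):].lstrip()
            for line in block.splitlines()
            if line.startswith("data:")
        ]
        if data_lines:
            # Multi-line data fields within one event are joined with newlines.
            yield "\n".join(data_lines)

stream = "data: Hello\n\ndata: world\n\ndata: [DONE]\n\n"
tokens = list(parse_sse(stream))
print(tokens)  # ['Hello', 'world', '[DONE]']
```
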

Crawling

The home page has index controls for:

  • seeding the top 1,000 non-adult sites
  • launching a custom crawl with seed URLs, depth, per-domain page limits, and same-domain filtering
  • viewing current index and seed status

You can also call the API directly:

curl -X POST "http://localhost:8000/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "seed_urls": ["https://example.com"],
    "max_depth": 2,
    "max_pages_per_domain": 50,
    "same_domain_only": true
  }'

Seed the top-site list manually:

curl -X POST "http://localhost:8000/crawl/top-sites"

The crawler:

  • respects robots.txt
  • filters adult URLs and adult-heavy page text
  • stays on the same domain by default
  • avoids revisiting URLs
  • indexes HTML pages, images, and videos into SQLite
  • records top-site seeding completion in app_meta
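Two of these checks, robots.txt compliance and same-domain filtering, can be sketched with the standard library. This is a simplified illustration, not sFetch's crawler code:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# robots.txt check: parse rules, then ask whether a URL may be fetched.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
allowed = rp.can_fetch("sFetchBot", "https://example.com/page")          # True
blocked = rp.can_fetch("sFetchBot", "https://example.com/private/page")  # False

# Same-domain check: compare the host of a discovered link against the seed.
def same_domain(seed_url: str, link: str) -> bool:
    return urlparse(seed_url).netloc == urlparse(link).netloc

print(same_domain("https://example.com", "https://example.com/a"))  # True
print(same_domain("https://example.com", "https://other.org/a"))    # False
```
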

API Endpoints

Method  Path                     Purpose
GET     /                        Health check
GET     /search                  Full-text search endpoint
POST    /crawl                   Start a custom background crawl job
POST    /crawl/top-sites         Queue the top-site seed crawl
GET     /crawl/top-sites/status  Check top-site seed state
GET     /stats                   Total indexed pages and latest index time
GET     /ai/config               Check Ollama Cloud configuration
GET     /ai/models               List available Ollama Cloud models
POST    /ai/search               Generate an AI answer for a search query
POST    /ai/search/stream        Stream an AI answer for a search query
POST    /ai/chat                 Generate an AI chat response
POST    /ai/chat/stream          Stream an AI chat response

Configuration

sFetch's crawl, storage, and AI settings live in backend/config.py:

Setting                    Description
MAX_CRAWL_DEPTH            Default link depth followed from each seed URL
MAX_PAGES_PER_DOMAIN       Default per-domain crawl cap
CRAWL_DELAY_SECONDS        Delay applied before each request
DEFAULT_CRAWL_CONCURRENCY  Concurrent fetch limit
DB_PATH                    SQLite database path
TOP_SITE_SOURCE_URL        Top-site list source URL
TOP_SITE_SEED_LIMIT        Number of safe top sites to seed
USER_AGENT                 User-agent string the crawler sends (identifies sFetchBot)
OLLAMA_API_BASE            Ollama Cloud API base URL
OLLAMA_API_KEY             API key used for authenticated Ollama Cloud calls
OLLAMA_DEFAULT_MODEL       Default model selected in AI features
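Settings like these are typically read from the environment with fallbacks. A hedged sketch of how such a config module might look; the names mirror the table above, but the default values here are illustrative assumptions, not sFetch's real defaults:

```python
import os

# Illustrative defaults; sFetch's actual values live in backend/config.py.
MAX_CRAWL_DEPTH = int(os.environ.get("MAX_CRAWL_DEPTH", "2"))
MAX_PAGES_PER_DOMAIN = int(os.environ.get("MAX_PAGES_PER_DOMAIN", "50"))
DB_PATH = os.environ.get("DB_PATH", "sfetch.db")
OLLAMA_API_KEY = os.environ.get("OLLAMA_API_KEY", "")  # never hardcoded in source
OLLAMA_DEFAULT_MODEL = os.environ.get("OLLAMA_DEFAULT_MODEL", "gpt-oss:120b")
```
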

Tech Stack

Layer         Technology
Frontend      HTML, Tailwind CSS (CDN), Vanilla JavaScript
Backend       Python, FastAPI
AI            Ollama Cloud API
Crawler       Python, httpx, BeautifulSoup4, asyncio
Search Index  SQLite FTS5 via aiosqlite
Top Sites     Tranco daily top-site ZIP with bundled fallback