# sFetch

sFetch is a full-stack search engine prototype with a serious search interface, a FastAPI search API, Ollama Cloud-powered AI answers, and an async crawler that indexes pages into a local SQLite FTS5 database.

On first backend launch, sFetch downloads the latest Tranco top-site list, filters pornographic/adult domains, and seeds up to 1,000 non-adult sites if that seed has not already been recorded in the database.

## Project Structure

```text
sFetch/
├── backend/
│   ├── main.py
│   ├── crawler.py
│   ├── ollama_cloud.py
│   ├── top_sites.py
│   ├── content_filter.py
│   ├── indexer.py
│   ├── searcher.py
│   ├── models.py
│   ├── database.py
│   ├── config.py
│   └── requirements.txt
├── frontend/
│   ├── index.html
│   ├── ai.html
│   └── results.html
└── README.md
```

## Setup

1. Create a virtual environment and install the backend dependencies:

   ```bash
   cd backend
   python3 -m venv venv
   source venv/bin/activate
   pip install -r requirements.txt
   ```

2. Start the API:

   ```bash
   uvicorn main:app --reload
   ```

3. Open `frontend/index.html` in your browser. The frontend uses `const API_BASE = "http://localhost:8000";` at the top of each page script.

## Ollama Cloud AI

sFetch reads Ollama Cloud credentials from environment variables. Do not hardcode API keys into source files.

```bash
export OLLAMA_API_KEY=your_api_key
export OLLAMA_DEFAULT_MODEL=gpt-oss:120b
```

AI features:

- `GET /ai/models` loads all models currently returned by Ollama Cloud's `/api/tags`.
- `POST /ai/search` generates an AI answer for search results using local indexed results and optional Ollama web search context.
- `POST /ai/search/stream` streams a search-grounded answer as server-sent events.
- `POST /ai/chat` powers the dedicated AI chat page at `frontend/ai.html`, with model selection and optional web search context.
- `POST /ai/chat/stream` streams chat responses as server-sent events.
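The streaming endpoints deliver their output as server-sent events, so a client has to split the response body into `data:` lines and stitch the chunks back together. A minimal sketch of that parsing step in Python, assuming the simplest SSE framing (one `data:` line per event, events separated by blank lines) and a hypothetical `[DONE]` sentinel — the exact payload shape of sFetch's events is not specified here:

```python
def parse_sse(raw: str) -> list[str]:
    """Collect the data: payloads from a server-sent-events stream.

    Assumes one `data:` line per event, separated by blank lines —
    the simplest SSE framing, not necessarily sFetch's exact format.
    """
    chunks = []
    for line in raw.splitlines():
        if line.startswith("data:"):
            chunks.append(line[len("data:"):].strip())
    return chunks


# Example: reassemble streamed chunks into one answer, dropping a
# hypothetical end-of-stream sentinel.
stream = "data: Hello\n\ndata: world\n\ndata: [DONE]\n\n"
parts = [c for c in parse_sse(stream) if c != "[DONE]"]
print(" ".join(parts))  # → Hello world
```

A real client would feed this the chunked body of `POST /ai/chat/stream` (e.g. via `httpx.stream`) rather than a fixed string.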
## Crawling

The home page has index controls for:

- seeding the top 1,000 non-adult sites
- launching a custom crawl with seed URLs, depth, per-domain page limits, and same-domain filtering
- viewing current index and seed status

You can also call the API directly:

```bash
curl -X POST "http://localhost:8000/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "seed_urls": ["https://example.com"],
    "max_depth": 2,
    "max_pages_per_domain": 50,
    "same_domain_only": true
  }'
```

Seed the top-site list manually:

```bash
curl -X POST "http://localhost:8000/crawl/top-sites"
```

The crawler:

- respects `robots.txt`
- filters adult URLs and adult-heavy page text
- stays on the same domain by default
- avoids revisiting URLs
- indexes HTML pages, images, and videos into SQLite
- records top-site seeding completion in `app_meta`

## API Endpoints

| Method | Path | Purpose |
| --- | --- | --- |
| `GET` | `/` | Health check |
| `GET` | `/search` | Full-text search endpoint |
| `POST` | `/crawl` | Start a custom background crawl job |
| `POST` | `/crawl/top-sites` | Queue the top-site seed crawl |
| `GET` | `/crawl/top-sites/status` | Check top-site seed state |
| `GET` | `/stats` | Total indexed pages and latest index time |
| `GET` | `/ai/config` | Check Ollama Cloud configuration |
| `GET` | `/ai/models` | List available Ollama Cloud models |
| `POST` | `/ai/search` | Generate an AI answer for a search query |
| `POST` | `/ai/search/stream` | Stream an AI answer for a search query |
| `POST` | `/ai/chat` | Generate an AI chat response |
| `POST` | `/ai/chat/stream` | Stream an AI chat response |

## Configuration

sFetch's crawl and storage behavior lives in `backend/config.py`:

| Setting | Description |
| --- | --- |
| `MAX_CRAWL_DEPTH` | Default link depth followed from each seed URL |
| `MAX_PAGES_PER_DOMAIN` | Default per-domain crawl cap |
| `CRAWL_DELAY_SECONDS` | Delay before requests |
| `DEFAULT_CRAWL_CONCURRENCY` | Concurrent fetch limit |
| `DB_PATH` | SQLite database path |
| `TOP_SITE_SOURCE_URL` | Top-site list source |
| `TOP_SITE_SEED_LIMIT` | Number of safe top sites to seed |
| `USER_AGENT` | User agent sent by `sFetchBot` |
| `OLLAMA_API_BASE` | Ollama Cloud API base URL |
| `OLLAMA_API_KEY` | API key used for authenticated Ollama Cloud calls |
| `OLLAMA_DEFAULT_MODEL` | Default model selected in AI features |

## Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | HTML, TailwindCSS CDN, Vanilla JavaScript |
| Backend | Python, FastAPI |
| AI | Ollama Cloud API |
| Crawler | Python, `httpx`, `BeautifulSoup4`, `asyncio` |
| Search Index | SQLite FTS5 via `aiosqlite` |
| Top Sites | Tranco daily top-site ZIP with bundled fallback |
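The search layer is SQLite FTS5. sFetch accesses it through `aiosqlite`, but the matching behavior can be sketched with the synchronous stdlib driver; the table name and columns below are illustrative, not sFetch's actual schema:

```python
import sqlite3

# In-memory database for the sketch; sFetch stores its index at DB_PATH.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url, title, body)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("https://example.com", "Example Domain", "illustrative examples in documents"),
        ("https://example.org", "Another Page", "nothing relevant here"),
    ],
)

# MATCH runs a full-text query over all columns; bm25() ranks hits by
# relevance (lower score = better match).
rows = conn.execute(
    "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY bm25(pages)",
    ("examples",),
).fetchall()
print(rows)  # → [('https://example.com', 'Example Domain')]
```

Note that the default `unicode61` tokenizer does no stemming, so `examples` matches the body text but not the title word `Example` on its own.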