# sFetch
sFetch is a full-stack search engine prototype: a search UI served from static HTML pages, a FastAPI search API, Ollama Cloud-powered AI answers, and an async crawler that indexes pages into a local SQLite FTS5 database.

On first backend launch, sFetch downloads the latest Tranco top-site list, filters out pornographic/adult domains, and seeds up to 1,000 of the remaining sites, skipping the step if a completed seed run is already recorded in the database.
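
That seeding step amounts to a filtered walk down the ranked list. A minimal sketch (the keyword blocklist and function name here are illustrative, not the actual `content_filter.py` logic, which is presumably more thorough):

```python
# Illustrative blocklist; the real adult-domain filter is larger.
ADULT_KEYWORDS = {"porn", "xxx", "adult", "sex"}

def filter_seed_domains(ranked_domains, limit=1000):
    """Return up to `limit` domains whose names match no adult keyword."""
    seeds = []
    for domain in ranked_domains:
        if any(keyword in domain.lower() for keyword in ADULT_KEYWORDS):
            continue
        seeds.append(domain)
        if len(seeds) >= limit:
            break
    return seeds
```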
## Project Structure
```text
sFetch/
├── backend/
│   ├── main.py
│   ├── crawler.py
│   ├── ollama_cloud.py
│   ├── top_sites.py
│   ├── content_filter.py
│   ├── indexer.py
│   ├── searcher.py
│   ├── models.py
│   ├── database.py
│   ├── config.py
│   └── requirements.txt
├── frontend/
│   ├── index.html
│   ├── ai.html
│   └── results.html
└── README.md
```
## Setup
1. Create a virtual environment and install the backend dependencies:
```bash
cd backend
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
2. Start the API:
```bash
uvicorn main:app --reload
```
3. Open `frontend/index.html` in your browser.

The frontend talks to the API via `const API_BASE = "http://localhost:8000";` at the top of each page script; change it if the backend runs on a different host or port.
## Ollama Cloud AI
sFetch reads Ollama Cloud credentials from environment variables. Do not hardcode API keys into source files.
```bash
export OLLAMA_API_KEY=your_api_key
export OLLAMA_DEFAULT_MODEL=gpt-oss:120b
```
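A minimal sketch of how such environment-driven settings are typically read in Python (the helper name here is hypothetical; the actual logic lives in `backend/config.py`):

```python
import os

def load_ollama_config(env=None):
    """Read Ollama Cloud settings from environment variables, with defaults."""
    env = os.environ if env is None else env
    return {
        # Empty string when no key is exported; callers can treat that as "AI disabled".
        "api_key": env.get("OLLAMA_API_KEY", ""),
        "model": env.get("OLLAMA_DEFAULT_MODEL", "gpt-oss:120b"),
    }
```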
AI features:
- `GET /ai/models` loads all models currently returned by Ollama Cloud's `/api/tags`.
- `POST /ai/search` generates an AI answer for search results using local indexed results and optional Ollama web search context.
- `POST /ai/search/stream` streams a search-grounded answer as server-sent events.
- `POST /ai/chat` powers the dedicated AI chat page at `frontend/ai.html`, with model selection and optional web search context.
- `POST /ai/chat/stream` streams chat responses as server-sent events.
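
Both `/stream` endpoints speak the server-sent-events wire format, so a client only needs a small decoder. A minimal sketch (it assumes payloads arrive in standard `data:` lines; the exact payload shape sFetch streams is not specified here):

```python
def parse_sse_events(stream_text):
    """Split raw SSE text into a list of event payloads (joined `data:` lines)."""
    events, data_lines = [], []
    for line in stream_text.splitlines():
        if line.startswith("data:"):
            data_lines.append(line[len("data:"):].lstrip())
        elif line == "" and data_lines:
            # A blank line terminates an event.
            events.append("\n".join(data_lines))
            data_lines = []
    if data_lines:  # tolerate a missing trailing blank line
        events.append("\n".join(data_lines))
    return events
```

For example, `parse_sse_events("data: hi\n\n")` yields `["hi"]`.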
## Crawling
The home page has index controls for:
- seeding the top 1,000 non-adult sites
- launching a custom crawl with seed URLs, depth, per-domain page limits, and same-domain filtering
- viewing current index and seed status
You can also call the API directly:
```bash
curl -X POST "http://localhost:8000/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "seed_urls": ["https://example.com"],
    "max_depth": 2,
    "max_pages_per_domain": 50,
    "same_domain_only": true
  }'
```
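The `same_domain_only` flag corresponds to a host comparison along these lines (a simplified sketch; the crawler's actual check may normalize hosts, e.g. `www.` prefixes or subdomains, differently):

```python
from urllib.parse import urlparse

def is_same_domain(seed_url: str, candidate_url: str) -> bool:
    """True when both URLs share the same network location (host[:port])."""
    return urlparse(seed_url).netloc.lower() == urlparse(candidate_url).netloc.lower()
```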
Seed the top-site list manually:
```bash
curl -X POST "http://localhost:8000/crawl/top-sites"
```
The crawler:
- respects `robots.txt`
- filters adult URLs and adult-heavy page text
- stays on the same domain by default
- avoids revisiting URLs
- indexes HTML pages, images, and videos into SQLite
- records top-site seeding completion in `app_meta`
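
The `robots.txt` check can be handled entirely by the standard library's `urllib.robotparser`. A sketch using already-fetched rules (the rules below are sample data; in sFetch the crawler would fetch them per domain):

```python
import urllib.robotparser

def allowed_by_robots(robots_lines, user_agent, url):
    """Check a URL against robots.txt rules that have already been fetched."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)  # parse() accepts the file's lines directly
    return parser.can_fetch(user_agent, url)

sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]
```

With these sample rules, `allowed_by_robots(sample_rules, "sFetchBot", "https://example.com/private/x")` returns `False`.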
## API Endpoints
| Method | Path | Purpose |
| --- | --- | --- |
| `GET` | `/` | Health check |
| `GET` | `/search` | Full-text search endpoint |
| `POST` | `/crawl` | Start a custom background crawl job |
| `POST` | `/crawl/top-sites` | Queue the top-site seed crawl |
| `GET` | `/crawl/top-sites/status` | Check top-site seed state |
| `GET` | `/stats` | Total indexed pages and latest index time |
| `GET` | `/ai/config` | Check Ollama Cloud configuration |
| `GET` | `/ai/models` | List available Ollama Cloud models |
| `POST` | `/ai/search` | Generate an AI answer for a search query |
| `POST` | `/ai/search/stream` | Stream an AI answer for a search query |
| `POST` | `/ai/chat` | Generate an AI chat response |
| `POST` | `/ai/chat/stream` | Stream an AI chat response |
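
Under `/search`, queries hit an SQLite FTS5 index. The core pattern can be sketched with the stdlib `sqlite3` module (in-memory, synchronous, and with a made-up two-row corpus; sFetch itself uses `aiosqlite` and its own schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: every column is full-text indexed.
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url, title, body)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("https://example.com/a", "Async crawlers", "How async crawlers fetch pages"),
        ("https://example.com/b", "Databases", "SQLite is an embedded database"),
    ],
)

def search(query, limit=10):
    """Return (url, title) rows ranked by FTS5's built-in BM25 `rank`."""
    return conn.execute(
        "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
```

Note that FTS5 must be compiled into the SQLite build; CPython's bundled SQLite normally includes it.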
## Configuration
sFetch's crawl and storage behavior lives in `backend/config.py`:
| Setting | Description |
| --- | --- |
| `MAX_CRAWL_DEPTH` | Default link depth followed from each seed URL |
| `MAX_PAGES_PER_DOMAIN` | Default per-domain crawl cap |
| `CRAWL_DELAY_SECONDS` | Delay before each request |
| `DEFAULT_CRAWL_CONCURRENCY` | Concurrent fetch limit |
| `DB_PATH` | SQLite database path |
| `TOP_SITE_SOURCE_URL` | Top-site list source |
| `TOP_SITE_SEED_LIMIT` | Number of safe top sites to seed |
| `USER_AGENT` | User-agent string the crawler identifies itself with (`sFetchBot`) |
| `OLLAMA_API_BASE` | Ollama Cloud API base URL |
| `OLLAMA_API_KEY` | API key used for authenticated Ollama Cloud calls |
| `OLLAMA_DEFAULT_MODEL` | Default model selected in AI features |
## Tech Stack
| Layer | Technology |
| --- | --- |
| Frontend | HTML, TailwindCSS CDN, Vanilla JavaScript |
| Backend | Python, FastAPI |
| AI | Ollama Cloud API |
| Crawler | Python, `httpx`, `BeautifulSoup4`, `asyncio` |
| Search Index | SQLite FTS5 via `aiosqlite` |
| Top Sites | Tranco daily top-site ZIP with bundled fallback |