initial commit
@@ -0,0 +1,119 @@
# sFetch

sFetch is a full-stack search engine prototype with a lightweight Google/DDG-inspired frontend, a FastAPI search API, and an async crawler that indexes pages into a local SQLite FTS5 database.

On first backend launch, sFetch downloads the latest Tranco top-site list, filters out pornographic/adult domains, and seeds up to 1,000 non-adult sites, unless that seed has already been recorded in the database.
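
The seed-once behavior described above could be sketched roughly as follows. This is a minimal illustration, not the project's code: the `top_sites_seeded` key, the keyword list, and the helper names are assumptions, and the sketch records the seed as soon as the sites are selected, which may differ from when sFetch records it. Only the 1,000-site cap and the idea of recording the seed in the database come from this README.

```python
import sqlite3

SEED_LIMIT = 1000  # cap stated in this README
ADULT_KEYWORDS = ("porn", "xxx", "adult")  # hypothetical filter terms
SEED_FLAG = "top_sites_seeded"  # hypothetical app_meta key


def is_adult_domain(domain: str) -> bool:
    """Crude keyword check; a real content filter is likely more thorough."""
    lowered = domain.lower()
    return any(word in lowered for word in ADULT_KEYWORDS)


def pick_seed_sites(ranked_domains, limit=SEED_LIMIT):
    """Keep the first `limit` domains that pass the adult filter, in rank order."""
    safe = [d for d in ranked_domains if not is_adult_domain(d)]
    return safe[:limit]


def seed_once(conn: sqlite3.Connection, ranked_domains):
    """Return the domains to crawl, or [] if a seed was already recorded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS app_meta (key TEXT PRIMARY KEY, value TEXT)"
    )
    row = conn.execute(
        "SELECT value FROM app_meta WHERE key = ?", (SEED_FLAG,)
    ).fetchone()
    if row is not None:
        return []  # seed already recorded, so skip it on later launches
    seeds = pick_seed_sites(ranked_domains)
    conn.execute(
        "INSERT INTO app_meta (key, value) VALUES (?, ?)",
        (SEED_FLAG, str(len(seeds))),
    )
    conn.commit()
    return seeds
```

Calling `seed_once` a second time on the same database returns an empty list, which is the property that keeps a restart from re-queuing the whole top-site list.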

## Project Structure

```text
sFetch/
├── backend/
│   ├── main.py
│   ├── crawler.py
│   ├── top_sites.py
│   ├── content_filter.py
│   ├── indexer.py
│   ├── searcher.py
│   ├── models.py
│   ├── database.py
│   ├── config.py
│   └── requirements.txt
├── frontend/
│   ├── index.html
│   └── results.html
└── README.md
```

## Setup

1. Create a virtual environment and install the backend dependencies:

```bash
cd backend
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

2. Start the API from the `backend` directory, with the virtual environment still active:

```bash
uvicorn main:app --reload
```

3. Open `frontend/index.html` in your browser.

Each page script sets `const API_BASE = "http://localhost:8000";` at the top, so update that value if the API runs on a different host or port.

## Crawling

The home page has index controls for:

- seeding the top 1,000 non-adult sites
- launching a custom crawl with seed URLs, depth, per-domain page limits, and same-domain filtering
- viewing current index and seed status

You can also call the API directly:

```bash
curl -X POST "http://localhost:8000/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "seed_urls": ["https://example.com"],
    "max_depth": 2,
    "max_pages_per_domain": 50,
    "same_domain_only": true
  }'
```
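
The same request can be built from Python with only the standard library. This is a sketch rather than an official client: the field names mirror the `curl` example above, while `build_crawl_request` and `send` are hypothetical helper names, and the response format is not documented here, so it is returned as raw text.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # same base URL the frontend uses


def build_crawl_request(seed_urls, max_depth=2, max_pages_per_domain=50,
                        same_domain_only=True):
    """Build (but do not send) a POST /crawl request mirroring the curl example."""
    payload = {
        "seed_urls": list(seed_urls),
        "max_depth": max_depth,
        "max_pages_per_domain": max_pages_per_domain,
        "same_domain_only": same_domain_only,
    }
    return urllib.request.Request(
        f"{API_BASE}/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request; this needs the API from the Setup section to be running."""
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")
```

With the API running, `send(build_crawl_request(["https://example.com"]))` should queue the same crawl as the `curl` call.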

Seed the top-site list manually:

```bash
curl -X POST "http://localhost:8000/crawl/top-sites"
```

The crawler:

- respects `robots.txt`
- filters adult URLs and adult-heavy page text
- stays on the same domain by default
- avoids revisiting URLs
- indexes HTML pages, images, and videos into SQLite
- records top-site seeding completion in `app_meta`
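
Three of the rules above (the `robots.txt` check, same-domain filtering, and revisit avoidance) can be illustrated with the standard library alone. This is a sketch under assumptions, not the logic in `crawler.py`: `make_robots` and `should_visit` are hypothetical names, the `robots.txt` text is assumed to have been fetched already, and domains are compared by exact host match.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def make_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_visit(url, seed_url, robots, visited,
                 user_agent="sFetchBot", same_domain_only=True):
    """Apply the revisit, same-domain, and robots.txt checks in turn."""
    if url in visited:
        return False  # already crawled
    if same_domain_only and urlparse(url).netloc != urlparse(seed_url).netloc:
        return False  # link leaves the seed's domain
    if not robots.can_fetch(user_agent, url):
        return False  # disallowed by robots.txt
    visited.add(url)  # remember the URL so it is not fetched twice
    return True
```

A crawl loop would keep one `visited` set per job and call `should_visit` on every extracted link before queuing it.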

## API Endpoints

| Method | Path | Purpose |
| --- | --- | --- |
| `GET` | `/` | Health check |
| `GET` | `/search` | Full-text search endpoint |
| `POST` | `/crawl` | Start a custom background crawl job |
| `POST` | `/crawl/top-sites` | Queue the top-site seed crawl |
| `GET` | `/crawl/top-sites/status` | Check top-site seed state |
| `GET` | `/stats` | Total indexed pages and latest index time |

## Configuration

Crawl and storage settings live in `backend/config.py`:

| Setting | Description |
| --- | --- |
| `MAX_CRAWL_DEPTH` | Default link depth followed from each seed URL |
| `MAX_PAGES_PER_DOMAIN` | Default per-domain crawl cap |
| `CRAWL_DELAY_SECONDS` | Delay before each request, in seconds |
| `DEFAULT_CRAWL_CONCURRENCY` | Concurrent fetch limit |
| `DB_PATH` | SQLite database path |
| `TOP_SITE_SOURCE_URL` | Top-site list source |
| `TOP_SITE_SEED_LIMIT` | Number of safe top sites to seed |
| `USER_AGENT` | User-Agent string sent by the crawler (`sFetchBot`) |
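
To show how `CRAWL_DELAY_SECONDS` and `DEFAULT_CRAWL_CONCURRENCY` might interact, here is one common `asyncio` pattern: a semaphore caps in-flight fetches and each fetch sleeps before it runs. Whether sFetch combines the two settings this way is an assumption. The values below are illustrative, not the project's defaults, and `fetch_all` and its `fetch` argument are hypothetical.

```python
import asyncio

CRAWL_DELAY_SECONDS = 0.01     # illustrative value, not the project's default
DEFAULT_CRAWL_CONCURRENCY = 2  # illustrative value, not the project's default


async def fetch_all(urls, fetch, delay=CRAWL_DELAY_SECONDS,
                    concurrency=DEFAULT_CRAWL_CONCURRENCY):
    """Run `fetch(url)` for every URL with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def polite_fetch(url):
        async with semaphore:           # wait for a free slot
            await asyncio.sleep(delay)  # delay before the request
            return await fetch(url)

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(polite_fetch(u) for u in urls))
```

Here `fetch` can be any coroutine function that takes a URL, for example a thin wrapper around an `httpx.AsyncClient` request.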

## Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | HTML, TailwindCSS CDN, Vanilla JavaScript |
| Backend | Python, FastAPI |
| Crawler | Python, `httpx`, `BeautifulSoup4`, `asyncio` |
| Search Index | SQLite FTS5 via `aiosqlite` |
| Top Sites | Tranco daily top-site ZIP with bundled fallback |
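
For readers new to FTS5, the sketch below shows the kind of full-text query a search index like this relies on. It is not sFetch's schema: the `pages` table, its columns, and both function names are assumptions, and it uses the synchronous `sqlite3` module instead of `aiosqlite` to stay short. It also needs a SQLite build with FTS5 enabled, which most Python distributions include.

```python
import sqlite3


def build_index(pages):
    """Create an in-memory FTS5 table and insert (url, title, body) rows."""
    conn = sqlite3.connect(":memory:")
    # url is stored but not tokenized; title and body are searchable
    conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url UNINDEXED, title, body)")
    conn.executemany(
        "INSERT INTO pages (url, title, body) VALUES (?, ?, ?)", pages
    )
    return conn


def search(conn, query, limit=10):
    """Return matching URLs, best match first (lower bm25 score ranks higher)."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE pages MATCH ? ORDER BY bm25(pages) LIMIT ?",
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

Matching is case-insensitive with the default tokenizer, so a query for `pasta` also finds a page whose text contains `Pasta`.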