A production-ready website monitoring and archival system built on ArchiveBox. Automatically discovers, snapshots, and tracks changes across websites with full-text search, cryptographic verification, and distributed storage capabilities.
The watcher combines ArchiveBox’s archival capabilities with change detection, distributed storage, and cryptographic verification. Typical uses include compliance monitoring, research archival, competitive intelligence, and preserving important web content.
| Category | Technologies |
|---|---|
| Core | Python 3.10+, ArchiveBox, SQLite (FTS5) |
| Web | Flask, BeautifulSoup4, lxml, readability-lxml |
| Scheduling | APScheduler, Systemd timers |
| Crypto | PyNaCl, Cryptography, Merkle trees |
| Monitoring | Prometheus, Health checks |
| Storage | IPFS (optional), SQLite |
```bash
# Prerequisites: Install ArchiveBox
pip install archivebox && archivebox init

# Clone and install
git clone https://github.com/jayhemnani9910/webcrawler.git
cd webcrawler
pip install -r requirements.txt

# Add a site and run
python -m src.main add-site https://example.com
python -m src.main run

# Search archived content
python -m src.main search "query"

# Launch web UI
python -m src.main web  # http://localhost:5000
```
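For readers curious how full-text search over archived pages can work with SQLite FTS5, here is a minimal sketch. The table name `pages`, its columns, and both helper functions are hypothetical and do not reflect the schema in `src/db.py`; it also assumes the local SQLite build includes the FTS5 extension.

```python
import sqlite3


def build_index(pages: list[tuple[str, str]]) -> sqlite3.Connection:
    """Create an in-memory FTS5 index from (url, body) pairs."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table layout: url is stored but not tokenized.
    conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url UNINDEXED, body)")
    conn.executemany("INSERT INTO pages (url, body) VALUES (?, ?)", pages)
    return conn


def search(conn: sqlite3.Connection, query: str, limit: int = 10) -> list[str]:
    """Return matching URLs, best match first (FTS5 rank uses BM25)."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [url for (url,) in rows]
```

Passing the query as a bound parameter keeps user input out of the SQL text, although FTS5 still interprets its own query syntax (quotes, `AND`, `NEAR`) inside the string.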
```bash
# Build and start with Docker Compose
docker compose up -d --build

# Production with Prometheus + Grafana
docker compose -f docker-compose.prod.yml up -d

# Install the systemd service, then enable and start the timer
sudo ./install_service.sh
sudo systemctl enable website-watcher.timer
sudo systemctl start website-watcher.timer
```
```http
# Search endpoint
GET /api/search?q={query}

# Health check
GET /health

# Prometheus metrics
GET /metrics
```
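The search endpoint can be called from a script using only the standard library. This sketch assumes the API is served on the same address as the web UI (`http://localhost:5000`) and that it returns JSON; neither is confirmed here, and the helper names are made up for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the API shares the web UI's address.
BASE_URL = "http://localhost:5000"


def search_url(query: str, base_url: str = BASE_URL) -> str:
    """Build the /api/search URL with the query safely percent-encoded."""
    return f"{base_url}/api/search?{urlencode({'q': query})}"


def fetch_json(url: str, timeout: float = 5.0):
    """GET a URL and decode the body as JSON (response format is assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `fetch_json(search_url("climate report"))` would issue the request; `search_url` alone is useful for checking how a query will be encoded.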
```text
webcrawler/
├── src/
│   ├── main.py                  # CLI and scheduler
│   ├── crawler.py               # Discovery and orchestration
│   ├── archivebox_interface.py
│   ├── db.py                    # SQLite schema
│   ├── crypto.py                # Cryptographic utilities
│   ├── merkle.py                # Merkle tree implementation
│   └── ipfs_interface.py        # IPFS storage layer
├── systemd/                     # Service units
├── docker-compose.yml
└── scripts/backup_db.sh
```
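To give a feel for what a Merkle tree contributes to verification, the following sketch computes a Merkle root over a list of snapshot byte strings. It is an illustrative construction, not the code in `src/merkle.py`: the leaf and node prefixes and the choice to duplicate the last node on odd levels are assumptions made for this example.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> str:
    """Return the hex Merkle root of the given leaves (SHA-256)."""
    if not leaves:
        raise ValueError("need at least one leaf")
    # Distinct prefixes keep leaf hashes and interior hashes from colliding.
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Storing one root per crawl lets a later audit detect whether any archived snapshot was altered, since changing a single leaf changes the root. Duplicating the last node is a simple convention with a known caveat: a list ending in a repeated leaf can produce the same root as the shorter list, so a stricter scheme would handle odd levels differently.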
MIT