# Epstein Files Archive
An automatically processed, OCR'd, searchable archive of publicly released documents related to the Jeffrey Epstein case.
## About
This project automatically processes thousands of scanned document pages using AI-powered OCR to:
- Extract and preserve all text (printed and handwritten)
- Identify and index entities (people, organizations, locations, dates)
- Reconstruct multi-page documents from individual scans
- Provide a searchable web interface to explore the archive
**This is a public service project.** All documents are from public releases. This archive makes them more accessible and searchable.
## Features
- **Full OCR**: Extracts both printed and handwritten text from all documents
- **Entity Extraction**: Automatically identifies and indexes:
  - People mentioned
  - Organizations
  - Locations
  - Dates
  - Reference numbers
- **Entity Deduplication**: AI-powered merging of duplicate entities (e.g., "Epstein" → "Jeffrey Epstein")
- **AI Document Analysis**: Generates summaries, key topics, key people, and significance for each document
- **Document Reconstruction**: Groups scanned pages back into complete documents
- **Searchable Interface**: Browse by person, organization, location, date, or document type
- **Static Site**: Fast, lightweight, works anywhere
## Project Structure
```
.
├── process_images.py       # Python script to OCR images using AI
├── cleanup_failed.py       # Python script to clean up failed processing
├── deduplicate.py          # Python script to deduplicate entities
├── deduplicate_types.py    # Python script to deduplicate document types
├── analyze_documents.py    # Python script to generate AI summaries
├── requirements.txt        # Python dependencies
├── .env.example            # Example environment configuration
├── downloads/              # Place document images here
├── results/                # Extracted JSON data per document
├── processing_index.json   # Processing progress tracking (generated)
├── dedupe.json             # Entity deduplication mappings (generated)
├── dedupe_types.json       # Document type deduplication mappings (generated)
├── analyses.json           # AI document analyses (generated)
├── src/                    # 11ty source files for website
├── .eleventy.js            # Static site generator configuration
└── _site/                  # Generated static website (after build)
```
## Setup
### 1. Install Dependencies
**Python (for OCR processing):**
```bash
pip install -r requirements.txt
```
**Node.js (for website generation):**
```bash
npm install
```
### 2. Configure API
Copy `.env.example` to `.env` and configure your OpenAI-compatible API endpoint:
```bash
cp .env.example .env
# Edit .env with your API details
```
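For reference, an OpenAI-compatible configuration typically looks something like the following. The variable names here are illustrative; `.env.example` lists the actual keys the scripts read:
```bash
# Illustrative values only; copy the real keys from .env.example
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=your-api-key-here
OPENAI_MODEL=gpt-4o
```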
### 3. Process Documents
Place document images in the `downloads/` directory, then run:
```bash
python process_images.py
# Options:
#   --limit N     # Process only N images (for testing)
#   --workers N   # Number of parallel workers (default: 5)
#   --no-resume   # Process all files, ignore index
```
The script will:
- Process each image through the OCR API
- Extract text, entities, and metadata
- **Auto-fix broken JSON**: If the LLM returns invalid JSON, the script automatically sends the error back to the LLM along with the original image to get a corrected response (a minimal sketch of this retry loop follows below)
- Save results to `./results/{folder}/{imagename}.json`
- Track progress in `processing_index.json` (resume-friendly)
- Log failed files for later cleanup
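As an illustration of the auto-fix step, the retry loop might look like the following minimal sketch. It assumes the `openai` Python client; the prompt, model name, and function are illustrative, not the script's exact code:
```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment

def ocr_with_repair(image_b64: str, prompt: str, max_retries: int = 2) -> dict:
    """Ask the vision model for OCR JSON; on a parse error, feed the error back."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }]
    for _ in range(max_retries + 1):
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        raw = reply.choices[0].message.content
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            # Send the broken output and the parser error back, with the image still in context
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Your JSON failed to parse ({err}). Return corrected JSON only.",
            })
    raise ValueError("model never returned valid JSON")
```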
**If processing fails or you need to retry failed files:**
```bash
# Check for failures (dry run)
python cleanup_failed.py

# Remove failed files from processed list (so they can be retried)
python cleanup_failed.py --doit

# Also delete corrupt JSON files
python cleanup_failed.py --doit --delete-invalid-json
```
### 4. Deduplicate Entities (Optional but Recommended)
The LLM may extract the same entity with different spellings (e.g., "Epstein", "Jeffrey Epstein", "J. Epstein"). Run the deduplication script to merge these:
```bash
python deduplicate.py
# Options:
#   --batch-size N   # Process N entities per batch (default: 50)
#   --show-stats     # Show deduplication stats without processing
```
This will:
- Scan all JSON files in `./results/`
- Use AI to identify duplicate entities across people, organizations, and locations
- Create a `dedupe.json` mapping file
- The website build will automatically use this mapping
**Example dedupe.json:**
```json
{
  "people": {
    "Epstein": "Jeffrey Epstein",
    "J. Epstein": "Jeffrey Epstein",
    "Jeffrey Epstein": "Jeffrey Epstein"
  },
  "organizations": {...},
  "locations": {...}
}
```
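To illustrate how such a mapping gets applied, here is a minimal sketch. The actual logic lives in the 11ty build, and the `entities.people` field name is an assumption about the per-page result schema:
```python
import json
from pathlib import Path

mapping = json.loads(Path("dedupe.json").read_text())

def canonical(name: str, kind: str) -> str:
    """Map a raw entity name to its canonical form; unknown names pass through."""
    return mapping.get(kind, {}).get(name, name)

# Example: collapse each page's people list to canonical, de-duplicated names
for result_file in Path("results").rglob("*.json"):
    page = json.loads(result_file.read_text())
    entities = page.setdefault("entities", {})
    entities["people"] = sorted({canonical(p, "people") for p in entities.get("people", [])})
```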
**Deduplicate Document Types:**
The LLM may also extract document types with inconsistent formatting (e.g., "deposition", "Deposition", "DEPOSITION TRANSCRIPT"). Run the type deduplication script:
```bash
python deduplicate_types.py
```
This will:
- Collect all document types from `./results/`
- Use AI to merge similar types into canonical forms
- Create a `dedupe_types.json` mapping file
- The website build will automatically use this mapping
**Example dedupe_types.json:**
```json
{
  "stats": {
    "original_types": 45,
    "canonical_types": 12,
    "reduction_percentage": 73.3
  },
  "mappings": {
    "deposition": "Deposition",
    "DEPOSITION": "Deposition",
    "deposition transcript": "Deposition",
    "court filing": "Court Filing"
  }
}
```
### 5. Analyze Documents (Optional but Recommended)
Generate AI summaries and insights for each document:
```bash
python analyze_documents.py
# Options:
#   --limit N   # Analyze only N documents (for testing)
#   --force     # Re-analyze all documents (ignore existing)
```
This will:
- Group pages into documents (matching the website logic)
- Send each document's full text to the AI
- Generate summaries, key topics, key people, and significance analysis
- Save results to `analyses.json`
- Resume-friendly (skips already-analyzed documents)
**Example analysis output:**
```json
{
  "document_type": "deposition",
  "key_topics": ["Flight logs", "Private aircraft", "Passenger manifests"],
  "key_people": [
    {"name": "Jeffrey Epstein", "role": "Aircraft owner"}
  ],
  "significance": "Documents flight records showing passenger lists...",
  "summary": "This deposition contains testimony regarding..."
}
```
### 6. Generate Website
Build the static site from the processed data:
```bash
npm run build # Build static site to _site/
npm start # Development server with live reload
```
The build process will automatically:
- Apply deduplication if `dedupe.json` exists
- Load document analyses if `analyses.json` exists
- Generate a searchable analyses page
## How It Works
1. **Document Processing**: Images are sent to an AI vision model that extracts:
   - All text in reading order
   - Document metadata (page numbers, document numbers, dates)
   - Named entities (people, orgs, locations)
   - Text type annotations (printed, handwritten, stamps)
2. **Document Grouping**: Individual page scans are automatically grouped by document number and sorted by page number to reconstruct complete documents (a sketch of this grouping follows the list below)
3. **Static Site Generation**: 11ty processes the JSON data to create:
   - Index pages for all entities
   - Individual document pages with full text
   - Search and browse interfaces
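A sketch of the grouping in step 2. Field names such as `document_number` and `page_number` are assumptions about the per-page result schema, not confirmed from the code:
```python
import json
from collections import defaultdict
from pathlib import Path

documents = defaultdict(list)

for result_file in Path("results").rglob("*.json"):
    page = json.loads(result_file.read_text())
    # Normalize the assumed document_number field to tolerate LLM formatting quirks
    doc_id = str(page.get("document_number", result_file.stem)).strip().upper()
    documents[doc_id].append(page)

# Sort each document's pages into reading order
for doc_id, pages in documents.items():
    pages.sort(key=lambda p: int(p.get("page_number") or 0))
```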
## Performance
- Processes ~2,000 pages into ~400 multi-page documents
- Handles LLM inconsistencies in document number formatting
- Resume-friendly processing (skip already-processed files)
- Parallel processing with configurable workers
## Contributing
This is an open archive project. Contributions welcome:
- Report issues with OCR accuracy
- Suggest UI improvements
- Add additional document sources
- Improve entity extraction
## Deployment
The site is automatically deployed to GitHub Pages on every push to the main branch.
### GitHub Pages Setup
1. Push this repository to GitHub: `https://github.com/epstein-docs/epstein-docs.github.io`
2. Go to Settings → Pages
3. Source: GitHub Actions
4. The workflow will automatically build and deploy the site
The site will be available at: `https://epstein-docs.github.io/`
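The repository's workflow may differ in detail, but a minimal Pages deployment has roughly this shape (a sketch, not the project's actual workflow file):
```yaml
# .github/workflows/deploy.yml (illustrative sketch)
name: Deploy to GitHub Pages
on:
  push:
    branches: [main]
permissions:
  contents: read
  pages: write
  id-token: write
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run build
      - uses: actions/upload-pages-artifact@v3
        with:
          path: _site
  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - id: deployment
        uses: actions/deploy-pages@v4
```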
## Future: Relationship Graphs
Once entities are deduplicated, the next step is to visualize relationships between people, organizations, and locations. Potential approaches:
### Static Graph Generation
1. **Pre-generate graph data** during the build process:
   - Build a relationships JSON file showing connections (e.g., which people appear in the same documents)
   - Generate D3.js/vis.js compatible graph data
   - Include in static site for client-side rendering
2. **Graph types to consider**:
   - **Co-occurrence network**: People who appear together in documents
   - **Document timeline**: Documents plotted by date with entity connections
   - **Organization membership**: People connected to organizations
   - **Location network**: People and organizations connected by locations
3. **Implementation ideas**:
   - Use D3.js force-directed graph for interactive visualization
   - Use Cytoscape.js for more complex network analysis
   - Generate static SVG graphs for each major entity
   - Add graph pages to the 11ty build (e.g., `/graphs/people/`, `/graphs/timeline/`)
### Data Structure for Graphs
```json
{
  "nodes": [
    {"id": "Jeffrey Epstein", "type": "person", "doc_count": 250},
    {"id": "Ghislaine Maxwell", "type": "person", "doc_count": 180}
  ],
  "edges": [
    {"source": "Jeffrey Epstein", "target": "Ghislaine Maxwell", "weight": 85, "shared_docs": 85}
  ]
}
```
The deduplication step is essential for accurate relationship mapping: without it, "Epstein" and "Jeffrey Epstein" would appear as separate nodes.
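A sketch of how such graph data could be pre-generated from the per-page results, again assuming an `entities.people` field and the `dedupe.json` mapping described above:
```python
import json
from collections import Counter, defaultdict
from itertools import combinations
from pathlib import Path

dedupe = json.loads(Path("dedupe.json").read_text()).get("people", {})
doc_people = defaultdict(set)

# Collect the canonical set of people mentioned in each document
for result_file in Path("results").rglob("*.json"):
    page = json.loads(result_file.read_text())
    doc_id = str(page.get("document_number", result_file.stem))
    for name in page.get("entities", {}).get("people", []):
        doc_people[doc_id].add(dedupe.get(name, name))

# Count per-person document appearances and pairwise co-occurrences
doc_count = Counter()
edge_weight = Counter()
for people in doc_people.values():
    doc_count.update(people)
    edge_weight.update(combinations(sorted(people), 2))

graph = {
    "nodes": [{"id": p, "type": "person", "doc_count": n} for p, n in doc_count.items()],
    "edges": [
        {"source": a, "target": b, "weight": w, "shared_docs": w}
        for (a, b), w in edge_weight.items()
    ],
}
Path("graph.json").write_text(json.dumps(graph, indent=2))
```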
## Disclaimer
This is an independent archival project. Documents are sourced from public releases. The maintainers make no representations about completeness or accuracy of the archive.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
The code in this repository is open source and free to use. The documents themselves are public records.
**Repository**: https://github.com/epstein-docs/epstein-docs
## Support This Project
If you find this archive useful, consider supporting its maintenance and hosting:
**Bitcoin**: `bc1qmahlh5eql05w30cgf5taj3n23twmp0f5xcvnnz`