[PR #5834] feat(ocr): add OCR #5328


giteasync · 2025-10-01T16:42:27-05:00

📋 Pull Request Information

Original PR: https://github.com/TriliumNext/Trilium/pull/5834
Author: @perfectra1n
Created: 6/21/2025
Status: 🔄 Open

Base: main ← Head: feat/add-ocr-capabilities

📝 Commits (10+)

c4a0219 feat(ocr): add unit tests, resolve double sent headers, and fix the wonderful tesseract.js path issues
33a5492 fix(package): referenced wrong tesseract.js lol
864543e feat(ocr): drop confidence down a little bit
a4adc51 fix(unit): resolve typecheck errors
f135622 feat(unit): ocr unit tests almost pass
d20b3d8 feat(unit): ocr tests almost pass...
80a9182 feat(unit): ocr tests almost pass...
7868ebe fix(unit): also fix broken llm test
09196c0 fix(ocr): obviously don't need this migration file anymore
4b5e8d3 Update playwright.yml

📊 Changes

40 files changed (+4843 additions, -92 deletions)

View changed files

📝 .github/instructions/nx.instructions.md (+1 -1)
📝 .github/workflows/playwright.yml (+0 -1)
📝 apps/client/src/components/root_command_executor.ts (+13 -0)
📝 apps/client/src/services/content_renderer.ts (+40 -4)
📝 apps/client/src/stylesheets/style.css (+23 -0)
📝 apps/client/src/translations/en/translation.json (+31 -1)
📝 apps/client/src/widgets/buttons/note_actions.ts (+9 -0)
📝 apps/client/src/widgets/note_detail.ts (+5 -0)
📝 apps/client/src/widgets/type_widgets/options/images/images.ts (+332 -0)
➕ apps/client/src/widgets/type_widgets/read_only_ocr_text.ts (+215 -0)
📝 apps/client/src/widgets/view_widgets/list_or_grid_view.ts (+2 -1)
📝 apps/server/package.json (+6 -1)
📝 apps/server/src/assets/db/schema.sql (+2 -0)
📝 apps/server/src/becca/entities/bblob.ts (+4 -1)
📝 apps/server/src/migrations/migrations.ts (+19 -0)
📝 apps/server/src/routes/api/llm.spec.ts (+3 -36)
➕ apps/server/src/routes/api/ocr.spec.ts (+75 -0)
➕ apps/server/src/routes/api/ocr.ts (+612 -0)
📝 apps/server/src/routes/api/options.ts (+7 -1)
📝 apps/server/src/routes/routes.ts (+11 -0)

...and 20 more files

📄 Description

This PR integrates OCR capabilities by orchestrating interactions between a new client-side UI, a set of server-side API endpoints, a core OCR service, the Tesseract.js library, and the existing database schema.

Key Features:

OCR Service: A new OCRService class uses the Tesseract.js library to perform OCR on images.
API Endpoints: Several new API endpoints are added to manage OCR tasks:
- POST /api/ocr/process-note/{noteId}: Triggers OCR processing for a specific image note.
- POST /api/ocr/process-attachment/{attachmentId}: Triggers OCR for a specific image attachment.
- GET /api/ocr/search: Searches for text within the extracted OCR data.
- POST /api/ocr/batch-process: Initiates a batch job to process all images that haven't been OCR'd yet.
- GET /api/ocr/batch-progress: Retrieves the progress of the ongoing batch OCR job.
- GET /api/ocr/stats: Provides statistics on OCR'd files.
- DELETE /api/ocr/delete/{blobId}: Deletes the OCR data for a specific image.
Client-Side UI: The image options have been updated to include:
- Enabling/disabling OCR.
- Setting the OCR language.
- Configuring a minimum confidence threshold for OCR results.
- A "Batch OCR" button to trigger the processing of all images.
- A progress bar to monitor the batch OCR process.
Database Integration: The extracted OCR text is stored in the blobs table, in a new ocr_text column. This allows for efficient searching of image content.
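The endpoint list above can be summarised as a small typed request builder. This is only a sketch: the paths and HTTP methods are copied from the list, while the `ocrApi` object, the `OcrRequest` type, and the helper names are hypothetical and not part of the PR. The query parameter for the search endpoint is not stated in the description, so the sketch leaves it out.

```typescript
// Hypothetical helper, not taken from the PR: it only mirrors the endpoint list above.
type HttpMethod = "GET" | "POST" | "DELETE";

interface OcrRequest {
    method: HttpMethod;
    path: string;
}

const ocrApi = {
    // IDs are URL-encoded so unusual characters cannot alter the path.
    processNote: (noteId: string): OcrRequest => ({
        method: "POST",
        path: `/api/ocr/process-note/${encodeURIComponent(noteId)}`
    }),
    processAttachment: (attachmentId: string): OcrRequest => ({
        method: "POST",
        path: `/api/ocr/process-attachment/${encodeURIComponent(attachmentId)}`
    }),
    // The search query parameter name is not given in the description, so none is added here.
    search: (): OcrRequest => ({ method: "GET", path: "/api/ocr/search" }),
    batchProcess: (): OcrRequest => ({ method: "POST", path: "/api/ocr/batch-process" }),
    batchProgress: (): OcrRequest => ({ method: "GET", path: "/api/ocr/batch-progress" }),
    stats: (): OcrRequest => ({ method: "GET", path: "/api/ocr/stats" }),
    deleteResult: (blobId: string): OcrRequest => ({
        method: "DELETE",
        path: `/api/ocr/delete/${encodeURIComponent(blobId)}`
    })
};
```

For example, `ocrApi.processNote("abc123")` yields a `POST` request description for `/api/ocr/process-note/abc123`, which a caller could hand to whatever HTTP client it already uses.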

Implementation Details:

The OCRService is responsible for all OCR-related logic, including initialization of Tesseract.js, text extraction, and database interaction.
The service supports a variety of image formats, including JPEG, PNG, GIF, BMP, TIFF, and WEBP.
The client-side implementation in apps/client/src/widgets/type_widgets/options/images/images.ts provides an interface for managing OCR settings and initiating batch processing.
The API routes in apps/server/src/routes/api/ocr.ts expose the OCR functionality to the client.
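A format check of the kind described above might look like the following sketch. The MIME strings are the standard ones for the listed formats; the names `SUPPORTED_IMAGE_MIME_TYPES` and `isSupportedImageMime` are hypothetical, and the exact set the PR accepts may differ.

```typescript
// Assumption: standard MIME types for JPEG, PNG, GIF, BMP, TIFF and WEBP.
// The PR's actual list may be broader or narrower.
const SUPPORTED_IMAGE_MIME_TYPES = new Set<string>([
    "image/jpeg",
    "image/png",
    "image/gif",
    "image/bmp",
    "image/tiff",
    "image/webp"
]);

function isSupportedImageMime(mime: string | undefined): boolean {
    if (!mime) {
        return false;
    }
    // Drop any parameters (e.g. "; charset=binary") and normalise case before the lookup.
    const base = (mime.split(";")[0] ?? "").trim().toLowerCase();
    return SUPPORTED_IMAGE_MIME_TYPES.has(base);
}
```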

Data Storage and Schema

The extracted text from an image is stored directly in the database.
Implementation:
- A new column, ocr_text (of type TEXT), has been added to the existing blobs table. The blobs table stores the actual file content (the image itself), so this new column adds the extracted text alongside the binary data it was derived from.
- Writing: The OCRService.storeOCRResult() method is responsible for persistence. It executes the SQL command: UPDATE blobs SET ocr_text = ? WHERE blobId = ?.
- Reading/Checking: To avoid reprocessing, the OCRService.getStoredOCRResult() method checks if text already exists using: SELECT ocr_text FROM blobs WHERE blobId = ?.
- Searching: The core search functionality leverages this new column. The OCRService.searchOCRResults() method performs a LIKE query to find matches: SELECT blobId, ocr_text FROM blobs WHERE ocr_text LIKE ?.
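One detail worth illustrating for the LIKE-based search is how the bound parameter is built. The description does not say whether the PR wraps the term in wildcards or escapes user input, so the sketch below is an assumption on both counts: `buildLikePattern` is a hypothetical helper, and the `ESCAPE '\'` clause is an addition to the query quoted above, needed because SQLite's LIKE has no default escape character.

```typescript
// Hypothetical helper: turns user input into a "contains" pattern for LIKE.
// Backslash, % and _ are escaped so they match literally instead of acting as wildcards.
function buildLikePattern(searchText: string): string {
    const escaped = searchText.replace(/[\\%_]/g, (ch) => `\\${ch}`);
    return `%${escaped}%`;
}

// Assumption: the query from the description, extended with an ESCAPE clause
// so that the backslash escapes produced above are honoured.
const SEARCH_OCR_SQL =
    "SELECT blobId, ocr_text FROM blobs WHERE ocr_text LIKE ? ESCAPE '\\'";
```

Without escaping, a search for `100%` would match any text containing `100`, because the trailing `%` would be read as a wildcard.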

Core Logic: `OCRService` (`apps/server/src/services/ocr/ocr_service.ts`)

This class contains the primary business logic and orchestrates the entire OCR process.

How is it implemented?
- Initialization (initialize): The service does not initialize Tesseract on application startup. Instead, it is initialized on demand the first time an OCR operation is requested. It configures the paths for the Tesseract worker (worker-script/node/index.js) and the WebAssembly core (tesseract-core.wasm.js).
- Text Extraction (extractTextFromImage): This is the heart of the process. It takes a Buffer of image data, passes it to the Tesseract.js worker's recognize() method, and awaits the result. It then formats the output into a structured OCRResult object, converting Tesseract's confidence score from a 0-100 scale to a 0-1 decimal.
- Processing Logic (processNoteOCR, processAttachmentOCR): These methods act as controllers. They fetch the relevant note or attachment from the database using the becca service, verify its MIME type is a supported image format, and check if OCR text already exists in the blobs table. If all checks pass, they retrieve the image content via .getContent() and pass the resulting buffer to extractTextFromImage. Finally, they persist the result using storeOCRResult.
- Batch Processing (startBatchProcessing, processBatchInBackground):
  - When a batch process is started, the service first queries the database to get a count of all image notes and attachments that do not have existing OCR data.
  - It stores the progress in an in-memory object: this.batchProcessingState. This object tracks the total number of images, the number processed, and the start time. Using in-memory state is efficient for tracking the live progress of a single, ongoing task.
  - The actual processing (processBatchInBackground) runs asynchronously without blocking the main thread. It iterates through the unprocessed images, calls the appropriate processing method (processNoteOCR or processAttachmentOCR) for each, and increments the processed count in the batchProcessingState.
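The three ideas above (on-demand initialization, confidence rescaling, and in-memory batch state) can be sketched together as follows. This is not the PR's code: `OcrServiceSketch`, the `Recognizer` interface, and the field names are hypothetical, the recognizer is injected rather than created from Tesseract.js, and the batch takes an array of images instead of querying the database.

```typescript
// Hypothetical stand-in for an OCR worker; confidence is reported on a 0-100 scale.
interface Recognizer {
    recognize(image: Uint8Array): Promise<{ text: string; confidence: number }>;
}

interface OCRResult {
    text: string;
    confidence: number; // 0-1
}

interface BatchState {
    inProgress: boolean;
    total: number;
    processed: number;
    startTime?: Date;
}

// Rescale 0-100 to 0-1 and clamp out-of-range values.
function toUnitConfidence(percent: number): number {
    return Math.min(1, Math.max(0, percent / 100));
}

class OcrServiceSketch {
    private readonly createRecognizer: () => Promise<Recognizer>;
    private recognizerPromise: Promise<Recognizer> | null = null;
    private batchState: BatchState = { inProgress: false, total: 0, processed: 0 };

    constructor(createRecognizer: () => Promise<Recognizer>) {
        this.createRecognizer = createRecognizer;
    }

    // Lazy initialization: the recognizer is created on first use and then reused.
    private getRecognizer(): Promise<Recognizer> {
        if (!this.recognizerPromise) {
            this.recognizerPromise = this.createRecognizer();
        }
        return this.recognizerPromise;
    }

    async extractText(image: Uint8Array): Promise<OCRResult> {
        const recognizer = await this.getRecognizer();
        const raw = await recognizer.recognize(image);
        return { text: raw.text.trim(), confidence: toUnitConfidence(raw.confidence) };
    }

    // Records the totals, kicks off the work without awaiting it, and returns a snapshot.
    startBatch(images: Uint8Array[]): BatchState {
        if (this.batchState.inProgress) {
            throw new Error("Batch already in progress");
        }
        this.batchState = { inProgress: true, total: images.length, processed: 0, startTime: new Date() };
        void this.runBatch(images);
        return this.getBatchProgress();
    }

    getBatchProgress(): BatchState {
        return { ...this.batchState };
    }

    private async runBatch(images: Uint8Array[]): Promise<void> {
        try {
            for (const image of images) {
                try {
                    await this.extractText(image);
                } catch {
                    // In this sketch a failed image is skipped rather than aborting the batch.
                }
                this.batchState.processed++;
            }
        } finally {
            this.batchState.inProgress = false;
        }
    }
}
```

Because the state lives only in memory, it is lost if the server restarts mid-batch; a caller polling `getBatchProgress()` would then see the idle default state.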

Server API (`apps/server/src/routes/api/ocr.ts`)

This file acts as a thin routing layer, exposing the OCRService's functionality via HTTP endpoints.

How is it implemented?
- Each function (e.g., processNoteOCR, batchProcessOCR, getBatchProgress) corresponds to an API endpoint.
- Each handler performs initial request validation (e.g., checking for required parameters like noteId).
- It then calls the corresponding method in the ocrService (e.g., ocrService.startBatchProcessing()).
- Finally, it formats the response from the service into a JSON object and sends it back to the client with the appropriate HTTP status code.
- The getBatchProgress endpoint is particularly simple: it just calls ocrService.getBatchProgress() and returns the in-memory state object, allowing the client to poll for updates efficiently.
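The validate, delegate, and format pattern described above could be sketched as a framework-free function. Everything here is illustrative: `handleProcessNote`, `ApiResponse`, and `OcrServiceLike` are hypothetical names, the status codes and response fields are assumptions, and the service call is synchronous only to keep the sketch short (the real service methods are asynchronous).

```typescript
interface ApiResponse {
    status: number;
    body: Record<string, unknown>;
}

// Hypothetical slice of the service; returns null when the note does not exist.
interface OcrServiceLike {
    processNoteOCR(noteId: string): { text: string; confidence: number } | null;
}

function handleProcessNote(params: { noteId?: string }, service: OcrServiceLike): ApiResponse {
    // 1. Validate the request before touching the service.
    const noteId = params.noteId?.trim();
    if (!noteId) {
        return { status: 400, body: { success: false, message: "noteId is required" } };
    }

    try {
        // 2. Delegate the actual work to the service layer.
        const result = service.processNoteOCR(noteId);
        if (!result) {
            return { status: 404, body: { success: false, message: "Note not found" } };
        }
        // 3. Format the service result as a JSON-serialisable body.
        return { status: 200, body: { success: true, result } };
    } catch (e) {
        const message = e instanceof Error ? e.message : String(e);
        return { status: 500, body: { success: false, message } };
    }
}
```

Keeping the handler this thin means the OCR logic can be unit-tested against the service directly, with the route tests covering only validation and status codes.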

Client-Side UI (`apps/client/src/widgets/type_widgets/options/images/images.ts`)

This widget provides the user interface for interacting with the OCR features.

How is it implemented?
- It uses jQuery to manipulate the DOM, adding event listeners to checkboxes, dropdowns, and buttons.
- Starting a Batch Job (startBatchOcr): When the user clicks the "Start Batch OCR" button, this function is called. It first makes a POST request to the /api/ocr/batch-process endpoint to initiate the process on the server.
- Polling for Progress (pollBatchOcrProgress): Upon a successful response from the server, it begins polling. It calls itself recursively using setTimeout every second. In each call, it makes a GET request to /api/ocr/batch-progress.
- It uses the data from the polling response to update the UI in real-time, adjusting the width of the progress bar and updating the status text (e.g., "Processed 5 of 100 images").
- Once the polling response indicates inProgress: false, it stops the polling loop and displays a completion message.
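The polling loop described above can be sketched without any DOM code by passing the fetch and UI updates in as callbacks. The names `pollBatchProgress`, `progressPercent`, and `statusText` are hypothetical, as is the `BatchProgress` shape beyond the `inProgress` flag and the total and processed counts mentioned in the description.

```typescript
interface BatchProgress {
    inProgress: boolean;
    total: number;
    processed: number;
}

// Percentage for the progress bar width; an idle job with nothing to do counts as complete.
function progressPercent(p: BatchProgress): number {
    if (p.total <= 0) {
        return p.inProgress ? 0 : 100;
    }
    return Math.min(100, Math.round((p.processed / p.total) * 100));
}

function statusText(p: BatchProgress): string {
    return `Processed ${p.processed} of ${p.total} images`;
}

// Fetches progress once, updates the UI, and re-schedules itself until the job finishes.
function pollBatchProgress(
    fetchProgress: () => Promise<BatchProgress>,
    onUpdate: (p: BatchProgress) => void,
    onDone: () => void,
    onError: (err: unknown) => void,
    intervalMs: number = 1000
): void {
    fetchProgress()
        .then((progress) => {
            onUpdate(progress);
            if (progress.inProgress) {
                // Schedule the next poll only after this one has completed,
                // so slow responses never overlap.
                setTimeout(
                    () => pollBatchProgress(fetchProgress, onUpdate, onDone, onError, intervalMs),
                    intervalMs
                );
            } else {
                onDone();
            }
        })
        .catch(onError);
}
```

In the widget, `fetchProgress` would wrap the `GET /api/ocr/batch-progress` call, and `onUpdate` would set the progress bar width from `progressPercent` and the label from `statusText`.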

Data Flow (Mermaid Diagram)

[Diagram: Mermaid data-flow chart, rendered as an image in the original PR]

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


giteasync added the pull-request label 2025-10-01 16:42:27 -05:00


Reference: TriliumNext/Trilium#5328
