kraken

kraken is an open-source, turn-key automatic text recognition (ATR) system optimized for historical and non-Latin script material. Designed as a universal text recognizer for the humanities, it directly addresses the unique challenges of digitizing historical documents.

Highly adaptable and trainable, kraken excels in scenarios often overlooked by general-purpose ATR engines. It is particularly well-suited for the long tail of digitization, supporting low-resource languages and diverse, non-conventional scripts common in humanities research.

kraken offers two primary interfaces for its functionality:

  • A flexible and customizable command-line interface, intended for most users, that supports chainable workflows implementing the recognition pipeline.

  • A comprehensive API, designed for developers building custom workflows, integrating kraken into other projects, or requiring fine-grained control over the ATR process.
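As a rough illustration of what a chainable command-line workflow can look like, the sketch below assembles a single kraken invocation that runs layout analysis followed by recognition. The subcommand names (`segment`, `ocr`) and flags (`-i`, `-bl`, `-m`) are assumptions based on the kraken CLI and should be checked against `kraken --help` for the installed version; the file names and the helper functions are hypothetical.

```python
import shlex
import shutil
import subprocess


def build_kraken_command(image, output, model):
    """Return the argument list for a chained segment -> ocr run.

    The subcommands and flags are assumed from the kraken CLI;
    verify them with `kraken --help` before relying on this.
    """
    return [
        "kraken",
        "-i", image, output,  # one input image and its output file
        "segment", "-bl",     # layout analysis (baseline segmenter)
        "ocr", "-m", model,   # recognition with the given model file
    ]


def run_if_available(cmd):
    """Run the command only when a kraken executable is on PATH."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False)


# Hypothetical file names, for illustration only.
cmd = build_kraken_command("page.tif", "page.txt", "model.mlmodel")
print(shlex.join(cmd))
```

Building the argument list separately from running it makes the chain easy to inspect or log before anything is executed. Call `run_if_available(cmd)` to actually start the pipeline.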

This documentation is structured to guide you through kraken’s capabilities:

  • The Introduction to Automatic Text Recognition is a basic primer on the core concepts intended for readers without prior ATR or machine learning experience.

  • The Getting Started guide provides a concise introduction to installation and basic usage.

  • The User Guide offers detailed information on using the CLI tools for various tasks and an introduction to the Python API.

  • The Python API serves as a comprehensive reference for integrating kraken into your Python projects.

Features

kraken’s main features are:

  • Fully trainable layout analysis, reading order, and character recognition

  • Right-to-Left, BiDi, and Top-to-Bottom script support

  • ALTO, PageXML, abbyyXML, and hOCR output

  • Word bounding boxes and character cuts

  • Public repository of model files via HTRMoPo

  • Configurable recognition through a network specification language and a plugin system

Integrations

Through its API, kraken has also been integrated into various frontend applications and larger processing suites that offer graphical user interfaces or scaffolding for large-scale digitization:

  • eScriptorium: A web-based platform for annotating, transcribing, and training ATR models, tightly integrating kraken.

  • OCR4all: A comprehensive ATR framework designed for historical documents, offering a user-friendly interface for various ATR tasks.

  • OCR-D suite: A collection of tools for ATR-related tasks, aiming to build a full-stack ATR workflow for historical prints.

  • Arkindex by Teklia: A platform for large-scale document analysis and indexing with kraken support through a plugin.

Community & Support

kraken is an open-source project driven by community contributions. We warmly welcome feedback, pull requests, bug reports, and feature suggestions on our GitHub repository.

If you are looking for help, want to discuss features, or need support regarding integrations (particularly with eScriptorium), please join our community chat: eScriptorium Gitter Channel

License

kraken is provided under the terms and conditions of the Apache 2.0 License.

Funding

kraken is developed at Inria and the École Pratique des Hautes Études, Université PSL.

This project was funded in part by the European Union (ATRIUM, project number 101132163) and in part by the European Union (ERC, MiDRASH, project number 101071829). It was also partially funded through the RESILIENCE project, which is funded by the European Union’s Horizon 2020 Framework Programme for Research and Innovation.

This work was supported by French government funding managed by the Agence Nationale de la Recherche under the Programme d’Investissements d’Avenir, reference ANR-21-ESRE-0005 (Biblissima+).