Introduction to the Python API

kraken exposes a Python API for programmatic access to all of its functionality. This guide introduces the most important parts of that API.

High-Level API

The easiest way to use kraken programmatically is through the high-level API in the kraken.tasks module. This API provides a set of task-oriented classes for segmentation, recognition, and forced alignment.

Segmentation

To segment an image, you can use the SegmentationTaskModel class. It returns a Segmentation object containing the segmentation results.

The lines within the Segmentation object can be of two types, depending on the model used:

  • BaselineLine for models that output baselines and polygons.

  • BBoxLine for models that output bounding boxes.

from PIL import Image
from kraken.tasks import SegmentationTaskModel
from kraken.configs import SegmentationInferenceConfig

# Load the default segmentation model
model = SegmentationTaskModel.load_model()

im = Image.open('image.png')

config = SegmentationInferenceConfig()

segmentation = model.predict(im, config)

for line in segmentation.lines:
    print(line.baseline)
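
The baseline of a BaselineLine is a sequence of (x, y) points, as in the forced alignment example later in this guide. As a minimal sketch under that assumption, the hypothetical helper baseline_length below (not part of kraken) computes the length of such a polyline in pixels, which can be useful for filtering out very short lines:

```python
import math


def baseline_length(baseline):
    """Sum the Euclidean lengths of all segments of a polyline.

    `baseline` is assumed to be a sequence of (x, y) points.
    """
    return sum(math.dist(a, b) for a, b in zip(baseline, baseline[1:]))


# A straight two-point baseline, as used elsewhere in this guide
print(baseline_length([(0, 0), (100, 0)]))  # 100.0
```

With a real segmentation, this could be applied as baseline_length(line.baseline) for each line that carries a baseline.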

Recognition

To recognize the text in an image, you can use the RecognitionTaskModel class. This class takes a Segmentation object, a PIL image, and a configuration as inputs and returns an iterator of ocr_record objects.

Similar to segmentation, the returned records can be of two types:

  • BaselineOCRRecord for baseline segmentations.

  • BBoxOCRRecord for bounding box segmentations.

from PIL import Image
from kraken.tasks import RecognitionTaskModel
from kraken.configs import RecognitionInferenceConfig

# Load a recognition model
model = RecognitionTaskModel.load_model('model.safetensors')

im = Image.open('image.png')

config = RecognitionInferenceConfig()

# segmentation is a Segmentation object created by loading an XML file or
# running segmentation manually.
for record in model.predict(im, segmentation, config):
    print(record.prediction)
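
Because the records arrive as an iterator, a common follow-up is to collect the recognized text of the whole page. The sketch below assumes only that each record exposes a prediction attribute, as in the loop above; page_text is a hypothetical helper and the SimpleNamespace objects are stand-ins for real records:

```python
from types import SimpleNamespace


def page_text(records):
    """Join the text of all recognized lines, one line per record."""
    return "\n".join(record.prediction for record in records)


# Stand-in records; with kraken these would come from model.predict(...)
records = [SimpleNamespace(prediction="Hello"), SimpleNamespace(prediction="World")]
print(page_text(records))
```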

Forced Alignment

Forced alignment is the process of aligning a given transcription to the output of a text recognition model, producing approximate character locations. This is a specialized operation outside the normal automatic text recognition (ATR) workflow; it can be used, for example, to produce word bounding boxes for a known-good transcription.

You can use the ForcedAlignmentTaskModel class to perform forced alignment:

from PIL import Image
from kraken.tasks import ForcedAlignmentTaskModel
from kraken.containers import Segmentation, BaselineLine
from kraken.configs import RecognitionInferenceConfig

# `model.safetensors` is a recognition model
model = ForcedAlignmentTaskModel.load_model('model.safetensors')
im = Image.open('image.png')
line = BaselineLine(baseline=[(0,0), (100,0)], boundary=[(0,-10), (100,-10), (100,10), (0,10)], text='Hello World')
segmentation = Segmentation(lines=[line])
config = RecognitionInferenceConfig()

aligned_segmentation = model.predict(im, segmentation, config)
record = aligned_segmentation.lines[0]
print(record.prediction)
print(record.cuts)
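
The word bounding box use case mentioned above can be sketched as follows. This is an illustration under stated assumptions, not part of kraken: it assumes that cuts holds one polygon (a sequence of (x, y) points) per character of the prediction, whitespace included, and word_boxes is a hypothetical helper:

```python
def word_boxes(text, cuts):
    """Group per-character polygons into (word, bounding box) pairs.

    Each bounding box is (x_min, y_min, x_max, y_max).
    """
    words = []
    chars, points = [], []

    def flush():
        # Close the current word, if any, and record its bounding box.
        if chars:
            xs = [x for x, _ in points]
            ys = [y for _, y in points]
            words.append(("".join(chars), (min(xs), min(ys), max(xs), max(ys))))
            chars.clear()
            points.clear()

    for char, polygon in zip(text, cuts):
        if char.isspace():
            flush()
        else:
            chars.append(char)
            points.extend(polygon)
    flush()
    return words


# Stand-in data: five characters, each 10 pixels wide and 20 pixels high
text = "Hi yo"
cuts = [[(i * 10, 0), (i * 10 + 10, 0), (i * 10 + 10, 20), (i * 10, 20)]
        for i in range(len(text))]
print(word_boxes(text, cuts))
```

With a real aligned record, the call would be word_boxes(record.prediction, record.cuts), provided the assumption about cuts holds for the model in use.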

Parsing XML

kraken can parse ALTO and PageXML files into Segmentation objects. This is useful for loading ground truth data or the results of other OCR engines. The XMLPage class handles this.

Note

The parser has been refactored in kraken 7.0 with changes to reading order parsing and robustness improvements. In particular, if the XML dimension field is invalid, kraken falls back to reading the source image to determine dimensions.

from kraken.lib.xml import XMLPage

xml_page = XMLPage('input.xml')
segmentation = xml_page.to_container()
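
The dimension fallback described in the note can be sketched with the standard library. This is an illustration, not kraken's parser: page_size is a hypothetical helper, the fallback callable stands in for reading the source image, and the sketch only inspects the WIDTH and HEIGHT attributes of an ALTO Page element:

```python
import xml.etree.ElementTree as ET


def page_size(xml_text, fallback):
    """Return (width, height) from an ALTO Page element.

    If the attributes are missing or invalid, call `fallback()` instead,
    which would typically open the source image and return its size.
    """
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Strip a possible XML namespace prefix such as '{...}Page'
        if element.tag.rsplit('}', 1)[-1] == 'Page':
            try:
                width = int(float(element.get('WIDTH', '')))
                height = int(float(element.get('HEIGHT', '')))
            except (ValueError, OverflowError):
                break
            if width > 0 and height > 0:
                return (width, height)
            break
    return fallback()


valid = '<alto><Layout><Page WIDTH="800" HEIGHT="600"/></Layout></alto>'
broken = '<alto><Layout><Page WIDTH="n/a" HEIGHT="600"/></Layout></alto>'
print(page_size(valid, lambda: (1, 1)))
print(page_size(broken, lambda: (1024, 768)))
```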

Serialization

After segmentation and recognition, you can serialize the results into various formats, such as ALTO or PageXML, with the kraken.serialization.serialize() function.

from kraken.serialization import serialize

# Assume `segmentation` is a Segmentation object from a previous step
# and `im` is the PIL image object.

# Serialize to ALTO
alto_xml = serialize(segmentation, image_size=im.size, template='alto')

with open('output.alto.xml', 'w', encoding='utf-8') as f:
    f.write(alto_xml)

# Serialize to PageXML
page_xml = serialize(segmentation, image_size=im.size, template='page')

with open('output.page.xml', 'w', encoding='utf-8') as f:
    f.write(page_xml)

Plugin System

kraken features a plugin system that allows developers to extend its functionality with new commands, model types, and tasks. This system is based on Python’s entry points mechanism and primarily targets PyTorch-based implementations.

To create a plugin, you need to:

  1. Create a new Python package that depends on kraken.

  2. In your package, create a class that implements the required interface.

  3. Register your class as an entry point in your package’s pyproject.toml or setup.cfg.
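
The registration in step 3 can be illustrated with a setup.cfg fragment. In the sketch below, my_plugin, MyModel, and the entry point name my_model are hypothetical placeholders; the snippet uses the standard library configparser only to show how such a declaration maps an entry point name to an importable object:

```python
import configparser

# Hypothetical setup.cfg fragment: the package, module, and class names
# are placeholders, not part of kraken.
SETUP_CFG = """
[options.entry_points]
kraken.models =
    my_model = my_plugin.models:MyModel
"""


def declared_entry_points(cfg_text, group):
    """Return {name: (module, attribute)} for one entry point group."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    raw = parser.get("options.entry_points", group, fallback="")
    result = {}
    for line in raw.splitlines():
        if "=" not in line:
            continue  # skip the empty first line of the multi-line value
        name, target = (part.strip() for part in line.split("=", 1))
        module, _, attribute = target.partition(":")
        result[name] = (module, attribute)
    return result


print(declared_entry_points(SETUP_CFG, "kraken.models"))
```

The same name-to-object mapping would be declared in pyproject.toml in a table for the corresponding entry point group.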

Entry Point Groups

kraken provides several entry point groups for different types of plugins:

  • kraken.cli: Adds new subcommands to the kraken command-line interface.

  • ketos.cli: Adds new subcommands to the ketos command-line interface.

  • kraken.models: Registers new model architectures.

  • kraken.lightning_modules: Registers new PyTorch Lightning modules for training and model conversion.

  • kraken.loaders: Registers new model loaders.

  • kraken.writers: Registers new model writers.

  • kraken.tasks: Registers new high-level tasks.

Model Plugins

The most common use case for plugins is to add new machine learning architectures for an already existing task type, such as defining a new segmentation method. This typically involves:

  1. Implementing a class that inherits from the requisite base model interface in kraken.models.base, such as RecognitionBaseModel for text recognition or SegmentationBaseModel for layout analysis.

  2. Registering this class in your plugin’s pyproject.toml or setup.cfg under the kraken.models entry point.

  3. Implementing a checkpoint container that provides a load_from_checkpoint method and registering it under the kraken.lightning_modules entry point. The easiest way to ensure correct behavior is to implement this class as a PyTorch Lightning LightningModule.

  4. Optionally, adding a training command to ketos by creating a click command and registering it under the ketos.cli entry point.

For a complete example of a layout analysis model plugin, refer to the dfine_kraken project, which implements a D-FINE based segmentation method.

Low-Level API

For more fine-grained control, you can use the low-level API in the kraken.lib module. This API provides direct access to the core components of kraken, such as the neural network models and the CTC decoders.
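
As background on what a CTC decoder does, the sketch below shows generic greedy CTC decoding in plain Python. It is not kraken's implementation: greedy_ctc_decode is a hypothetical helper, and the use of index 0 as the blank label is an assumption made for this example. The decoder picks the most probable label at each time step, collapses consecutive repeats, and drops blanks:

```python
def greedy_ctc_decode(probs, alphabet, blank=0):
    """Decode a sequence of per-time-step label probabilities.

    `probs` is a list of rows, one per time step; `alphabet[i]` is the
    character for label i (the entry at the blank index is unused).
    """
    best = [max(range(len(step)), key=step.__getitem__) for step in probs]
    decoded = []
    previous = blank
    for label in best:
        # Emit a label only when it is not blank and not a repeat
        if label != blank and label != previous:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)


alphabet = ["", "a", "b"]
probs = [
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.7, 0.2],    # a (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.1, 0.7],    # b
    [0.1, 0.2, 0.7],    # b (repeat, collapsed)
]
print(greedy_ctc_decode(probs, alphabet))  # ab
```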

For more information, please refer to the API Reference.