API Quickstart
Kraken provides routines that third-party tools can use to access all functionality of the OCR engine. Most functional blocks (binarization, segmentation, recognition, and serialization) are each encapsulated in one high-level method.
Simple use cases of the API, mostly useful for debugging, are contained in the contrib directory. In general it is recommended to look at this tutorial, these scripts, or the API reference. The command line drivers are unnecessarily complex for straightforward applications as they contain a lot of boilerplate to enable all use cases.
Basic Concepts
The fundamental modules of the API are similar to the command line drivers. Image inputs and outputs are generally Pillow objects and numerical outputs are numpy arrays.
Top-level modules implement high-level functionality while kraken.lib contains loaders and low-level methods that usually should not be used unless access to intermediate results is required.
Preprocessing and Segmentation
The primary preprocessing function is binarization, although depending on the setup of the pipeline and the models used it can be optional. For the non-trainable legacy bounding box segmenter binarization is mandatory, although it is still possible to feed color and grayscale images to the recognizer. The trainable baseline segmenter can work with black and white, grayscale, and color images, depending on the training data and network configuration; in almost all cases grayscale and color data are used.
>>> from PIL import Image
>>> from kraken import binarization
# can be any supported image format and mode
>>> im = Image.open('foo.png')
>>> bw_im = binarization.nlbin(im)
Legacy segmentation
The only required parameter of the legacy segmenter is a b/w image object. Some additional parameters exist, largely to change the principal text direction (important for column ordering and top-to-bottom scripts) and to explicitly mask non-text image regions:
>>> from kraken import pageseg
>>> seg = pageseg.segment(bw_im)
>>> seg
{'text_direction': 'horizontal-lr',
'boxes': [[0, 29, 232, 56],
[28, 54, 121, 84],
[9, 73, 92, 117],
[103, 76, 145, 131],
[7, 105, 119, 230],
[10, 228, 126, 345],
...
],
'script_detection': False}
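The entries of `boxes` can be consumed directly. The following sketch assumes that each box is an `[x0, y0, x1, y1]` list of pixel coordinates, which is an interpretation of the output above rather than something stated here. It works on a hand-written stand-in dictionary instead of a real segmentation result, and `box_size` is a hypothetical helper, not part of kraken.

```python
# Stand-in for the dictionary returned by pageseg.segment(); the values are
# copied from the example output above.
seg = {
    'text_direction': 'horizontal-lr',
    'boxes': [[0, 29, 232, 56], [28, 54, 121, 84], [9, 73, 92, 117]],
    'script_detection': False,
}


def box_size(box):
    """Return (width, height) of a box, ASSUMING the order [x0, y0, x1, y1]."""
    x0, y0, x1, y1 = box
    return x1 - x0, y1 - y0


sizes = [box_size(box) for box in seg['boxes']]
for box, (width, height) in zip(seg['boxes'], sizes):
    print(f'{box}: {width}x{height} px')
```

Under the same assumption, a line image could be cut out with Pillow via `bw_im.crop(tuple(box))`, since `Image.crop()` expects a `(left, upper, right, lower)` tuple.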
Baseline segmentation
The baseline segmentation method is based on a neural network that classifies image pixels into baselines and regions. Because it is trainable, a segmentation model is required in addition to the image to be segmented and it has to be loaded first:
>>> from kraken import blla
>>> from kraken.lib import vgsl
>>> model_path = 'path/to/model/file'
>>> model = vgsl.TorchVGSLModel.load_model(model_path)
A segmentation model contains a basic neural network and associated metadata defining the available line and region types, bounding regions, and an auxiliary baseline location flag for the polygonizer.
Afterwards the model can be fed into the segmentation method kraken.blla.segment() together with an image object:
>>> from kraken import blla
>>> from kraken import serialization
>>> baseline_seg = blla.segment(im, model=model)
>>> baseline_seg
{'text_direction': 'horizontal-lr',
'type': 'baselines',
'script_detection': False,
'lines': [{'script': 'default',
'baseline': [[471, 1408], [524, 1412], [509, 1397], [1161, 1412], [1195, 1412]],
'boundary': [[471, 1408], [491, 1408], [515, 1385], [562, 1388], [575, 1377], ... [473, 1410]]},
...],
'regions': {'$tip':[[[536, 1716], ... [522, 1708], [524, 1716], [536, 1716], ...]
'$par': ...
'$nop': ...}}
>>> alto = serialization.serialize_segmentation(baseline_seg, image_name=im.filename, image_size=im.size, template='alto')
>>> with open('segmentation_output.xml', 'w') as fp:
...     fp.write(alto)
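Downstream tools often need a rectangle rather than a polygon. As a minimal sketch, the axis-aligned bounding box of a line's `boundary` can be derived as shown below. The line dictionary is a shortened, hand-written stand-in and not real model output, and `polygon_bbox` is a hypothetical helper.

```python
# Stand-in for one entry of baseline_seg['lines']; the polygon is shortened
# and hand-written for illustration.
line = {
    'script': 'default',
    'baseline': [[471, 1408], [524, 1412], [1161, 1412], [1195, 1412]],
    'boundary': [[471, 1408], [491, 1388], [1195, 1385], [1200, 1420], [473, 1425]],
}


def polygon_bbox(points):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of a polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


bbox = polygon_bbox(line['boundary'])
print(bbox)
```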
Optional parameters are largely the same as for the legacy segmenter, i.e. text direction and masking.
Images are automatically converted into the proper mode for the model, except in the case of models trained on binary images, as there is a plethora of different binarization algorithms available, each with strengths and weaknesses. For most material the kraken-provided binarization should be sufficient, though. Automatic conversion does not mean that a segmentation model trained on RGB images will have equal accuracy for B/W, grayscale, and RGB inputs. Nevertheless the drop in quality will often be modest or non-existent for color models, while non-binarized inputs to a binary model will cause severe degradation (and a warning to that effect).
By default segmentation is performed on the CPU, although the neural network can be run on a GPU with the device argument. As the vast majority of the processing required is postprocessing, the performance gain will most likely be modest though.
The above API is the simplest way to perform a complete segmentation. The process consists of multiple steps such as pixel labelling, separate region and baseline vectorization, and bounding polygon calculation.
It is possible to only run a subset of the functionality depending on one’s needs by calling the respective functions in kraken.lib.segmentation. As part of the sub-library the API is not guaranteed to be stable but it generally does not change much. Examples of more fine-grained use of the segmentation API can be found in contrib/repolygonize.py and contrib/segmentation_overlay.py.
Recognition
Recognition itself is a multi-step process with a neural network producing a matrix with a confidence value for possible outputs at each time step. This matrix is decoded into a sequence of integer labels (label domain) which are subsequently mapped into Unicode code points using a codec. Labels and code points usually correspond one-to-one, i.e. each label is mapped to exactly one Unicode code point, but if desired more complex codecs can map single labels to multiple code points, multiple labels to single code points, or multiple labels to multiple code points (see the Codec section for further information).
As the customization of this two-stage decoding process is usually reserved for specialized use cases, sensible defaults are provided: codecs are part of the model file and do not have to be supplied manually; the preferred CTC decoder is an optional parameter of the recognition model object.
To perform text line recognition a neural network has to be loaded first. A kraken.lib.models.TorchSeqRecognizer is returned, which is a wrapper around the kraken.lib.vgsl.TorchVGSLModel class seen above for segmentation model loading.
>>> from kraken.lib import models
>>> rec_model_path = '/path/to/recognition/model'
>>> model = models.load_any(rec_model_path)
The sequence recognizer wrapper combines the neural network itself, a codec, metadata such as whether the input is supposed to be grayscale or binarized, and an instance of a CTC decoder that converts the raw output tensor of the network into a sequence of labels.
Afterwards, given an image, a segmentation, and the model, one can perform text recognition. The code is identical for both legacy and baseline segmentations. As with segmentation, input images are auto-converted to the correct color mode, except in the case of binary models, for which a warning is raised if the input is not binarized.
There are two methods for recognition, a basic single model call kraken.rpred.rpred() and a multi-model recognizer kraken.rpred.mm_rpred(). The latter is useful for recognizing multi-script documents, i.e. applying different models to different parts of a document.
>>> from kraken import rpred
# single model recognition
>>> pred_it = rpred.rpred(model, im, baseline_seg)
>>> for record in pred_it:
...     print(record)
The output isn’t just a sequence of characters but a kraken.rpred.ocr_record object containing the character prediction, cuts (approximate locations), and confidences.
>>> record.cuts
>>> record.prediction
>>> record.confidences
It is also possible to access the original line information:
# for baselines
>>> record.type
'baselines'
>>> record.line
>>> record.baseline
>>> record.script
# for box lines
>>> record.type
'box'
>>> record.line
>>> record.script
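The per-character confidences are handy for flagging doubtful output. The sketch below assumes that `confidences` holds one float between 0 and 1 per character of `prediction`. `Record` is a hypothetical stand-in exposing only these two attributes, and `low_confidence_chars` is a helper invented for this illustration.

```python
from collections import namedtuple
from statistics import mean

# Hypothetical stand-in with only the two attributes used here; real records
# come from the rpred iterator.
Record = namedtuple('Record', ['prediction', 'confidences'])

records = [
    Record('Hello', [0.99, 0.98, 0.97, 0.99, 0.96]),
    Record('w0rld', [0.91, 0.42, 0.88, 0.93, 0.90]),
]


def low_confidence_chars(record, threshold=0.5):
    """Return (index, character, confidence) for characters below threshold."""
    pairs = zip(record.prediction, record.confidences)
    return [(i, ch, conf) for i, (ch, conf) in enumerate(pairs) if conf < threshold]


for rec in records:
    print(rec.prediction, round(mean(rec.confidences), 3), low_confidence_chars(rec))
```

The threshold of 0.5 is arbitrary; a sensible value depends on the model and the material.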
Sometimes the undecoded raw output of the network is required. The \(C \times W\) softmax output matrix is accessible as the outputs attribute on the kraken.lib.models.TorchSeqRecognizer after each step of the kraken.rpred.rpred() iterator. To get a mapping from the label space \(C\) the network operates in to Unicode code points a codec is used. An arbitrary sequence of labels can generate an arbitrary number of Unicode code points although usually the relation is one-to-one.
>>> pred_it = rpred.rpred(model, im, baseline_seg)
>>> next(pred_it)
>>> model.outputs
>>> model.codec.l2c
{'\x01': ' ',
'\x02': '"',
'\x03': "'",
'\x04': '(',
'\x05': ')',
'\x06': '-',
'\x07': '/',
...
}
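The keys of `l2c` look like integer labels packed into strings. The sketch below assumes that a key is formed by applying `chr()` to each label, which is an inference from the `'\x01'`-style keys above and not a documented guarantee. `decode_labels` is a hypothetical helper, and the dictionary is a shortened stand-in with one invented two-label key to illustrate the many-to-many case.

```python
# Shortened stand-in for model.codec.l2c; the last entry is invented to show
# a key made of two labels.
l2c = {
    '\x01': ' ',
    '\x02': '"',
    '\x04': '(',
    '\x05': ')',
    '\x06\x07': '-/',
}


def decode_labels(labels, table):
    """Map integer labels to text by longest-match lookup in `table`.

    ASSUMPTION: keys are built with chr() per integer label. Labels that do
    not match any key are skipped.
    """
    max_len = max(len(key) for key in table)
    out, i = [], 0
    while i < len(labels):
        for size in range(min(max_len, len(labels) - i), 0, -1):
            key = ''.join(chr(label) for label in labels[i:i + size])
            if key in table:
                out.append(table[key])
                i += size
                break
        else:
            i += 1  # unknown label: skip it
    return ''.join(out)


print(repr(decode_labels([4, 1, 5], l2c)))
```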
There are several different ways to convert the output matrix to a sequence of labels that can be decoded into a character sequence. These are contained in kraken.lib.ctc_decoder with kraken.lib.ctc_decoder.greedy_decoder() being the default.
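To illustrate what greedy (best path) decoding does conceptually, here is a dependency-free sketch. It is not kraken's implementation: it assumes that label 0 is the CTC blank, takes the matrix as a list of C rows with W values each, and returns only the labels without positions or confidences. `greedy_decode` is a hypothetical name.

```python
def greedy_decode(matrix, blank=0):
    """Best-path decoding of a C x W matrix (rows: labels, columns: time steps).

    Picks the most probable label per time step, merges consecutive repeats,
    and drops the blank. ASSUMPTION: the blank has index 0.
    """
    num_labels = len(matrix)
    num_steps = len(matrix[0])
    # Most probable label at each time step.
    path = [max(range(num_labels), key=lambda c: matrix[c][t])
            for t in range(num_steps)]
    labels, previous = [], None
    for label in path:
        if label != previous and label != blank:
            labels.append(label)
        previous = label
    return labels


# 3 labels (0 = blank) over 6 time steps; each column sums to 1.
probs = [
    [0.1, 0.1, 0.8, 0.1, 0.7, 0.2],  # blank
    [0.8, 0.7, 0.1, 0.8, 0.2, 0.1],  # label 1
    [0.1, 0.2, 0.1, 0.1, 0.1, 0.7],  # label 2
]
print(greedy_decode(probs))
```

The blank between the two runs of label 1 is what keeps them from being merged into a single occurrence.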
XML Parsing
Sometimes it is desirable to take the data in an existing XML serialization format like PageXML or ALTO and apply an OCR function to it. The kraken.lib.xml module includes parsers that extract information into data structures which the functional blocks can process with minimal transformation:
>>> from kraken.lib import xml
>>> alto_doc = '/path/to/alto'
>>> xml.parse_alto(alto_doc)
{'image': '/path/to/image/file',
'type': 'baselines',
'lines': [{'baseline': [(24, 2017), (25, 2078)],
'boundary': [(69, 2016), (70, 2077), (20, 2078), (19, 2017)],
'text': '',
'script': 'default'},
{'baseline': [(79, 2016), (79, 2041)],
'boundary': [(124, 2016), (124, 2041), (74, 2041), (74, 2016)],
'text': '',
'script': 'default'}, ...],
'regions': {'Image/Drawing/Figure': [[(-5, 3398), (207, 3398), (207, 2000), (-5, 2000)],
[(253, 3292), (668, 3292), (668, 3455), (253, 3455)],
[(216, -4), (1015, -4), (1015, 534), (216, 534)]],
'Handwritten text': [[(2426, 3367), (2483, 3367), (2483, 3414), (2426, 3414)],
[(1824, 3437), (2072, 3437), (2072, 3514), (1824, 3514)]],
...}
}
>>> page_doc = '/path/to/page'
>>> xml.parse_page(page_doc)
{'image': '/path/to/image/file',
'type': 'baselines',
'lines': [{'baseline': [(24, 2017), (25, 2078)],
'boundary': [(69, 2016), (70, 2077), (20, 2078), (19, 2017)],
'text': '',
'script': 'default'},
{'baseline': [(79, 2016), (79, 2041)],
'boundary': [(124, 2016), (124, 2041), (74, 2041), (74, 2016)],
'text': '',
'script': 'default'}, ...],
'regions': {'Image/Drawing/Figure': [[(-5, 3398), (207, 3398), (207, 2000), (-5, 2000)],
[(253, 3292), (668, 3292), (668, 3455), (253, 3455)],
[(216, -4), (1015, -4), (1015, 534), (216, 534)]],
'Handwritten text': [[(2426, 3367), (2483, 3367), (2483, 3414), (2426, 3414)],
[(1824, 3437), (2072, 3437), (2072, 3514), (1824, 3514)]],
...}
}
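Because the parsers return plain dictionaries, simple checks need no further kraken calls. The sketch below works on a hand-written stand-in with the same shape as the output above. It counts regions per type and finds lines without a transcription.

```python
from collections import Counter

# Hand-written stand-in with the same shape as the parser output above.
doc = {
    'image': '/path/to/image/file',
    'type': 'baselines',
    'lines': [
        {'baseline': [(24, 2017), (25, 2078)],
         'boundary': [(69, 2016), (70, 2077), (20, 2078), (19, 2017)],
         'text': '', 'script': 'default'},
        {'baseline': [(79, 2016), (79, 2041)],
         'boundary': [(124, 2016), (124, 2041), (74, 2041), (74, 2016)],
         'text': '', 'script': 'default'},
    ],
    'regions': {
        'Image/Drawing/Figure': [[(253, 3292), (668, 3292), (668, 3455), (253, 3455)]],
        'Handwritten text': [[(2426, 3367), (2483, 3367), (2483, 3414), (2426, 3414)],
                             [(1824, 3437), (2072, 3437), (2072, 3514), (1824, 3514)]],
    },
}

# Number of polygons per region type.
regions_per_type = {name: len(polygons) for name, polygons in doc['regions'].items()}
# Lines whose 'text' field is empty, i.e. not yet transcribed.
untranscribed = [line for line in doc['lines'] if not line['text']]
scripts = Counter(line['script'] for line in doc['lines'])

print(regions_per_type)
print(len(untranscribed), 'of', len(doc['lines']), 'lines have no text')
print(dict(scripts))
```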
Serialization
The serialization module can be used to transform the ocr_records returned by the prediction iterator into a text-based (most often XML) format for archival. The module renders jinja2 templates in kraken/templates through the kraken.serialization.serialize() function.
>>> from kraken import serialization
>>> records = [record for record in pred_it]
>>> alto = serialization.serialize(records, image_name='path/to/image', image_size=im.size, template='alto')
>>> with open('output.xml', 'w') as fp:
...     fp.write(alto)
Training
Training is largely implemented with the pytorch lightning framework. There are separate LightningModules for recognition and segmentation training and a small wrapper around lightning’s Trainer class that mainly sets up model handling and verbosity options for the CLI.
>>> import glob
>>> from kraken.lib.train import RecognitionModel, KrakenTrainer
>>> ground_truth = glob.glob('training/*.xml')
>>> training_files = ground_truth[:250] # training data is shuffled internally
>>> evaluation_files = ground_truth[250:]
>>> model = RecognitionModel(training_data=training_files, evaluation_data=evaluation_files, format_type='xml', augment=True)
>>> trainer = KrakenTrainer()
>>> trainer.fit(model)
Likewise for a baseline and region segmentation model:
>>> import glob
>>> from kraken.lib.train import SegmentationModel, KrakenTrainer
>>> ground_truth = glob.glob('training/*.xml')
>>> training_files = ground_truth[:250] # training data is shuffled internally
>>> evaluation_files = ground_truth[250:]
>>> model = SegmentationModel(training_data=training_files, evaluation_data=evaluation_files, format_type='xml', augment=True)
>>> trainer = KrakenTrainer()
>>> trainer.fit(model)
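`glob.glob()` returns files in arbitrary order, so fixed slices like the ones above depend on the file system. A reproducible split can be sketched as follows. `split_ground_truth` is a hypothetical helper, not part of kraken, and the file names as well as the 10% evaluation share are made up for illustration.

```python
import random


def split_ground_truth(files, eval_fraction=0.1, seed=42):
    """Shuffle a copy of `files` reproducibly and split it into two lists.

    Returns (training_files, evaluation_files).
    """
    if not 0 < eval_fraction < 1:
        raise ValueError('eval_fraction must be between 0 and 1')
    shuffled = sorted(files)  # fixed starting order, independent of glob
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Hypothetical file names standing in for glob.glob('training/*.xml').
ground_truth = [f'training/page_{i:04d}.xml' for i in range(300)]
training_files, evaluation_files = split_ground_truth(ground_truth)
print(len(training_files), len(evaluation_files))
```

The two resulting lists can then be passed as `training_data` and `evaluation_data` exactly like the slices in the examples above.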
When the fit() method is called the dataset is initialized and the training commences. Both can take quite a bit of time. To get insight into what exactly is happening the standard lightning callbacks can be attached to the trainer object:
>>> import glob
>>> from pytorch_lightning.callbacks import Callback
>>> from kraken.lib.train import RecognitionModel, KrakenTrainer
>>> class MyPrintingCallback(Callback):
...     def on_init_start(self, trainer):
...         print("Starting to init trainer!")
...     def on_init_end(self, trainer):
...         print("trainer is init now")
...     def on_train_end(self, trainer, pl_module):
...         print("do something when training ends")
...
>>> ground_truth = glob.glob('training/*.xml')
>>> training_files = ground_truth[:250] # training data is shuffled internally
>>> evaluation_files = ground_truth[250:]
>>> model = RecognitionModel(training_data=training_files, evaluation_data=evaluation_files, format_type='xml', augment=True)
# callbacks have to be passed as instances, not classes
>>> trainer = KrakenTrainer(enable_progress_bar=False, callbacks=[MyPrintingCallback()])
>>> trainer.fit(model)
Starting to init trainer!
trainer is init now
This is only a small subset of the training functionality. It is suggested to have a closer look at the command line parameters for features such as transfer learning, region and baseline filtering, training continuation, and so on.