Warning: This document is for the development version of kraken. The latest version is 0.10.0.

kraken API

Kraken provides routines which are usable by third-party tools. In general you can expect functions in the kraken package to remain stable. We will try to keep them backward compatible, but as kraken is still in an early development stage and the API is still quite rudimentary, nothing can be guaranteed.

kraken.binarization module

kraken.binarization

An adaptive binarization algorithm.

kraken.binarization.is_bitonal(im)

Tests a PIL.Image for bitonality.

Parameters:im (PIL.Image) – Image to test
Returns:True if the image contains only two different color values. False otherwise.
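The test can be sketched in plain Python; `is_bitonal_sketch` is a hypothetical stand-in operating on a flat list of pixel values rather than a PIL.Image:

```python
def is_bitonal_sketch(pixels):
    """Return True if at most two distinct pixel values occur (e.g. 0 and 255)."""
    return len(set(pixels)) <= 2

# A binarized line: only black (0) and white (255) pixels.
print(is_bitonal_sketch([0, 255, 255, 0, 0]))  # True
# A grayscale line with intermediate values is not bitonal.
print(is_bitonal_sketch([0, 127, 255]))        # False
```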
kraken.binarization.nlbin(im, threshold=0.5, zoom=0.5, escale=1.0, border=0.1, perc=80, range=20, low=5, high=90)

Performs binarization using non-linear processing.

Parameters:
  • im (PIL.Image) –
  • threshold (float) –
  • zoom (float) – Zoom for background page estimation
  • escale (float) – Scale for estimating a mask over the text region
  • border (float) – Ignore this much of the border
  • perc (int) – Percentage for filters
  • range (int) – Range for filters
  • low (int) – Percentile for black estimation
  • high (int) – Percentile for white estimation
Returns:

PIL.Image containing the binarized image

Raises:

KrakenInputException when trying to binarize an empty image.
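The low/high percentile parameters drive the black/white point estimation before thresholding. A stdlib-only sketch of that idea (the names `percentile` and `normalize_sketch` are illustrative, not part of kraken, and the real algorithm estimates these points on local neighborhoods rather than the whole line):

```python
def percentile(values, pct):
    """Nearest-rank percentile of a list of pixel values."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(pct / 100 * (len(s) - 1))))
    return s[k]

def normalize_sketch(pixels, low=5, high=90, threshold=0.5):
    """Flatten a line to 0/1 using estimated black and white points."""
    lo = percentile(pixels, low)    # black estimate
    hi = percentile(pixels, high)   # white estimate
    span = max(hi - lo, 1)          # avoid division by zero on flat input
    return [1 if (p - lo) / span > threshold else 0 for p in pixels]

print(normalize_sketch([10, 12, 200, 205, 11, 210, 13, 198]))
# [0, 0, 1, 1, 0, 1, 0, 1]
```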

kraken.serialization module

kraken.serialization.serialize(records, image_name='', image_size=(0, 0), writing_mode='horizontal-tb', scripts=None, template='hocr')

Serializes a list of ocr_records into an output document.

Serializes a list of predictions and their corresponding positions by doing some hOCR-specific preprocessing and then renders them through one of several jinja2 templates.

Note: Empty records are ignored for serialization purposes.

Parameters:
  • records (iterable) – List of kraken.rpred.ocr_record
  • image_name (str) – Name of the source image
  • image_size (tuple) – Dimensions of the source image
  • writing_mode (str) – Sets the principal layout of lines and the direction in which blocks progress. Valid values are horizontal-tb, vertical-rl, and vertical-lr.
  • scripts (list) – List of scripts contained in the OCR records
  • template (str) – Selector for the serialization format. May be ‘hocr’ or ‘alto’.
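The real serializer renders through jinja2 templates; a minimal stdlib sketch of the same idea, rendering hypothetical (text, box) record pairs into hOCR-like spans and skipping empty records as noted above:

```python
from string import Template

# Hypothetical line template loosely modeled on hOCR ocr_line markup.
LINE = Template('<span class="ocr_line" title="bbox $x0 $y0 $x1 $y1">$text</span>')

def serialize_sketch(records):
    """records: iterable of (text, (x0, y0, x1, y1)) pairs; empty text is skipped."""
    lines = [LINE.substitute(text=t, x0=b[0], y0=b[1], x1=b[2], y1=b[3])
             for t, b in records if t]
    return '\n'.join(lines)

print(serialize_sketch([('word', (0, 0, 40, 12)), ('', (0, 14, 40, 26))]))
# <span class="ocr_line" title="bbox 0 0 40 12">word</span>
```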

kraken.pageseg module

kraken.pageseg

Layout analysis and script detection methods.

kraken.pageseg.segment(im, text_direction='horizontal-lr', scale=None, maxcolseps=2, black_colseps=False, no_hlines=True)

Segments a page into text lines.

Segments a page into text lines and returns the absolute coordinates of each line in reading order.

Parameters:
  • im (PIL.Image) – A bi-level page of mode ‘1’ or ‘L’
  • text_direction (str) – Principal direction of the text (horizontal-lr/rl/vertical-lr/rl)
  • scale (float) – Scale of the image
  • maxcolseps (int) – Maximum number of whitespace column separators
  • black_colseps (bool) – Whether column separators are assumed to be vertical black lines or not
  • no_hlines (bool) – Switch for horizontal line removal
Returns:

{‘text_direction’: ‘$dir’, ‘boxes’: [(x1, y1, x2, y2),…]}: A dictionary containing the text direction and a list of reading order sorted bounding boxes under the key ‘boxes’.

Return type:

dict

Raises:

KrakenInputException if the input image is not binarized or the text direction is invalid.
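Consuming the returned dictionary usually means walking the boxes in reading order. A sketch of such an ordering for the supported text directions (`reading_order_sketch` is illustrative; `segment` already returns its boxes sorted):

```python
def reading_order_sketch(boxes, text_direction='horizontal-lr'):
    """Sort (x0, y0, x1, y1) boxes top-to-bottom, then left-to-right,
    reversing the horizontal order for the -rl directions."""
    rl = text_direction.endswith('rl')
    return sorted(boxes, key=lambda b: (b[1], -b[0] if rl else b[0]))

boxes = [(10, 50, 90, 60), (10, 5, 90, 15), (50, 5, 130, 15)]
print(reading_order_sketch(boxes))
# [(10, 5, 90, 15), (50, 5, 130, 15), (10, 50, 90, 60)]
```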
kraken.pageseg.detect_scripts(im, bounds, model='/home/mittagessen/git/kraken/kraken/script.mlmodel', valid_scripts=None)

Detects scripts in a segmented page.

Classifies lines returned by the page segmenter into runs of scripts/writing systems.

Parameters:
  • im (PIL.Image) – A bi-level page of mode ‘1’ or ‘L’
  • bounds (dict) – A dictionary containing a ‘boxes’ entry with a list of coordinates (x0, y0, x1, y1) of a text line in the image and an entry ‘text_direction’ containing ‘horizontal-lr/rl/vertical-lr/rl’.
  • model (str) – Location of the script classification model or None for default.
  • valid_scripts (list) – List of valid scripts.
Returns:

{‘script_detection’: True, ‘text_direction’: ‘$dir’, ‘boxes’: [[(script, (x1, y1, x2, y2)),…]]}: A dictionary containing the text direction and a list of lists of reading order sorted bounding boxes under the key ‘boxes’, with each inner list containing the script segmentation of a single line. Script is an ISO 15924 4-character identifier.

Return type:

dict

Raises:

KrakenInvalidModelException if no clstm module is available.

kraken.rpred module

kraken.rpred

Generators for recognition on lines images.

class kraken.rpred.ocr_record(prediction, cuts, confidences)

Bases: object

A record object containing the recognition result of a single line.

kraken.rpred.bidi_record(record)

Reorders a record using the Unicode BiDi algorithm.

Models trained for RTL or mixed scripts still emit classes in LTR order requiring reordering for proper display.

Parameters:record (kraken.rpred.ocr_record) –
Returns:kraken.rpred.ocr_record
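The real function applies the full Unicode BiDi algorithm. As a toy illustration of why reordering matters, a purely RTL line can be reordered by reversing prediction, cuts, and confidences in lockstep (`bidi_record_sketch` is a hypothetical simplification; mixed-direction text needs the actual BiDi rules):

```python
def bidi_record_sketch(prediction, cuts, confidences):
    """Toy reordering for a purely RTL line: reverse all three sequences
    together so display order matches logical order."""
    return (prediction[::-1], cuts[::-1], confidences[::-1])

# An RTL model emitted 'cba' in LTR class order; reorder for display.
pred, cuts, confs = bidi_record_sketch('cba', [(0, 1), (1, 2), (2, 3)], [0.9, 0.8, 0.7])
```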
kraken.rpred.mm_rpred(nets, im, bounds, pad=16, bidi_reordering=True, script_ignore=None)

Multi-model version of kraken.rpred.rpred.

Takes a dictionary of ISO 15924 script identifiers->models and a script-annotated segmentation to dynamically select the appropriate model for each line.

Parameters:
  • nets (dict) – A dict mapping ISO 15924 identifiers to TorchSeqRecognizer objects. Recommended to be a defaultdict.
  • im (PIL.Image) – Image to extract text from
  • bounds (dict) – A dictionary containing a ‘boxes’ entry with a list of lists of coordinates (script, (x0, y0, x1, y1)) of a text line in the image and an entry ‘text_direction’ containing ‘horizontal-lr/rl/vertical-lr/rl’.
  • pad (int) – Extra blank padding to the left and right of text line
  • bidi_reordering (bool) – Reorder classes in the ocr_record according to the Unicode bidirectional algorithm for correct display.
  • script_ignore (list) – List of scripts to ignore during recognition
Yields:

An ocr_record containing the recognized text, absolute character positions, and confidence values for each character.

Raises:

KrakenInputException if the mapping between segmentation scripts and networks is incomplete.
kraken.rpred.rpred(network, im, bounds, pad=16, bidi_reordering=True)

Uses an RNN to recognize text.

Parameters:
  • network (kraken.lib.models.TorchSeqRecognizer) – A TorchSeqRecognizer object
  • im (PIL.Image) – Image to extract text from
  • bounds (dict) – A dictionary containing a ‘boxes’ entry with a list of coordinates (x0, y0, x1, y1) of a text line in the image and an entry ‘text_direction’ containing ‘horizontal-lr/rl/vertical-lr/rl’.
  • pad (int) – Extra blank padding to the left and right of text line. Auto-disabled when expected network inputs are incompatible with padding.
  • bidi_reordering (bool) – Reorder classes in the ocr_record according to the Unicode bidirectional algorithm for correct display.
Yields:

An ocr_record containing the recognized text, absolute character positions, and confidence values for each character.
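Records yielded by rpred carry per-character confidences, which callers often aggregate to filter unreliable lines. A small illustrative sketch (the helper name and the 0.9 threshold are assumptions, not kraken API):

```python
def mean_confidence(confidences):
    """Average per-character confidence of one recognized line."""
    return sum(confidences) / len(confidences) if confidences else 0.0

# Hypothetical per-character confidences from two recognized lines:
lines = [[0.99, 0.97, 0.98], [0.40, 0.55]]
keep = [c for c in lines if mean_confidence(c) >= 0.9]  # drops the weak line
```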

kraken.transcribe module

Utility functions for ground truth transcription.

kraken.linegen module

linegen

An advanced line generation tool using Pango for proper text shaping. The actual drawing code was adapted from the create_image utility from nototools available at [0].

Line degradation uses a local model described in [1].

[0] https://github.com/googlei18n/nototools
[1] Kanungo, Tapas, et al. “A statistical, nonparametric methodology for document degradation model validation.” IEEE Transactions on Pattern Analysis and Machine Intelligence 22.11 (2000): 1209-1223.

class kraken.linegen.LineGenerator(family='Sans', font_size=32, font_weight=400, language=None)

Bases: object

Produces degraded line images using a single collection of font families.

render_line(text)

Draws a line onto a Cairo surface which will be converted to a Pillow Image.

Parameters:

text (unicode) – A string which will be rendered as a single line.

Returns:

PIL.Image of mode ‘L’.

Raises:

KrakenCairoSurfaceException if the Cairo surface couldn’t be created (usually caused by invalid dimensions).
kraken.linegen.ocropy_degrade(im, distort=1.0, dsigma=20.0, eps=0.03, delta=0.3, degradations=(0.5, 0.0, 0.5, 0.0))

Degrades and distorts a line using the same noise model used by ocropus.

Parameters:
  • im (PIL.Image) – Input image
  • distort (float) –
  • dsigma (float) –
  • eps (float) –
  • delta (float) –
  • degradations (list) – list returning 4-tuples corresponding to the degradations argument of ocropus-linegen.
Returns:

PIL.Image in mode ‘L’

kraken.linegen.degrade_line(im, eta=0.0, alpha=1.5, beta=1.5, alpha_0=1.0, beta_0=1.0)

Degrades a line image by adding noise.

For parameter meanings consult [1].

Parameters:
  • im (PIL.Image) – Input image
  • eta (float) –
  • alpha (float) –
  • beta (float) –
  • alpha_0 (float) –
  • beta_0 (float) –
Returns:

PIL.Image in mode ‘1’
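The Kanungo model of [1] flips pixels with a probability that decays with distance from the nearest foreground/background boundary. A heavily simplified stdlib sketch using a uniform flip probability instead (`degrade_sketch`, and the use of `eta` as a base flip rate, are illustrative assumptions):

```python
import random

def degrade_sketch(pixels, eta=0.1, seed=42):
    """Flip each binary pixel with probability eta. Simplification of
    Kanungo noise: the real model makes the flip probability depend on
    the distance to the foreground/background boundary."""
    rng = random.Random(seed)  # seeded for reproducible degradation
    return [1 - p if rng.random() < eta else p for p in pixels]
```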

kraken.linegen.distort_line(im, distort=3.0, sigma=10, eps=0.03, delta=0.3)

Distorts a line image.

Run BEFORE degrade_line as a white border of 5 pixels will be added.

Parameters:
  • im (PIL.Image) – Input image
  • distort (float) –
  • sigma (float) –
  • eps (float) –
  • delta (float) –
Returns:

PIL.Image in mode ‘L’

kraken.lib.models module

kraken.lib.models

Wrapper around TorchVGSLModel including a variety of forward pass helpers for sequence classification.

class kraken.lib.models.TorchSeqRecognizer(nn, decoder=<function greedy_decoder>, train=False, device='cpu')

Bases: object

A class wrapping a TorchVGSLModel with a more comfortable recognition interface.

forward(line)

Performs a forward pass on a torch tensor of a line with shape (C, H, W) and returns a numpy array (W, C).

predict(line)

Performs a forward pass on a torch tensor of a line with shape (C, H, W) and returns the decoding as a list of tuples (string, start, end, confidence).

predict_labels(line)

Performs a forward pass on a torch tensor of a line with shape (C, H, W) and returns a list of tuples (class, start, end, max). Max is the maximum value of the softmax layer in the region.

predict_string(line)

Performs a forward pass on a torch tensor of a line with shape (C, H, W) and returns a string of the results.

to(device)

Moves model to device and automatically loads input tensors onto it.

kraken.lib.models.load_any(fname, train=False, device='cpu')

Loads anything that was, is, and will be a valid ocropus model and instantiates a shiny new kraken.lib.lstm.SeqRecognizer from the RNN configuration in the file.

Currently it recognizes the following kinds of models:

  • pyrnn models containing BIDILSTMs
  • protobuf models containing converted python BIDILSTMs
  • protobuf models containing CLSTM networks

Additionally an attribute ‘kind’ will be added to the SeqRecognizer containing a string representation of the source kind. Current known values are:

  • pyrnn for pickled BIDILSTMs
  • clstm for protobuf models generated by clstm
Parameters:
  • fname (str) – Path to the model
  • train (bool) – Enables gradient calculation and dropout layers in model.
  • device (str) – Target device
Returns:

A kraken.lib.models.TorchSeqRecognizer object.

kraken.lib.vgsl module

VGSL plumbing

class kraken.lib.vgsl.TorchVGSLModel(spec)

Bases: object

Class building a torch module from a VGSL spec.

The initialized class will contain a variable number of layers and a loss function. Inputs and outputs are always 4D tensors in order (batch, channels, height, width) with channels always being the feature dimension.

Importantly this means that a recurrent network will be fed the channel vector at each step along its time axis, i.e. either put the non-time-axis dimension into the channels dimension or use a summarizing RNN squashing the time axis to 1 and putting the output into the channels dimension respectively.

input

tuple – Expected input tensor as a 4-tuple.

nn

torch.nn.Sequential – Stack of layers parsed from the spec.

criterion

torch.nn.Module – Fully parametrized loss function.

add_codec(codec)

Adds a PytorchCodec to the model.

append(idx, spec)

Splits a model at layer idx and appends the layers described by spec.

New layers are initialized using the init_weights method.

Parameters:
  • idx (int) – Index of layer to append spec to starting with 1. To select the whole layer stack set idx to None.
  • spec (str) – VGSL spec without input block to append to model.
build_conv(input, block)

Builds a 2D convolution layer.

build_maxpool(input, block)

Builds a maxpool layer.

build_output(input, block)

Builds an output layer.

build_reshape(input, block)

Builds a reshape layer

build_rnn(input, block)

Builds an LSTM/GRU layer returning number of outputs and layer.

eval()

Sets the model to evaluation/inference mode, disabling dropout and gradient calculation.

get_layer_name(layer, name=None)

Generates a unique identifier for the layer optionally using a supplied name.

Parameters:
  • layer (str) – Identifier of the layer type
  • name (str) – User-supplied name wrapped in ‘{}’ braces which need removing.
Returns:

(str) network unique layer name

init_weights(idx=slice(0, None, None))

Initializes weights for all or a subset of layers in the graph.

LSTM/GRU layers are orthogonally initialized, convolutional layers uniformly from (-0.1,0.1).

Parameters:idx (slice) – A slice object representing the indices of layers to initialize.
classmethod load_clstm_model(path)

Loads a CLSTM model into VGSL.

classmethod load_model(path)

Deserializes a VGSL model from a CoreML file.

Parameters:path (str) – CoreML file
classmethod load_pronn_model(path)

Loads a pronn model into VGSL.

classmethod load_pyrnn_model(path)

Loads a pyrnn model into VGSL.

resize_output(output_size, del_indices=None)

Resizes an output linear projection layer.

Parameters:
  • output_size (int) – New size of the linear layer
  • del_indices (list) – list of outputs to delete from layer
save_model(path)

Serializes the model into path.

Parameters:path (str) – Target destination
static set_layer_name(layer, name)

Sets the name field of a VGSL layer definition.

Parameters:
  • layer (str) – VGSL definition
  • name (str) – Layer name
set_num_threads(num)

Sets number of OpenMP threads to use.

train()

Sets the model to training mode (enables dropout layers and disables softmax on CTC layers).

kraken.lib.codec

pytorch compatible codec with many-to-many mapping between labels and graphemes.

class kraken.lib.codec.PytorchCodec(charset)

Bases: object

Translates between labels and graphemes.

decode(labels)

Decodes a labelling.

Given a labelling with cuts and confidences returns a string with the cuts and confidences aggregated across label-code point correspondences. When decoding multilabels to code points the resulting cuts are min/max, confidences are averaged.

Parameters:labels (list) – Input containing tuples (label, start, end, confidence).
Returns:A list of tuples (code point, start, end, confidence)
Return type:list
encode(s)

Encodes a string into a sequence of labels.

Parameters:s (str) – Input unicode string
Returns:(torch.IntTensor) encoded label sequence
Raises:KrakenEncodeException if encoding fails.
max_label()

Returns the maximum label value.

merge(codec)

Transforms this codec (c1) into another (c2) reusing as many labels as possible.

The resulting codec is able to encode the same code point sequences while not necessarily having the same labels for them as c2. Retains matching character -> label mappings from both codecs, removes mappings not in c2, and adds mappings not in c1. Compound labels in c2 for code point sequences not in c1 that contain labels also in use in c1 are added as separate labels.

Parameters:codec (kraken.lib.codec.PytorchCodec) –
Returns:A merged codec and a list of labels that were removed from the original codec.
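The many-to-many idea can be sketched with plain dicts: each grapheme, possibly spanning several characters, maps to one integer label, with greedy longest-match encoding so multi-character graphemes win (`CodecSketch` is a toy, not the PytorchCodec implementation, and omits confidences and cuts):

```python
class CodecSketch:
    """Toy many-to-one codec: each grapheme (possibly multi-char, e.g. a
    ligature) gets an integer label; label 0 is reserved for the CTC blank."""
    def __init__(self, graphemes):
        self.g2l = {g: i + 1 for i, g in enumerate(sorted(graphemes))}
        self.l2g = {l: g for g, l in self.g2l.items()}

    def encode(self, s):
        # Greedy longest-match so multi-character graphemes take precedence.
        labels, i = [], 0
        while i < len(s):
            for g in sorted(self.g2l, key=len, reverse=True):
                if s.startswith(g, i):
                    labels.append(self.g2l[g])
                    i += len(g)
                    break
            else:
                raise KeyError('cannot encode %r' % s[i])
        return labels

    def decode(self, labels):
        return ''.join(self.l2g[l] for l in labels)

codec = CodecSketch(['a', 'b', 'ch'])   # 'ch' is one label, not two
print(codec.encode('chab'))             # [3, 1, 2]
```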

kraken.lib.train module

Training loop interception helpers

class kraken.lib.train.EarlyStopping(it=None, min_delta=0.002, lag=5)

Bases: object

Early stopping to terminate training when validation loss doesn’t improve over a certain time.

update(val_loss)

Updates the internal validation loss state

class kraken.lib.train.EpochStopping(it=None, epochs=100)

Bases: object

Dumb stopping after a fixed number of epochs.

update(val_loss)

No-Op for this stopper
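The early-stopping rule described above can be sketched as follows (`EarlyStoppingSketch` is illustrative; in particular, returning the continue/stop decision from update is an assumption about a convenient interface, not kraken's):

```python
class EarlyStoppingSketch:
    """Stop when validation loss has not improved by at least min_delta
    for `lag` consecutive updates."""
    def __init__(self, min_delta=0.002, lag=5):
        self.min_delta, self.lag = min_delta, lag
        self.best = float('inf')   # best validation loss seen so far
        self.wait = 0              # updates since last improvement
        self.trigger = False

    def update(self, val_loss):
        if self.best - val_loss >= self.min_delta:
            self.best, self.wait = val_loss, 0
        else:
            self.wait += 1
            if self.wait >= self.lag:
                self.trigger = True
        return not self.trigger    # False once training should stop
```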

kraken.lib.dataset module

Utility functions for data loading and training of VGSL networks.

class kraken.lib.dataset.GroundTruthDataset(split=<function GroundTruthDataset.<lambda>>, suffix='.gt.txt', normalization=None, reorder=True, im_transforms=None, preload=True)

Bases: torch.utils.data.dataset.Dataset

Dataset for ground truth used during training.

All data is cached in memory.

training_set

list – List of tuples (image, text) for training

test_set

list – List of tuples (image, text) for testing

alphabet

str – Sorted string of all code points found in the ground truth

add(image)

Adds a line-image-text pair to the dataset.

Parameters:image (str) – Input image path
encode(codec=None)

Adds a codec to the dataset and encodes all text lines.

Has to be run before sampling from the dataset.

kraken.lib.dataset.compute_error(model, test_set)

Computes detailed error report from a model and a list of line image-text pairs.

Parameters:
  • model (kraken.lib.models.TorchSeqRecognizer) – Model used for recognition
  • test_set (list) – List of tuples (image, text) to evaluate on
Returns:

A tuple with the total number of characters and the edit distance across the whole test set.

kraken.lib.dataset.generate_input_transforms(batch, height, width, channels, pad)

Generates a torchvision transformation converting a PIL.Image into a tensor usable in a network forward pass.

Parameters:
  • batch (int) – mini-batch size
  • height (int) – height of input image in pixels
  • width (int) – width of input image in pixels
  • channels (int) – color channels of input
  • pad (int) – Amount of padding on horizontal ends of image
Returns:

A torchvision transformation composition converting the input image to the appropriate tensor.

kraken.lib.ctc

kraken.lib.layers

Various pytorch layers compatible with the dimension ordering and inputs/outputs of VGSL-defined networks.

class kraken.lib.ctc.CTCCriterion

Bases: torch.nn.modules.module.Module

Connectionist Temporal Classification loss function.

This class performs the softmax operation (increases numerical stability) for you, so inputs should be unnormalized linear projections from an RNN. For the same reason, forward-backward computations are performed in log domain.

Shape:
  • Input: (S, N, C) where C is the number of classes, S the sequence length, and N the batch size.
  • Target: (N, l) where N is the number of label sequences l.
  • Output: scalar. If reduce is False, then (N) instead.
forward(input, targets)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

kraken.lib.ctc_decoder

Decoders for softmax outputs of CTC trained networks.

kraken.lib.ctc_decoder.beam_decoder(outputs, beam_size=3)

Translates back the network output to a label sequence using same-prefix-merge beam search decoding as described in [0].

[0] Hannun, Awni Y., et al. “First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs.” arXiv preprint arXiv:1408.2873 (2014).

Parameters:
  • outputs (numpy.array) – (C, W) shaped softmax output tensor
  • beam_size (int) – Width of the beam
Returns:A list with tuples (class, start, end, prob). prob is the probability of the class in the region.
kraken.lib.ctc_decoder.greedy_decoder(outputs)

Translates back the network output to a label sequence using greedy/best path decoding as described in [0].

[0] Graves, Alex, et al. “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks.” Proceedings of the 23rd international conference on Machine learning. ACM, 2006.

Parameters:outputs (numpy.array) – (C, W) shaped softmax output tensor
Returns:A list with tuples (class, start, end, max). max is the maximum value of the softmax layer in the region.
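Greedy decoding itself is easy to sketch on a plain list of per-timestep probability rows: take the argmax per step, merge repeated non-blank classes into runs, and drop the blank (class 0). `greedy_decode_sketch` is illustrative; the real decoder operates on a numpy array:

```python
def greedy_decode_sketch(outputs):
    """outputs: list of per-timestep probability rows (index = class,
    class 0 = blank). Returns [(class, start, end, max_prob), ...]."""
    decoded, prev = [], 0
    for t, row in enumerate(outputs):
        cls = max(range(len(row)), key=row.__getitem__)  # argmax per step
        if cls != 0 and cls == prev:
            # Extend the current run of the same class.
            c, s, _, m = decoded[-1]
            decoded[-1] = (c, s, t + 1, max(m, row[cls]))
        elif cls != 0:
            decoded.append((cls, t, t + 1, row[cls]))
        prev = cls
    return decoded

# Two emissions of class 1, separated by a blank step:
print(greedy_decode_sketch([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.1, 0.9]]))
# [(1, 0, 2, 0.9), (1, 3, 4, 0.9)]
```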
kraken.lib.ctc_decoder.blank_threshold_decoder(outputs, threshold=0.5)

Translates back the network output to a label sequence in the same way as the original ocropy/clstm implementations.

Thresholds on class 0, then assigns the maximum (non-zero) class to each region.

Parameters:
  • outputs (numpy.array) – (C, W) shaped softmax output tensor
  • threshold (float) – Threshold for 0 class when determining possible label locations.
Returns:

A list with tuples (class, start, end, max). max is the maximum value of the softmax layer in the region.