Training

This page describes in depth the training utilities available through the ketos command line utility. For a gentle introduction to model training, please refer to the tutorial.

There are currently three trainable components in the kraken processing pipeline:

  • Segmentation: finding lines and regions in images.

  • Reading Order: ordering lines found in the previous segmentation step. Reading order models are closely linked to segmentation models and both are usually trained on the same dataset.

  • Recognition: recognition models transform images of lines into text.

Depending on the use case it is not necessary to manually train new models for each material. The default segmentation model works well on quite a variety of handwritten and printed documents, a reading order model might not perform better than the default heuristic for simple text flows, and there are recognition models for some types of material available in the repository.

Best practices

Recognition model training

  • The default architecture works well for decently sized datasets.

  • Use precompiled binary datasets and put them in a place where they can be memory mapped during training (local storage, not NFS or similar).

  • Use the --logger flag to track your training metrics across experiments using Tensorboard.

  • If the network doesn’t converge before the early stopping aborts training, increase --min-epochs or --lag. Use the --logger option to inspect your training loss.

  • Use the flag --augment to activate data augmentation.

  • Increase the number of --workers to speed up data loading. This is essential when you use the --augment option.

  • When using an Nvidia GPU, set the --precision option to 16 to use automatic mixed precision (AMP). This can provide significant speedup without any loss in accuracy.

  • Use option -B to scale batch size until GPU utilization reaches 100%. When using a larger batch size, it is recommended to use option -r to scale the learning rate by the square root of the batch size (1e-3 * sqrt(batch_size)).

  • When fine-tuning, it is recommended to use the new mode rather than union, as the network will rapidly unlearn missing labels in the new dataset.

  • If the new dataset is fairly dissimilar or your base model has been pretrained with ketos pretrain, use --warmup in conjunction with --freeze-backbone for 1 or 2 epochs.

  • Upload your models to the model repository.
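
The square-root scaling rule from the batch size recommendation above can be worked out ahead of time. The following is a minimal sketch, assuming the base rate of 1e-3 quoted there; the helper name scaled_learning_rate is hypothetical and not part of ketos.

```python
import math


def scaled_learning_rate(batch_size: int, base_lr: float = 1e-3) -> float:
    """Scale the base learning rate by the square root of the batch size."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return base_lr * math.sqrt(batch_size)


# Print candidate values for -r for a few batch sizes passed to -B.
for batch_size in (1, 16, 64):
    print(f"-B {batch_size} -r {scaled_learning_rate(batch_size):.4g}")
```

The printed values are starting points only and may still need tuning for a given dataset.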

Segmentation model training

  • The segmenter is fairly robust when it comes to hyperparameter choice.

  • Start by finetuning from the default model for a fixed number of epochs (50 for reasonably sized datasets) with a cosine schedule.

  • Segmentation models’ performance is difficult to evaluate. Pixel accuracy doesn’t mean much because background pixels vastly outnumber the pixels that are part of a line or region. Frequency-weighted IoU is good for overall performance, while mean IoU overrepresents rare classes. The best way to evaluate segmentation models is to look at the output on unlabelled data.

  • If you don’t have rare classes you can use a fairly small validation set to make sure everything is converging and just visually validate on unlabelled data.
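
To make the difference between the metrics mentioned above concrete, the sketch below computes them from a confusion matrix using the usual textbook definitions. The function name and the toy numbers are hypothetical, and the exact way ketos computes its metrics may differ.

```python
def segmentation_metrics(confusion):
    """Compute pixel accuracy, per-class IoU, mean IoU and frequency-weighted IoU.

    confusion[i][j] is the number of pixels of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    truth = [sum(confusion[i]) for i in range(n)]                      # pixels per true class
    pred = [sum(confusion[i][j] for i in range(n)) for j in range(n)]  # pixels per predicted class
    correct = [confusion[i][i] for i in range(n)]
    ious = []
    for i in range(n):
        union = truth[i] + pred[i] - correct[i]
        ious.append(correct[i] / union if union else 0.0)
    pixel_accuracy = sum(correct) / total
    mean_iou = sum(ious) / n
    fw_iou = sum(truth[i] / total * ious[i] for i in range(n))
    return pixel_accuracy, ious, mean_iou, fw_iou


# Hypothetical page: class 0 is background, class 1 is a rare line class.
confusion = [[9800, 100],
             [50, 50]]
pixel_accuracy, ious, mean_iou, fw_iou = segmentation_metrics(confusion)
print(f"pixel accuracy {pixel_accuracy:.3f}, mean IoU {mean_iou:.3f}, fwIoU {fw_iou:.3f}")
```

In this toy case the rare class is segmented poorly, yet pixel accuracy and frequency-weighted IoU stay high while mean IoU drops sharply.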

Training data formats

The training tools accept a variety of training data formats, usually some kind of custom low level format, the XML-based formats that are commonly used for archival of annotation and transcription data, and in the case of recognizer training a precompiled binary format. It is recommended to use the XML formats for segmentation and reading order training and the binary format for recognition training.

ALTO

Kraken parses and produces files according to ALTO 4.3. An example showing the attributes necessary for segmentation, recognition, and reading order training follows:

<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns="http://www.loc.gov/standards/alto/ns-v4#"
	xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v4/alto-4-0.xsd">
	<Description>
		<sourceImageInformation>
			<fileName>filename.jpg</fileName><!-- relative path in relation to XML location of the image file-->
		</sourceImageInformation>
		....
	</Description>
	<Layout>
		<Page...>
			<PrintSpace...>
				<ComposedBlockType ID="block_I"
						   HPOS="125"
						   VPOS="523"
						   WIDTH="5234"
						   HEIGHT="4000"
						   TYPE="region_type"><!-- for textlines part of a semantic region -->
					<TextBlock ID="textblock_N">
						<TextLine ID="line_0"
							  HPOS="..."
							  VPOS="..."
							  WIDTH="..."
							  HEIGHT="..."
							  BASELINE="10 20 15 20 400 20"><!-- necessary for segmentation training -->
							<String ID="segment_K"
								CONTENT="word_text"><!-- necessary for recognition training. Text is retrieved from <String> and <SP> tags. Lower level glyphs are ignored. -->
								...
							</String>
							<SP.../>
						</TextLine>
					</TextBlock>
				</ComposedBlockType>
				<TextBlock ID="textblock_M"><!-- for textlines not part of a region -->
				...
				</TextBlock>
			</PrintSpace>
		</Page>
	</Layout>
</alto>

Importantly, the parser only works with measurements in the pixel domain, i.e. an unset MeasurementUnit or one with an element value of pixel. In addition, as the minimal version required for ingestion is quite new it is likely that most existing ALTO documents will not contain sufficient information to be used with kraken out of the box.
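
To check whether an existing ALTO file carries the attributes shown above, the lines can be inspected with the Python standard library. This is only an illustrative sketch and not kraken's own parser; the sample document and the helper names are hypothetical, and the baseline is assumed to use the space-separated form from the example.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "{http://www.loc.gov/standards/alto/ns-v4#}"

# Hypothetical, heavily shortened ALTO document for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="textblock_0">
      <TextLine ID="line_0" BASELINE="10 20 15 20 400 20">
        <String ID="segment_0" CONTENT="hello"/>
        <SP/>
        <String ID="segment_1" CONTENT="world"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def parse_baseline(value):
    """Turn 'x0 y0 x1 y1 ...' into a list of (x, y) integer tuples."""
    numbers = [int(v) for v in value.split()]
    return list(zip(numbers[0::2], numbers[1::2]))


def line_text(line):
    """Concatenate <String> contents, inserting a space for each <SP>."""
    parts = []
    for child in line:
        if child.tag == ALTO_NS + "String":
            parts.append(child.get("CONTENT", ""))
        elif child.tag == ALTO_NS + "SP":
            parts.append(" ")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
lines = [(line.get("ID"), parse_baseline(line.get("BASELINE")), line_text(line))
         for line in root.iter(ALTO_NS + "TextLine")]
print(lines)
```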

PAGE XML

PAGE XML is parsed and produced according to the 2019-07-15 version of the schema, although the parser is not strict and works with non-conformant output from a variety of tools. As with ALTO, PAGE XML files can be used to train segmentation, reading order, and recognition models.

<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
	<Metadata>...</Metadata>
	<Page imageFilename="filename.jpg"...><!-- relative path to an image file from the location of the XML document -->
		<TextRegion id="block_N"
			    custom="structure {type:region_type;}"><!-- region type is a free text field-->
			<Coords points="10,20 500,20 400,200 500,300 10,300 5,80"/><!-- polygon for region boundary -->
			<TextLine id="line_K">
				<Baseline points="80,200 100,210 400,198"/><!-- required for baseline segmentation training -->
				<TextEquiv><Unicode>text text text</Unicode></TextEquiv><!-- only TextEquiv tags immediately below the TextLine tag are parsed for recognition training -->
				<Word>
				...
			</TextLine>
			....
		</TextRegion>
		<TextRegion id="textblock_M"><!-- for lines not contained in any region. TextRegions without a type are automatically assigned the 'text' type which can be filtered out for training. -->
			<Coords points="0,0 0,{{ page.size[1] }} {{ page.size[0] }},{{ page.size[1] }} {{ page.size[0] }},0"/>
			<TextLine>...</TextLine><!-- same as above -->
			....
		</TextRegion>
	</Page>
</PcGts>
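
The two attributes that matter most in the example, the points lists and the region type in the custom attribute, are simple to extract. The sketch below shows one lenient way to do so with regular expressions; it is not kraken's parser, the function names are hypothetical, and the fallback type text follows the comment in the example.

```python
import re


def parse_points(points):
    """Parse a PAGE 'points' attribute into (x, y) tuples, ignoring stray commas."""
    return [(int(x), int(y)) for x, y in re.findall(r"(-?\d+)\s*,\s*(-?\d+)", points)]


def region_type(custom, default="text"):
    """Extract the type from a 'structure {type:...;}' custom attribute."""
    match = re.search(r"structure\s*\{[^}]*?type\s*:\s*([^;}]+)", custom or "")
    return match.group(1).strip() if match else default


print(parse_points("80,200 100,210, 400,198"))
print(region_type("structure {type:region_type;}"))
print(region_type(None))
```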

Binary Datasets

In addition to training recognition models directly from XML and image files, a binary dataset format offering a couple of advantages is supported for recognition training. Binary datasets drastically improve loading performance, allowing the saturation of most GPUs with minimal computational overhead while also allowing training with datasets that are larger than the system’s main memory. A minor drawback is a ~30% increase in dataset size in comparison to the raw images + XML approach.

To realize this speedup the dataset has to be compiled first:

$ ketos compile -f xml -o dataset.arrow file_1.xml file_2.xml ...

If there are a lot of individual files containing many lines, this process can take a long time. It can easily be parallelized by specifying the number of separate parsing workers with the --workers option:

$ ketos compile --workers 8 -f xml ...

In addition, binary datasets can contain fixed splits which allow reproducibility and comparability between training and evaluation runs. Training, validation, and test splits can be pre-defined from multiple sources. Per default they are sourced from tags defined in the source XML files unless the option telling kraken to ignore them is set:

$ ketos compile --ignore-splits -f xml ...

Alternatively fixed-proportion random splits can be created ad-hoc during compile time:

$ ketos compile --random-split 0.8 0.1 0.1 ...

The above line assigns 80% of the source lines to the training set, 10% to the validation set, and 10% to the test set. The training and validation sets in the dataset file are used automatically by ketos train (unless told otherwise), while the remaining 10%, the test set, is selected by ketos test.
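
The idea behind a fixed-proportion random split can be sketched as follows. This is an illustration only and not the code ketos compile runs; the function name, the seed, and the rounding behaviour are assumptions.

```python
import random


def random_split(lines, proportions=(0.8, 0.1, 0.1), seed=42):
    """Shuffle lines and cut them into train/validation/test parts."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * proportions[0])
    n_val = round(len(shuffled) * proportions[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


train, val, test = random_split([f"line_{i}" for i in range(100)])
print(len(train), len(val), len(test))
```

Storing the result of such a split in the dataset file is what makes later training and evaluation runs comparable.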

Recognition training

The training utility allows training of VGSL specified models both from scratch and from existing models. Here are its most important command line options:

option                        action
-o, --output                  Output model file prefix. Defaults to model.
-s, --spec                    VGSL spec of the network to train. A CTC layer will be added automatically. Default: [1,48,0,1 Cr3,3,32 Do0.1,2 Mp2,2 Cr3,3,64 Do0.1,2 Mp2,2 S1(1x12)1,3 Lbx100 Do]
-a, --append                  Removes all layers from the given index onward and then appends spec. Only works when loading an existing model.
-i, --load                    Load existing file to continue training.
-F, --savefreq                Model save frequency in epochs during training.
-q, --quit                    Stop condition for training. Set to early for early stopping (default) or fixed for a fixed number of epochs.
-N, --epochs                  Number of epochs to train for.
--min-epochs                  Minimum number of epochs to train for when using early stopping.
--lag                         Number of epochs to wait before stopping training without improvement. Only used when using early stopping.
-d, --device                  Select device to use (cpu, cuda:0, cuda:1,…). GPU acceleration requires CUDA.
--optimizer                   Select optimizer (Adam, SGD, RMSprop).
-r, --lrate                   Learning rate [default: 0.001].
-m, --momentum                Momentum used with SGD optimizer. Ignored otherwise.
-w, --weight-decay            Weight decay.
--schedule                    Sets the learning rate scheduler. May be either constant, 1cycle, exponential, cosine, step, or reduceonplateau. For 1cycle the cycle length is determined by the --epochs option.
-p, --partition               Ground truth data partition ratio between train/validation set.
-u, --normalization           Ground truth Unicode normalization. One of NFC, NFKC, NFD, NFKD.
-c, --codec                   Load a codec JSON definition (invalid if loading existing model).
--resize                      Codec/output layer resizing option. If set to union code points will be added, new will set the layer to match exactly the training data, fail will abort if training data and model codec do not match. Only valid when refining an existing model.
-n, --reorder / --no-reorder  Reordering of code points to display order.
-t, --training-files          File(s) with additional paths to training data. Used to enforce an explicit train/validation set split and deal with training sets with more lines than the command line can process. Can be used more than once.
-e, --evaluation-files        File(s) with paths to evaluation data. Overrides the -p parameter.
-f, --format-type             Sets the training and evaluation data format. Valid choices are path, xml (default), alto, page, or binary. In alto, page, and xml mode all data is extracted from XML files containing both baselines and a link to source images. In path mode arguments are image files sharing a prefix up to the last extension with JSON .path files containing the baseline information. In binary mode arguments are precompiled binary dataset files.
--augment / --no-augment      Enables/disables data augmentation.
--workers                     Number of OpenMP threads and workers used to perform neural network passes and load samples from the dataset.

From Scratch

The absolute minimal example to train a new recognition model from a number of ALTO or PAGE XML documents is similar to the segmentation training:

$ ketos train -f xml training_data/*.xml

Training will continue until the error does not improve anymore and the best model (among intermediate results) will be saved in the current directory; this approach is called early stopping.
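
The interplay of the --lag and --min-epochs options with early stopping can be illustrated with a small simulation. This is a sketch of the general technique under the assumption that the validation metric is an accuracy where higher is better; the function name is hypothetical and the actual stopping logic in ketos may differ in its details.

```python
def early_stopping(val_accuracies, lag, min_epochs=0):
    """Return (epochs_run, best_epoch) for a sequence of validation accuracies.

    Training stops once `lag` epochs have passed without a new best value,
    but never before `min_epochs` epochs have been completed.
    best_epoch is a zero-based index into val_accuracies.
    """
    best_epoch, best_value = 0, float("-inf")
    for epoch, value in enumerate(val_accuracies):
        if value > best_value:
            best_epoch, best_value = epoch, value
        epochs_run = epoch + 1
        if epochs_run >= min_epochs and epoch - best_epoch >= lag:
            return epochs_run, best_epoch
    return len(val_accuracies), best_epoch


history = [0.50, 0.60, 0.70, 0.69, 0.68, 0.67, 0.66]
print(early_stopping(history, lag=3))
print(early_stopping(history, lag=3, min_epochs=7))
```

In both calls the best model is the one from the third epoch; a larger minimum number of epochs only delays the point at which training is aborted.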

In some cases changing the network architecture might be useful. One such example would be material that is not well recognized in the grayscale domain, as the default architecture definition converts images into grayscale. The input definition can be changed quite easily to train on color data (RGB) instead:

$ ketos train -f page -s '[1,120,0,3 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do0.1,2 Lbx200 Do]' syr/*.xml

Complete documentation for the network description language can be found on the VGSL page.
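
The part of the spec that changes between the grayscale and the color variant is the input definition at its start. A small sketch that reads it, assuming the four numbers are batch size, height, width, and channels in that order (with 0 standing for a variable dimension); the helper name is hypothetical.

```python
def parse_vgsl_input(spec):
    """Return (batch, height, width, channels) from the first block of a VGSL spec."""
    first_block = spec.strip().lstrip("[").split()[0]
    batch, height, width, channels = (int(v) for v in first_block.split(","))
    return batch, height, width, channels


default_spec = "[1,48,0,1 Cr3,3,32 Do0.1,2 Mp2,2 Cr3,3,64 Do0.1,2 Mp2,2 S1(1x12)1,3 Lbx100 Do]"
batch, height, width, channels = parse_vgsl_input(default_spec)
print("grayscale" if channels == 1 else f"{channels} channels", "line height", height)
```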

Sometimes the early stopping default parameters might produce suboptimal results such as stopping training too soon. Adjusting the lag can be useful:

$ ketos train --lag 10 syr/*.png

To switch optimizers from Adam to SGD or RMSprop just set the option:

$ ketos train --optimizer SGD syr/*.png

It is possible to resume training from a previously saved model:

$ ketos train -i model_25.mlmodel syr/*.png

A good configuration for a small precompiled print dataset and GPU acceleration would be:

$ ketos train -d cuda -f binary dataset.arrow

A better configuration for large and complicated datasets such as handwritten texts:

$ ketos train --augment --workers 4 -d cuda -f binary --min-epochs 20 -w 0 -s '[1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do0.1,2 Lbx200 Do]' -r 0.0001 dataset_large.arrow

This configuration is slower to train and often requires a couple of epochs to output any sensible text at all. Therefore we tell ketos to train for at least 20 epochs so the early stopping algorithm doesn’t prematurely interrupt the training process.

Fine Tuning

Fine tuning an existing model for another typeface or new characters is also possible with the same syntax as resuming regular training:

$ ketos train -f page -i model_best.mlmodel syr/*.xml

The caveat is that the alphabet of the base model and training data have to be an exact match. Otherwise an error will be raised:

$ ketos train -i model_5.mlmodel kamil/*.png
Building training set  [####################################]  100%
Building validation set  [####################################]  100%
[0.8616] alphabet mismatch {'~', '»', '8', '9', 'ـ'}
Network codec not compatible with training set
[0.8620] Training data and model codec alphabets mismatch: {'ٓ', '؟', '!', 'ص', '،', 'ذ', 'ة', 'ي', 'و', 'ب', 'ز', 'ح', 'غ', '~', 'ف', ')', 'د', 'خ', 'م', '»', 'ع', 'ى', 'ق', 'ش', 'ا', 'ه', 'ك', 'ج', 'ث', '(', 'ت', 'ظ', 'ض', 'ل', 'ط', '؛', 'ر', 'س', 'ن', 'ء', 'ٔ', '«', 'ـ', 'ٕ'}

There are two modes dealing with mismatching alphabets, union and new. union resizes the output layer and codec of the loaded model to include all characters in the new training set without removing any characters. new will make the resulting model an exact match with the new training set by both removing unused characters from the model and adding new ones.

$ ketos -v train --resize union -i model_5.mlmodel syr/*.png
...
[0.7943] Training set 788 lines, validation set 88 lines, alphabet 50 symbols
...
[0.8337] Resizing codec to include 3 new code points
[0.8374] Resizing last layer in network to 52 outputs
...

In this example 3 characters were added for a network that is able to recognize 52 different characters after sufficient additional training.

$ ketos -v train --resize new -i model_5.mlmodel syr/*.png
...
[0.7593] Training set 788 lines, validation set 88 lines, alphabet 49 symbols
...
[0.7857] Resizing network or given codec to 49 code sequences
[0.8344] Deleting 2 output classes from network (46 retained)
...

In new mode 2 of the original characters were removed and 3 new ones were added.
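
The effect of the resize modes on the alphabet can be expressed with plain set operations. The sketch below only models the alphabets, not the network layers, and uses hypothetical names and toy alphabets; the fail mode follows the description in the option table above.

```python
def resize_alphabet(model_alphabet, data_alphabet, mode):
    """Return (new_alphabet, added, removed) for the union, new and fail modes."""
    model_alphabet, data_alphabet = set(model_alphabet), set(data_alphabet)
    added = data_alphabet - model_alphabet
    if mode == "union":
        return model_alphabet | data_alphabet, added, set()
    if mode == "new":
        return set(data_alphabet), added, model_alphabet - data_alphabet
    if mode == "fail":
        if model_alphabet != data_alphabet:
            raise ValueError("training data and model codec do not match")
        return model_alphabet, set(), set()
    raise ValueError(f"unknown mode: {mode}")


model_chars, data_chars = "abcde", "cdefgh"
print(resize_alphabet(model_chars, data_chars, "union"))
print(resize_alphabet(model_chars, data_chars, "new"))
```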

Slicing

Refining on mismatched alphabets has its limits. If the alphabets are highly different, the modification of the final linear layer to add/remove characters will destroy the inference capabilities of the network. In those cases it is faster to slice off the last few layers of the network and only train those instead of a complete network from scratch.

Taking the default network definition as printed in the debug log we can see the layer indices of the model:

[0.8760] Creating new model [1,48,0,1 Cr3,3,32 Do0.1,2 Mp2,2 Cr3,3,64 Do0.1,2 Mp2,2 S1(1x12)1,3 Lbx100 Do] with 48 outputs
[0.8762] layer          type    params
[0.8790] 0              conv    kernel 3 x 3 filters 32 activation r
[0.8795] 1              dropout probability 0.1 dims 2
[0.8797] 2              maxpool kernel 2 x 2 stride 2 x 2
[0.8802] 3              conv    kernel 3 x 3 filters 64 activation r
[0.8804] 4              dropout probability 0.1 dims 2
[0.8806] 5              maxpool kernel 2 x 2 stride 2 x 2
[0.8813] 6              reshape from 1 1 x 12 to 1/3
[0.8876] 7              rnn     direction b transposed False summarize False out 100 legacy None
[0.8878] 8              dropout probability 0.5 dims 1
[0.8883] 9              linear  augmented False out 48

To remove everything after the initial convolutional stack and add untrained layers we define a network stub and index for appending:

$ ketos train -i model_1.mlmodel --append 7 -s '[Lbx256 Do]' syr/*.png
Building training set  [####################################]  100%
Building validation set  [####################################]  100%
[0.8014] alphabet mismatch {'8', '3', '9', '7', '܇', '݀', '݂', '4', ':', '0'}
Slicing and dicing model ✓

The resulting model will behave exactly like a new one, except that it potentially trains a lot faster.
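
What the index passed to --append selects can be illustrated on the spec string itself. The sketch below treats the spec as a list of layer tokens, which is a simplification of what happens to the actual network; the helper name is hypothetical, and the output layer is left out because it is added automatically.

```python
def append_at(spec, index, stub):
    """Keep layers 0..index-1 of spec and append the layers of stub."""
    layers = spec.strip("[]").split()[1:]      # drop the input definition
    new_layers = stub.strip("[]").split()
    return layers[:index] + new_layers


default_spec = "[1,48,0,1 Cr3,3,32 Do0.1,2 Mp2,2 Cr3,3,64 Do0.1,2 Mp2,2 S1(1x12)1,3 Lbx100 Do]"
sliced = append_at(default_spec, 7, "[Lbx256 Do]")
for i, layer in enumerate(sliced):
    print(i, layer)
```

Layers 0 to 6 keep their trained weights, while the recurrent layer and everything after it start untrained.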

Text Normalization and Unicode

Text can be encoded in multiple different ways when using Unicode. For many scripts characters with diacritics can be encoded either as a single code point or a base character and the diacritic, different types of whitespace exist, and mixed bidirectional text can be written differently depending on the base line direction.

Ketos provides options to normalize input into forms that make processing of data from multiple sources possible. Principally, two options are available: one for Unicode normalization and one for whitespace normalization. The Unicode normalization switch (disabled per default) allows one to select one of the 4 normalization forms:

$ ketos train --normalization NFD -f xml training_data/*.xml
$ ketos train --normalization NFC -f xml training_data/*.xml
$ ketos train --normalization NFKD -f xml training_data/*.xml
$ ketos train --normalization NFKC -f xml training_data/*.xml
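
The practical difference between the four forms can be inspected with the unicodedata module of the Python standard library. The sample string is an arbitrary illustration.

```python
import unicodedata

# "é" as a single precomposed code point (U+00E9) and the "ﬁ" ligature (U+FB01).
text = "\u00e9\ufb01"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, text)
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in normalized)
    print(f"{form}: {code_points}")
```

The canonical forms (NFC, NFD) only compose or decompose diacritics, while the compatibility forms (NFKC, NFKD) additionally replace characters such as ligatures, which changes the alphabet the model has to learn.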

Whitespace normalization is enabled per default and converts all Unicode whitespace characters into a simple space. It is highly recommended to leave this function enabled as the variation of space width, resulting either from text justification or the irregularity of handwriting, is difficult for a recognition model to accurately model and map onto the different space code points. Nevertheless it can be disabled through:

$ ketos train --no-normalize-whitespace -f xml training_data/*.xml
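
The effect of whitespace normalization on a transcription can be approximated in a few lines. This is a sketch that assumes a one-to-one replacement of each whitespace character; the function name is hypothetical and ketos may treat runs of whitespace differently.

```python
def normalize_whitespace(text):
    """Replace every Unicode whitespace character with a plain space (U+0020)."""
    return "".join(" " if ch.isspace() else ch for ch in text)


# No-break space, em space and thin space between the letters.
sample = "a\u00a0b\u2003c\u2009d"
print(repr(normalize_whitespace(sample)))
```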

Further the behavior of the BiDi algorithm can be influenced through two options. The configuration of the algorithm is important as the recognition network is trained to output characters (or rather labels which are mapped to code points by a codec) in the order a line is fed into the network, i.e. left-to-right also called display order. Unicode text is encoded as a stream of code points in logical order, i.e. the order the characters in a line are read in by a human reader, for example (mostly) right-to-left for a text in Hebrew. The BiDi algorithm resolves this logical order to the display order expected by the network and vice versa. The primary parameter of the algorithm is the base direction which is just the default direction of the input fields of the user when the ground truth was initially transcribed. Base direction will be automatically determined by kraken when using PAGE XML or ALTO files that contain it, otherwise it will have to be supplied if it differs from the default when training a model:

$ ketos train --base-dir R -f xml rtl_training_data/*.xml
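
When it is unclear which base direction a set of transcriptions was written with, the first strongly directional character of each line gives a useful hint. The sketch below applies that first-strong heuristic with the standard library; it is not how kraken determines the base direction, and the function name is hypothetical.

```python
import unicodedata


def guess_base_direction(text, default="L"):
    """Return 'L' or 'R' depending on the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "L"
        if bidi_class in ("R", "AL"):
            return "R"
    return default


print(guess_base_direction("kraken"))
print(guess_base_direction("\u05e9\u05dc\u05d5\u05dd"))        # Hebrew
print(guess_base_direction("123 \u0633\u0644\u0627\u0645"))    # digits, then Arabic
```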

It is also possible to disable BiDi processing completely, e.g. when the text has been brought into display order already:

$ ketos train --no-reorder -f xml rtl_display_data/*.xml

Codecs

Codecs map between the label decoded from the raw network output and Unicode code points (see this diagram for the precise steps involved in text line recognition). Codecs are attached to a recognition model and are usually defined once at initial training time, although they can be adapted either explicitly (with the API) or implicitly through domain adaptation.

The default behavior of kraken is to auto-infer this mapping from all the characters in the training set and map each code point to one separate label. This is usually sufficient for alphabetic scripts, abjads, and abugidas apart from very specialised use cases. Logographic writing systems with a very large number of different graphemes, such as all the variants of Han characters or Cuneiform, can be more problematic as their large inventory makes recognition both slow and error-prone. In such cases it can be advantageous to decompose each code point into multiple labels to reduce the output dimensionality of the network. During decoding valid sequences of labels will be mapped to their respective code points as usual.

There are multiple approaches one could follow constructing a custom codec: randomized block codes, i.e. producing random fixed-length labels for each code point, Huffman coding, i.e. variable length label sequences depending on the frequency of each code point in some text (not necessarily the training set), or structural decomposition, i.e. describing each code point through a sequence of labels that describe the shape of the grapheme similar to how some input systems for Chinese characters function.

While the system is functional it is not well-tested in practice and it is unclear which approach works best for which kinds of inputs.
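
The first of these approaches, randomized block codes, can be sketched in a few lines. The function name, the code length of 4, and the label range are hypothetical choices; label 0 is left unused on the assumption that it is reserved, e.g. for the CTC blank. The result is a dictionary from characters to integer sequences that can be serialized as JSON.

```python
import json
import random


def random_block_codec(alphabet, code_length=4, num_labels=96, seed=0):
    """Assign a unique random fixed-length label sequence to every character."""
    chars = sorted(set(alphabet))
    if num_labels ** code_length < len(chars):
        raise ValueError("not enough distinct label sequences for this alphabet")
    rng = random.Random(seed)
    codec, used = {}, set()
    for char in chars:
        sequence = tuple(rng.randint(1, num_labels) for _ in range(code_length))
        while sequence in used:      # redraw on the rare collision
            sequence = tuple(rng.randint(1, num_labels) for _ in range(code_length))
        used.add(sequence)
        codec[char] = list(sequence)
    return codec


codec = random_block_codec("SAB\u1f05")
print(json.dumps(codec, ensure_ascii=True, indent=1))
```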

Custom codecs can be supplied as simple JSON files that contain a dictionary mapping between strings and integer sequences, e.g.:

$ ketos train -c sample.codec -f xml training_data/*.xml

with sample.codec containing:

{"S": [50, 53, 74, 23],
 "A": [95, 60, 19, 95],
 "B": [2, 96, 28, 29],
 "\u1f05": [91, 14, 95, 90]}

Unsupervised recognition pretraining

Text recognition models can be pretrained in an unsupervised fashion from text line images, both in bounding box and baseline format. The pretraining is performed through a contrastive surrogate task aiming to distinguish in-painted parts of the input image features from randomly sampled distractor slices.

All data sources accepted by the supervised trainer are valid for pretraining but for performance reasons it is recommended to use pre-compiled binary datasets. One thing to keep in mind is that compilation filters out empty (non-transcribed) text lines per default which is undesirable for pretraining. With the --keep-empty-lines option all valid lines will be written to the dataset file:

$ ketos compile --keep-empty-lines -f xml -o foo.arrow *.xml

The basic pretraining call is very similar to a training one:

$ ketos pretrain -f binary foo.arrow

There are a couple of hyperparameters that are specific to pretraining: the mask width (at the subsampling level of the last convolutional layer), the probability of a particular position being the start position of a mask, and the number of negative distractor samples.

$ ketos pretrain -o pretrain --mask-width 4 --mask-probability 0.2 --num-negatives 3 -f binary foo.arrow
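
How the mask width and the mask probability interact can be visualized with a small simulation. This is a sketch of the sampling scheme as described above, not the code used by ketos pretrain; the function name and the sequence length are hypothetical.

```python
import random


def sample_mask(length, mask_width=4, mask_probability=0.2, seed=0):
    """Return a list of booleans marking masked positions of a feature sequence."""
    rng = random.Random(seed)
    mask = [False] * length
    for start in range(length):
        # Each position independently becomes the start of a mask span.
        if rng.random() < mask_probability:
            for position in range(start, min(start + mask_width, length)):
                mask[position] = True
    return mask


mask = sample_mask(40)
print("".join("#" if masked else "." for masked in mask))
```

Because spans may overlap, the fraction of masked positions grows quickly with both parameters.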

Once a model has been pretrained it has to be adapted to perform actual recognition with a standard labelled dataset, although training data requirements will usually be much reduced:

$ ketos train -i pretrain_best.mlmodel --warmup 5000 --freeze-backbone 1000 -f binary labelled.arrow

It is necessary to use learning rate warmup (--warmup) for at least a couple of epochs in addition to freezing the backbone (all but the last fully connected layer performing the classification) to have the model converge during fine-tuning. Fine-tuning models from pre-trained weights is quite a bit less stable than training from scratch or fine-tuning an existing model. As such it can be necessary to run a couple of trials with different hyperparameters (principally learning rate) to find workable ones. It is entirely possible that pretrained models do not converge at all even with reasonable hyperparameter configurations.

Segmentation training

Training a segmentation model is very similar to training models for text recognition. The basic invocation is:

$ ketos segtrain -f xml training_data/*.xml

This takes all text lines and regions encoded in the XML files and trains a model to recognize them.

Most other options available in transcription training are also available in segmentation training. CUDA acceleration:

$ ketos segtrain -d cuda -f xml training_data/*.xml

Defining custom architectures:

$ ketos segtrain -d cuda -s '[1,1200,0,3 Cr7,7,64,2,2 Gn32 Cr3,3,128,2,2 Gn32 Cr3,3,128 Gn32 Cr3,3,256 Gn32]' -f xml training_data/*.xml

Fine tuning/transfer learning with last layer adaptation and slicing:

$ ketos segtrain --resize new -i segmodel_best.mlmodel training_data/*.xml
$ ketos segtrain -i segmodel_best.mlmodel --append 7 -s '[Cr3,3,64 Do0.1]' training_data/*.xml

In addition there are a couple of specific options that allow filtering of baseline and region types. Datasets are often annotated to a level that is too detailed or contains undesirable types, e.g. when combining segmentation data from different sources. The most basic option is the suppression of all of either baseline or region data contained in the dataset:

$ ketos segtrain --suppress-baselines -f xml training_data/*.xml
Training line types:
Training region types:
  graphic       3       135
  text  4       1128
  separator     5       5431
  paragraph     6       10218
  table 7       16
...
$ ketos segtrain --suppress-regions -f xml training_data/*.xml
Training line types:
  default 2     53980
  foo     8     134
...

It is also possible to filter out baselines/regions selectively:

$ ketos segtrain -f xml --valid-baselines default training_data/*.xml
Training line types:
  default 2     53980
Training region types:
  graphic       3       135
  text  4       1128
  separator     5       5431
  paragraph     6       10218
  table 7       16
$ ketos segtrain -f xml --valid-regions graphic --valid-regions paragraph training_data/*.xml
Training line types:
  default 2     53980
Training region types:
  graphic       3       135
  paragraph     6       10218

Finally, we can merge baselines and regions into each other:

$ ketos segtrain -f xml --merge-baselines default:foo training_data/*.xml
Training line types:
  default 2     54114
...
$ ketos segtrain -f xml --merge-regions text:paragraph --merge-regions graphic:table training_data/*.xml
...
Training region types:
  graphic       3       151
  text  4       11346
  separator     5       5431
...

These options are combinable to massage the dataset into any typology you want. Tags containing the separator character : can be specified by escaping them with backslash.
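
The effect of merging and filtering on the type counts printed above can be reproduced with a small helper. It is a sketch operating on the counts only; the function name is hypothetical, and the order in which ketos applies merging and filtering is an assumption.

```python
def apply_typology(counts, merges=None, valid=None):
    """Merge source types into target types, then keep only the valid ones."""
    merges = merges or {}
    result = {}
    for name, count in counts.items():
        target = merges.get(name, name)
        result[target] = result.get(target, 0) + count
    if valid is not None:
        result = {name: count for name, count in result.items() if name in valid}
    return result


regions = {"graphic": 135, "text": 1128, "separator": 5431, "paragraph": 10218, "table": 16}
# Equivalent of --merge-regions text:paragraph --merge-regions graphic:table
merged = apply_typology(regions, merges={"paragraph": "text", "table": "graphic"})
print(merged)
# Equivalent of --valid-regions graphic --valid-regions paragraph
print(apply_typology(regions, valid={"graphic", "paragraph"}))
```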

Then there are some options that set metadata fields controlling the postprocessing. When computing the bounding polygons the recognized baselines are offset slightly to ensure overlap with the line corpus. This offset is per default upwards for baselines but as it is possible to annotate toplines (for scripts like Hebrew) and centerlines (for baseline-free scripts like Chinese) the appropriate offset can be selected with an option:

$ ketos segtrain --topline -f xml hebrew_training_data/*.xml
$ ketos segtrain --centerline -f xml chinese_training_data/*.xml
$ ketos segtrain --baseline -f xml latin_training_data/*.xml

Lastly, there are some regions that are absolute boundaries for text line content. When these regions are marked as such the polygonization can sometimes be improved:

$ ketos segtrain --bounding-regions paragraph -f xml training_data/*.xml
...

Reading order training

Reading order models work slightly differently from segmentation and recognition models. They are closely linked to the typology used in the dataset they were trained on as they use type information on lines and regions to make ordering decisions. As the same typology was probably used to train a specific segmentation model, reading order models are trained separately but bundled with their segmentation model in a subsequent step. The general sequence is therefore:

$ ketos segtrain -o fr_manu_seg.mlmodel -f xml french/*.xml
...
$ ketos rotrain -o fr_manu_ro.mlmodel -f xml french/*.xml
...
$ ketos roadd -o fr_manu_seg_with_ro.mlmodel -i fr_manu_seg_best.mlmodel  -r fr_manu_ro_best.mlmodel

Only the fr_manu_seg_with_ro.mlmodel file will contain the trained reading order model. Segmentation models can exist with or without reading order models. If one is added, the neural reading order will be computed in addition to the one produced by the default heuristic during segmentation and serialized in the final XML output (in ALTO/PAGE XML).

Note

Reading order models work purely on the typology and geometric features of the lines and regions. They construct an approximate ordering matrix by feeding feature vectors of two lines (or regions) into the network to decide which of those two lines precedes the other.

These feature vectors are quite simple: just the lines’ types and their start, center, and end points. Therefore they cannot reliably learn any ordering that relies on graphical features of the input page such as line color, typeface, or writing system.

Reading order models are extremely simple and do not require a lot of memory or computational power to train. In fact, the default parameters are extremely conservative and it is recommended to increase the batch size for improved training speed. Large batch sizes above 128k are easily possible with sufficiently large training datasets:

$ ketos rotrain -o fr_manu_ro.mlmodel -B 128000 -f xml french/*.xml
Training RO on following baselines types:
  DefaultLine   1
  DropCapitalLine       2
  HeadingLine   3
  InterlinearLine       4
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
┏━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃   ┃ Name        ┃ Type              ┃ Params ┃
┡━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ 0 │ criterion   │ BCEWithLogitsLoss │      0 │
│ 1 │ ro_net      │ MLP               │  1.1 K │
│ 2 │ ro_net.fc1  │ Linear            │  1.0 K │
│ 3 │ ro_net.relu │ ReLU              │      0 │
│ 4 │ ro_net.fc2  │ Linear            │     45 │
└───┴─────────────┴───────────────────┴────────┘
Trainable params: 1.1 K
Non-trainable params: 0
Total params: 1.1 K
Total estimated model params size (MB): 0
stage 0/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/35 0:00:00 • -:--:-- 0.00it/s val_spearman: 0.912 val_loss: 0.701 early_stopping: 0/300 inf

During validation a metric called Spearman’s footrule is computed. To calculate Spearman’s footrule, the ranks of the lines of text in the ground truth reading order and the predicted reading order are compared. The footrule is then calculated as the sum of the absolute differences between the ranks of pairs of lines. The score increases by 1 for each line between the correct and predicted positions of a line.

A lower footrule score indicates a better alignment between the two orders. A score of 0 implies perfect alignment of line ranks.
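
The computation described above fits in a few lines of Python. This sketch returns the raw sum of rank differences; the value reported by ketos during validation may be normalized differently, and the function name and line identifiers are hypothetical.

```python
def spearman_footrule(ground_truth, predicted):
    """Sum of absolute rank differences between two orderings of the same lines."""
    if sorted(ground_truth) != sorted(predicted):
        raise ValueError("both orders must contain the same line identifiers")
    predicted_rank = {line: rank for rank, line in enumerate(predicted)}
    return sum(abs(rank - predicted_rank[line]) for rank, line in enumerate(ground_truth))


truth = ["l1", "l2", "l3", "l4"]
print(spearman_footrule(truth, ["l1", "l2", "l3", "l4"]))   # identical order
print(spearman_footrule(truth, ["l2", "l1", "l3", "l4"]))   # two neighbours swapped
print(spearman_footrule(truth, ["l4", "l3", "l2", "l1"]))   # fully reversed
```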

Recognition testing

Picking a particular model from a pool or getting a more detailed look at the recognition accuracy can be done with the test command. It uses transcribed lines, the test set, in the same format as the train command, recognizes the line images with one or more models, and creates a detailed report of the differences from the ground truth for each of them.

option                        action
-f, --format-type             Sets the test set data format. Valid choices are path, xml (default), alto, page, or binary. In alto, page, and xml mode all data is extracted from XML files containing both baselines and a link to source images. In path mode arguments are image files sharing a prefix up to the last extension with JSON .path files containing the baseline information. In binary mode arguments are precompiled binary dataset files.
-m, --model                   Model(s) to evaluate.
-e, --evaluation-files        File(s) with paths to evaluation data.
-d, --device                  Select device to use.
--pad                         Left and right padding around lines.

Transcriptions are handed to the command in the same way as for the train command, either through a manifest with -e/--evaluation-files or by just adding a number of image files as the final argument:

$ ketos test -m $model -e test.txt test/*.png
Evaluating $model
Evaluating  [####################################]  100%
=== report test_model.mlmodel ===

7012 Characters
6022 Errors
14.12%       Accuracy

5226 Insertions
2    Deletions
794  Substitutions

Count Missed   %Right
1567  575    63.31%  Common
5230  5230   0.00%   Arabic
215   215    0.00%   Inherited

Errors       Correct-Generated
773  { ا } - {  }
536  { ل } - {  }
328  { و } - {  }
274  { ي } - {  }
266  { م } - {  }
256  { ب } - {  }
246  { ن } - {  }
241  { SPACE } - {  }
207  { ر } - {  }
199  { ف } - {  }
192  { ه } - {  }
174  { ع } - {  }
172  { ARABIC HAMZA ABOVE } - {  }
144  { ت } - {  }
136  { ق } - {  }
122  { س } - {  }
108  { ، } - {  }
106  { د } - {  }
82   { ك } - {  }
81   { ح } - {  }
71   { ج } - {  }
66   { خ } - {  }
62   { ة } - {  }
60   { ص } - {  }
39   { ، } - { - }
38   { ش } - {  }
30   { ا } - { - }
30   { ن } - { - }
29   { ى } - {  }
28   { ذ } - {  }
27   { ه } - { - }
27   { ARABIC HAMZA BELOW } - {  }
25   { ز } - {  }
23   { ث } - {  }
22   { غ } - {  }
20   { م } - { - }
20   { ي } - { - }
20   { ) } - {  }
19   { : } - {  }
19   { ط } - {  }
19   { ل } - { - }
18   { ، } - { . }
17   { ة } - { - }
16   { ض } - {  }
...
Average accuracy: 14.12%, (stddev: 0.00)

The reports contain character accuracy measured per script and a detailed list of confusions. When evaluating multiple models the last line of the output will be the average accuracy and the standard deviation across all of them.
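
The headline figures of the sample report are related by simple arithmetic, which can be verified as follows. The numbers are taken from the report above; the function name is hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics


def character_accuracy(characters, errors):
    """Fraction of ground truth characters not counted as errors."""
    return (characters - errors) / characters


insertions, deletions, substitutions = 5226, 2, 794
errors = insertions + deletions + substitutions
accuracy = character_accuracy(7012, errors)
accuracies = [accuracy]                     # one entry per evaluated model
print(f"{errors} errors, accuracy {accuracy:.2%}")
print(f"average {statistics.mean(accuracies):.2%}, stddev {statistics.pstdev(accuracies):.2f}")
```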