Training¶
This page describes the training utilities available through the ketos
command line utility in depth. For a gentle introduction on model training
please refer to the tutorial.
Both segmentation and recognition are trainable in kraken. The segmentation model finds baselines and regions on a page image. Recognition models convert text image lines found by the segmenter into digital text.
Training data formats¶
The training tools accept a variety of training data formats, usually some kind of custom low level format, and the XML-based formats that are commony used for archival of annotation and transcription data. It is recommended to use the XML formats as they are interchangeable with other tools, do not incur transformation losses, and allow training all components of kraken from the same datasets easily.
ALTO¶
Kraken parses and produces files according to the upcoming version of the ALTO standard: 4.2. It validates against version 4.1 with the exception of the redefinition of the BASELINE attribute to accomodate polygonal chain baselines. An example showing the attributes necessary for segmentation and recognition training follows:
<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.loc.gov/standards/alto/ns-v4#"
xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v4/alto-4-0.xsd">
<Description>
<sourceImageInformation>
<fileName>filename.jpg</fileName><!-- relative path in relation to XML location of the image file-->
</sourceImageInformation>
....
</Description>
<Layout>
<Page...>
<PrintSpace...>
<ComposedBlockType ID="block_I"
HPOS="125"
VPOS="523"
WIDTH="5234"
HEIGHT="4000"
TYPE="region_type"><!-- for textlines part of a semantic region -->
<TextBlock ID="textblock_N">
<TextLine ID="line_0"
HPOS="..."
VPOS="..."
WIDTH="..."
HEIGHT="..."
BASELINE="10 20 15 20 400 20"><!-- necessary for segmentation training -->
<String ID="segment_K"
CONTENT="word_text"><!-- necessary for recognition training. Text is retrieved from <String> and <SP> tags. Lower level glyphs are ignored. -->
...
</String>
<SP.../>
</TextLine>
</TextBlock>
</ComposedBlockType>
<TextBlock ID="textblock_M"><!-- for textlines not part of a region -->
...
</TextBlock>
</PrintSpace>
</Page>
</Layout>
</alto>
Importantly, the parser only works with measurements in the pixel domain, i.e. an unset MeasurementUnit or one with an element value of pixel. In addition, as the minimal version required for ingestion is quite new it is likely that most existing ALTO documents will not contain sufficient information to be used with kraken out of the box.
PAGE XML¶
PAGE XML is parsed and produced according to the 2019-07-15 version of the schema, although the parser is not strict and works with non-conformant output of a variety of tools.
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
<Metadata>...</Metadata>
<Page imageFilename="filename.jpg"...><!-- relative path to an image file from the location of the XML document -->
<TextRegion id="block_N"
custom="structure {type:region_type;}"><!-- region type is a free text field-->
<Coords points="10,20 500,20 400,200, 500,300, 10,300 5,80"/><!-- polygon for region boundary -->
<TextLine id="line_K">
<Baseline points="80,200 100,210, 400,198"/><!-- required for baseline segmentation training -->
<TextEquiv><Unicode>text text text</Unicode></TextEquiv><!-- only TextEquiv tags immediately below the TextLine tag are parsed for recognition training -->
<Word>
...
</TextLine>
....
</TextRegion>
<TextRegion id="textblock_M"><!-- for lines not contained in any region. TextRegions without a type are automatically assigned the 'text' type which can be filtered out for training. -->
<Coords points="0,0 0,{{ page.size[1] }} {{ page.size[0] }},{{ page.size[1] }} {{ page.size[0] }},0"/>
<TextLine>...</TextLine><!-- same as above -->
....
</TextRegion>
</Page>
</PcGts>
Recognition training¶
The training utility allows training of VGSL specified models both from scratch and from existing models. Here are its command line options:
option |
action |
---|---|
-p, –pad |
Left and right padding around lines |
-o, –output |
Output model file prefix. Defaults to model. |
-s, –spec |
VGSL spec of the network to train. CTC layer will be added automatically. default: [1,48,0,1 Cr3,3,32 Do0.1,2 Mp2,2 Cr3,3,64 Do0.1,2 Mp2,2 S1(1x12)1,3 Lbx100 Do] |
-a, –append |
Removes layers before argument and then appends spec. Only works when loading an existing model |
-i, –load |
Load existing file to continue training |
-F, –savefreq |
Model save frequency in epochs during training |
-R, –report |
Report creation frequency in epochs |
-q, –quit |
Stop condition for training. Set to early for early stopping (default) or dumb for fixed number of epochs. |
-N, –epochs |
Number of epochs to train for. Set to -1 for indefinite training. |
–lag |
Number of epochs to wait before stopping training without improvement. Only used when using early stopping. |
–min-delta |
Minimum improvement between epochs to reset early stopping. Defaults to 0.005. |
-d, –device |
Select device to use (cpu, cuda:0, cuda:1,…). GPU acceleration requires CUDA. |
–optimizer |
Select optimizer (Adam, SGD, RMSprop). |
-r, –lrate |
Learning rate [default: 0.001] |
-m, –momentum |
Momentum used with SGD optimizer. Ignored otherwise. |
-w, –weight-decay |
Weight decay. |
–schedule |
Sets the learning rate scheduler. May be either constant or 1cycle. For 1cycle the cycle length is determined by the –epoch option. |
-p, –partition |
Ground truth data partition ratio between train/validation set |
-u, –normalization |
Ground truth Unicode normalization. One of NFC, NFKC, NFD, NFKD. |
-c, –codec |
Load a codec JSON definition (invalid if loading existing model) |
–resize |
Codec/output layer resizing option. If set to add code points will be added, both will set the layer to match exactly the training data, fail will abort if training data and model codec do not match. Only valid when refining an existing model. |
-n, –reorder / –no-reorder |
Reordering of code points to display order. |
-t, –training-files |
File(s) with additional paths to training data. Used to enforce an explicit train/validation set split and deal with training sets with more lines than the command line can process. Can be used more than once. |
-e, –evaluation-files |
File(s) with paths to evaluation data. Overrides the -p parameter. |
–preload / –no-preload |
Hard enable/disable for training data preloading. Preloading training data into memory is enabled per default for sets with less than 2500 lines. |
–threads |
Number of OpenMP threads when running on CPU. Defaults to min(4, #cores). |
From Scratch¶
The absolute minimal example to train a new recognition model from a number of PAGE XML documents is similar to the segmentation training:
$ ketos train training_data/*.png
Training will continue until the error does not improve anymore and the best model (among intermediate results) will be saved in the current directory.
In some cases, such as color inputs, changing the network architecture might be useful:
$ ketos train -f page -s '[1,0,0,3 Cr3,3,16 Mp3,3 Lfys64 Lbx128 Lbx256 Do]' syr/*.xml
Complete documentation for the network description language can be found on the VGSL page.
Sometimes the early stopping default parameters might produce suboptimal results such as stopping training too soon. Adjusting the minimum delta an/or lag can be useful:
$ ketos train --lag 10 --min-delta 0.001 syr/*.png
To switch optimizers from Adam to SGD or RMSprop just set the option:
$ ketos train --optimizer SGD syr/*.png
It is possible to resume training from a previously saved model:
$ ketos train -i model_25.mlmodel syr/*.png
Fine Tuning¶
Fine tuning an existing model for another typeface or new characters is also possible with the same syntax as resuming regular training:
$ ketos train -f page -i model_best.mlmodel syr/*.xml
The caveat is that the alphabet of the base model and training data have to be an exact match. Otherwise an error will be raised:
$ ketos train -i model_5.mlmodel --no-preload kamil/*.png
Building training set [####################################] 100%
Building validation set [####################################] 100%
[0.8616] alphabet mismatch {'~', '»', '8', '9', 'ـ'}
Network codec not compatible with training set
[0.8620] Training data and model codec alphabets mismatch: {'ٓ', '؟', '!', 'ص', '،', 'ذ', 'ة', 'ي', 'و', 'ب', 'ز', 'ح', 'غ', '~', 'ف', ')', 'د', 'خ', 'م', '»', 'ع', 'ى', 'ق', 'ش', 'ا', 'ه', 'ك', 'ج', 'ث', '(', 'ت', 'ظ', 'ض', 'ل', 'ط', '؛', 'ر', 'س', 'ن', 'ء', 'ٔ', '«', 'ـ', 'ٕ'}
There are two modes dealing with mismatching alphabets, add
and both
.
add
resizes the output layer and codec of the loaded model to include all
characters in the new training set without removing any characters. both
will make the resulting model an exact match with the new training set by both
removing unused characters from the model and adding new ones.
$ ketos -v train --resize add -i model_5.mlmodel syr/*.png
...
[0.7943] Training set 788 lines, validation set 88 lines, alphabet 50 symbols
...
[0.8337] Resizing codec to include 3 new code points
[0.8374] Resizing last layer in network to 52 outputs
...
In this example 3 characters were added for a network that is able to recognize 52 different characters after sufficient additional training.
$ ketos -v train --resize both -i model_5.mlmodel syr/*.png
...
[0.7593] Training set 788 lines, validation set 88 lines, alphabet 49 symbols
...
[0.7857] Resizing network or given codec to 49 code sequences
[0.8344] Deleting 2 output classes from network (46 retained)
...
In both
mode 2 of the original characters were removed and 3 new ones were added.
Slicing¶
Refining on mismatched alphabets has its limits. If the alphabets are highly different the modification of the final linear layer to add/remove character will destroy the inference capabilities of the network. In those cases it is faster to slice off the last few layers of the network and only train those instead of a complete network from scratch.
Taking the default network definition as printed in the debug log we can see the layer indices of the model:
[0.8760] Creating new model [1,48,0,1 Cr3,3,32 Do0.1,2 Mp2,2 Cr3,3,64 Do0.1,2 Mp2,2 S1(1x12)1,3 Lbx100 Do] with 48 outputs
[0.8762] layer type params
[0.8790] 0 conv kernel 3 x 3 filters 32 activation r
[0.8795] 1 dropout probability 0.1 dims 2
[0.8797] 2 maxpool kernel 2 x 2 stride 2 x 2
[0.8802] 3 conv kernel 3 x 3 filters 64 activation r
[0.8804] 4 dropout probability 0.1 dims 2
[0.8806] 5 maxpool kernel 2 x 2 stride 2 x 2
[0.8813] 6 reshape from 1 1 x 12 to 1/3
[0.8876] 7 rnn direction b transposed False summarize False out 100 legacy None
[0.8878] 8 dropout probability 0.5 dims 1
[0.8883] 9 linear augmented False out 48
To remove everything after the initial convolutional stack and add untrained layers we define a network stub and index for appending:
$ ketos train -i model_1.mlmodel --append 7 -s '[Lbx256 Do]' syr/*.png
Building training set [####################################] 100%
Building validation set [####################################] 100%
[0.8014] alphabet mismatch {'8', '3', '9', '7', '܇', '݀', '݂', '4', ':', '0'}
Slicing and dicing model ✓
The new model will behave exactly like a new one, except potentially training a lot faster.
Segmentation training¶
Training a segmentation model is very similar to training one for
Testing¶
Picking a particular model from a pool or getting a more detailled look on the recognition accuracy can be done with the test command. It uses transcribed lines, the test set, in the same format as the train command, recognizes the line images with one or more models, and creates a detailled report of the differences from the ground truth for each of them.
- -m, --model
Model(s) to evaluate.
- -e, --evaluation-files
File(s) with paths to evaluation data.
- -d, --device
Select device to use.
- -p, --pad
Left and right padding around lines.
Transcriptions are handed to the command in the same way as for the train command, either through a manifest with -e/–evaluation-files or by just adding a number of image files as the final argument:
$ ketos test -m $model -e test.txt test/*.png
Evaluating $model
Evaluating [####################################] 100%
=== report test_model.mlmodel ===
7012 Characters
6022 Errors
14.12% Accuracy
5226 Insertions
2 Deletions
794 Substitutions
Count Missed %Right
1567 575 63.31% Common
5230 5230 0.00% Arabic
215 215 0.00% Inherited
Errors Correct-Generated
773 { ا } - { }
536 { ل } - { }
328 { و } - { }
274 { ي } - { }
266 { م } - { }
256 { ب } - { }
246 { ن } - { }
241 { SPACE } - { }
207 { ر } - { }
199 { ف } - { }
192 { ه } - { }
174 { ع } - { }
172 { ARABIC HAMZA ABOVE } - { }
144 { ت } - { }
136 { ق } - { }
122 { س } - { }
108 { ، } - { }
106 { د } - { }
82 { ك } - { }
81 { ح } - { }
71 { ج } - { }
66 { خ } - { }
62 { ة } - { }
60 { ص } - { }
39 { ، } - { - }
38 { ش } - { }
30 { ا } - { - }
30 { ن } - { - }
29 { ى } - { }
28 { ذ } - { }
27 { ه } - { - }
27 { ARABIC HAMZA BELOW } - { }
25 { ز } - { }
23 { ث } - { }
22 { غ } - { }
20 { م } - { - }
20 { ي } - { - }
20 { ) } - { }
19 { : } - { }
19 { ط } - { }
19 { ل } - { - }
18 { ، } - { . }
17 { ة } - { - }
16 { ض } - { }
...
Average accuracy: 14.12%, (stddev: 0.00)
The report(s) contains character accuracy measured per script and a detailled list of confusions. When evaluating multiple models the last line of the output will the average accuracy and the standard deviation across all of them.