Training¶
This page describes the training utilities available through the ketos
command line utility in depth. For a gentle introduction on model training
please refer to the tutorial.
Thanks to the magic of Connectionist Temporal Classification prerequisites for creating a
new recognition model are quite modest. The basic requirement is a number of
text lines (ground truth
) that correspond to line images and some time for
training.
Transcription¶
Transcription is done through local browser based HTML transcription
environments. These are created by the ketos transcribe
command line util.
Its basic input is just a number of image files and an output path to write the
HTML file to:
$ ketos transcribe -o output.html image_1.png image_2.png ...
While it is possible to put multiple images into a single transcription environment splitting into one-image-per-HTML will ease parallel transcription by multiple people.
The above command reads in the image files, converts them to black and white, tries to split them into line images, and puts an editable text field next to the image in the HTML. There are a handful of option changing the output:
option |
action |
---|---|
-d, –text-direction |
Sets the principal text direction both for the segmenter and in the HTML. Can be one of horizontal-lr, horizontal-rl, vertical-lr, vertical-rl. |
–scale |
A segmenter parameter giving an estimate of average line height. Usually it shouldn’t be set manually. |
–bw / –orig |
Disables binarization of input images. If color or grayscale training data is desired this option has to be set. |
-m, –maxcolseps |
A segmenter parameter limiting the number of columns that can be found in the input image by setting the maximum number of column separators. Set to 0 to disable column detection. |
-b, –black_colseps / -w, –white_colseps |
A segmenter parameter selecting white or black column separators. |
-f, –font |
The font family to use for rendering the text in the HTML. |
-fs, –font-style |
The font style to use in the HTML. |
-p, –prefill |
A model to use for prefilling the transcription. (Optional) |
-o, –output |
Output HTML file. |
It is possible to use an existing model to prefill the transcription environments:
$ ketos transcribe -p ~/arabic.mlmodel -p output.html image_1.png image_2.png ...
Transcription has to be diplomatic, i.e. contain the exact character sequence in the line image, including original orthography. Some deviations, such as consistently omitting vocalization in Arabic texts, is possible as long as they are systematic and relatively minor.
After transcribing a number of lines the results have to be saved, either using
the Download
button on the lower right or through the regular Save Page
As
function of the browser. All the work done is contained directly in the
saved files and it is possible to save partially transcribed files and continue
work later.
Next the contents of the filled transcription environments have to be
extracted through the ketos extract
command:
$ ketos extract --output output_directory *.html
There are some options dealing with color images and text normalization:
option |
action |
---|---|
-b, –binarize / –no-binarize |
Binarizes color/grayscale images (default) or retains the original in the output. |
-u, –normalization |
Normalizes text to one of the following Unicode normalization forms: NFD, NFKD, NFC, NFKC |
-s, –normalize-whitespace / –no-normalize-whitespace |
Normalizes whitespace in extracted text. There are several different Unicode whitespace characters that are replaced by a standard space when not disabled. |
–reorder / –no-reorder |
Tells ketos to reorder the code
point for each line into
left-to-right order. Unicode
code points are always in
reading order, e.g. the first
code point in an Arabic line
will be the rightmost
character. This option reorders
them into |
-r, –rotate / –no-rotate |
Skips rotation of vertical lines. |
-o, –output |
Output directory, defaults to |
The result will be a directory filled with line image text pairs NNNNNN.png
and NNNNNN.gt.txt
and a manifest.txt
containing a list of all extracted
lines.
Training¶
The training utility allows training of VGSL specified models
both from scratch and from existing models. Training data is in all cases just
a directory containing image-text file pairs as produced by the
transcribe/extract
tools. Here are its command line options:
option |
action |
---|---|
-p, –pad |
Left and right padding around lines |
-o, –output |
Output model file prefix. Defaults to model. |
-s, –spec |
VGSL spec of the network to train. CTC layer will be added automatically. default: [1,48,0,1 Cr3,3,32 Do0.1,2 Mp2,2 Cr3,3,64 Do0.1,2 Mp2,2 S1(1x12)1,3 Lbx100 Do] |
-a, –append |
Removes layers before argument and then appends spec. Only works when loading an existing model |
-i, –load |
Load existing file to continue training |
-F, –savefreq |
Model save frequency in epochs during training |
-R, –report |
Report creation frequency in epochs |
-q, –quit |
Stop condition for training. Set to early for early stopping (default) or dumb for fixed number of epochs. |
-N, –epochs |
Number of epochs to train for. Set to -1 for indefinite training. |
–lag |
Number of epochs to wait before stopping training without improvement. Only used when using early stopping. |
–min-delta |
Minimum improvement between epochs to reset early stopping. Defaults to 0.005. |
-d, –device |
Select device to use (cpu, cuda:0, cuda:1,…). GPU acceleration requires CUDA. |
–optimizer |
Select optimizer (Adam, SGD, RMSprop). |
-r, –lrate |
Learning rate [default: 0.001] |
-m, –momentum |
Momentum used with SGD optimizer. Ignored otherwise. |
-w, –weight-decay |
Weight decay. |
–schedule |
Sets the learning rate scheduler. May be either constant or 1cycle. For 1cycle the cycle length is determined by the –epoch option. |
-p, –partition |
Ground truth data partition ratio between train/validation set |
-u, –normalization |
Ground truth Unicode normalization. One of NFC, NFKC, NFD, NFKD. |
-c, –codec |
Load a codec JSON definition (invalid if loading existing model) |
–resize |
Codec/output layer resizing option. If set to add code points will be added, both will set the layer to match exactly the training data, fail will abort if training data and model codec do not match. Only valid when refining an existing model. |
-n, –reorder / –no-reorder |
Reordering of code points to display order. |
-t, –training-files |
File(s) with additional paths to training data. Used to enforce an explicit train/validation set split and deal with training sets with more lines than the command line can process. Can be used more than once. |
-e, –evaluation-files |
File(s) with paths to evaluation data. Overrides the -p parameter. |
–preload / –no-preload |
Hard enable/disable for training data preloading. Preloading training data into memory is enabled per default for sets with less than 2500 lines. |
–threads |
Number of OpenMP threads when running on CPU. Defaults to min(4, #cores). |
From Scratch¶
The absolut minimal example to train a new model is:
$ ketos train training_data/*.png
Training will continue until the error does not improve anymore and the best model (among intermediate results) will be saved in the current directory.
In some cases, such as color inputs, changing the network architecture might be useful:
$ ketos train -s '[1,0,0,3 Cr3,3,16 Mp3,3 Lfys64 Lbx128 Lbx256 Do]' syr/*.png
Complete documentation for the network description language can be found on the VGSL page.
Sometimes the early stopping default parameters might produce suboptimal results such as stopping training too soon. Adjusting the minimum delta an/or lag can be useful:
$ ketos train --lag 10 --min-delta 0.001 syr/*.png
To switch optimizers from Adam to SGD or RMSprop just set the option:
$ ketos train --optimizer SGD syr/*.png
It is possible to resume training from a previously saved model:
$ ketos train -i model_25.mlmodel syr/*.png
Fine Tuning¶
Fine tuning an existing model for another typeface or new characters is also possible with the same syntax as resuming regular training:
$ ketos train -i model_best.mlmodel syr/*.png
The caveat is that the alphabet of the base model and training data have to be an exact match. Otherwise an error will be raised:
$ ketos train -i model_5.mlmodel --no-preload kamil/*.png
Building training set [####################################] 100%
Building validation set [####################################] 100%
[0.8616] alphabet mismatch {'~', '»', '8', '9', 'ـ'}
Network codec not compatible with training set
[0.8620] Training data and model codec alphabets mismatch: {'ٓ', '؟', '!', 'ص', '،', 'ذ', 'ة', 'ي', 'و', 'ب', 'ز', 'ح', 'غ', '~', 'ف', ')', 'د', 'خ', 'م', '»', 'ع', 'ى', 'ق', 'ش', 'ا', 'ه', 'ك', 'ج', 'ث', '(', 'ت', 'ظ', 'ض', 'ل', 'ط', '؛', 'ر', 'س', 'ن', 'ء', 'ٔ', '«', 'ـ', 'ٕ'}
There are two modes dealing with mismatching alphabets, add
and both
.
add
resizes the output layer and codec of the loaded model to include all
characters in the new training set without removing any characters. both
will make the resulting model an exact match with the new training set by both
removing unused characters from the model and adding new ones.
$ ketos -v train --resize add -i model_5.mlmodel syr/*.png
...
[0.7943] Training set 788 lines, validation set 88 lines, alphabet 50 symbols
...
[0.8337] Resizing codec to include 3 new code points
[0.8374] Resizing last layer in network to 52 outputs
...
In this example 3 characters were added for a network that is able to recognize 52 different characters after sufficient additional training.
$ ketos -v train --resize both -i model_5.mlmodel syr/*.png
...
[0.7593] Training set 788 lines, validation set 88 lines, alphabet 49 symbols
...
[0.7857] Resizing network or given codec to 49 code sequences
[0.8344] Deleting 2 output classes from network (46 retained)
...
In both
mode 2 of the original characters were removed and 3 new ones were added.
Slicing¶
Refining on mismatched alphabets has its limits. If the alphabets are highly different the modification of the final linear layer to add/remove character will destroy the inference capabilities of the network. In those cases it is faster to slice off the last few layers of the network and only train those instead of a complete network from scratch.
Taking the default network definition as printed in the debug log we can see the layer indices of the model:
[0.8760] Creating new model [1,48,0,1 Cr3,3,32 Do0.1,2 Mp2,2 Cr3,3,64 Do0.1,2 Mp2,2 S1(1x12)1,3 Lbx100 Do] with 48 outputs
[0.8762] layer type params
[0.8790] 0 conv kernel 3 x 3 filters 32 activation r
[0.8795] 1 dropout probability 0.1 dims 2
[0.8797] 2 maxpool kernel 2 x 2 stride 2 x 2
[0.8802] 3 conv kernel 3 x 3 filters 64 activation r
[0.8804] 4 dropout probability 0.1 dims 2
[0.8806] 5 maxpool kernel 2 x 2 stride 2 x 2
[0.8813] 6 reshape from 1 1 x 12 to 1/3
[0.8876] 7 rnn direction b transposed False summarize False out 100 legacy None
[0.8878] 8 dropout probability 0.5 dims 1
[0.8883] 9 linear augmented False out 48
To remove everything after the initial convolutional stack and add untrained layers we define a network stub and index for appending:
$ ketos train -i model_1.mlmodel --append 7 -s '[Lbx256 Do]' syr/*.png
Building training set [####################################] 100%
Building validation set [####################################] 100%
[0.8014] alphabet mismatch {'8', '3', '9', '7', '܇', '݀', '݂', '4', ':', '0'}
Slicing and dicing model ✓
The new model will behave exactly like a new one, except potentially training a lot faster.
Testing¶
Picking a particular model from a pool or getting a more detailled look on the recognition accuracy can be done with the test command. It uses transcribed lines, the test set, in the same format as the train command, recognizes the line images with one or more models, and creates a detailled report of the differences from the ground truth for each of them.
- -m, --model
Model(s) to evaluate.
- -e, --evaluation-files
File(s) with paths to evaluation data.
- -d, --device
Select device to use.
- -p, --pad
Left and right padding around lines.
Transcriptions are handed to the command in the same way as for the train command, either through a manifest with -e/–evaluation-files or by just adding a number of image files as the final argument:
$ ketos test -m $model -e test.txt test/*.png
Evaluating $model
Evaluating [####################################] 100%
=== report test_model.mlmodel ===
7012 Characters
6022 Errors
14.12% Accuracy
5226 Insertions
2 Deletions
794 Substitutions
Count Missed %Right
1567 575 63.31% Common
5230 5230 0.00% Arabic
215 215 0.00% Inherited
Errors Correct-Generated
773 { ا } - { }
536 { ل } - { }
328 { و } - { }
274 { ي } - { }
266 { م } - { }
256 { ب } - { }
246 { ن } - { }
241 { SPACE } - { }
207 { ر } - { }
199 { ف } - { }
192 { ه } - { }
174 { ع } - { }
172 { ARABIC HAMZA ABOVE } - { }
144 { ت } - { }
136 { ق } - { }
122 { س } - { }
108 { ، } - { }
106 { د } - { }
82 { ك } - { }
81 { ح } - { }
71 { ج } - { }
66 { خ } - { }
62 { ة } - { }
60 { ص } - { }
39 { ، } - { - }
38 { ش } - { }
30 { ا } - { - }
30 { ن } - { - }
29 { ى } - { }
28 { ذ } - { }
27 { ه } - { - }
27 { ARABIC HAMZA BELOW } - { }
25 { ز } - { }
23 { ث } - { }
22 { غ } - { }
20 { م } - { - }
20 { ي } - { - }
20 { ) } - { }
19 { : } - { }
19 { ط } - { }
19 { ل } - { - }
18 { ، } - { . }
17 { ة } - { - }
16 { ض } - { }
...
Average accuracy: 14.12%, (stddev: 0.00)
The report(s) contains character accuracy measured per script and a detailled list of confusions. When evaluating multiple models the last line of the output will the average accuracy and the standard deviation across all of them.
Artificial Training Data¶
It is possible to rely on artificially created training data, instead of laborously creating ground truth by manual means. A proper typeface and some text in the target language will be needed.
For many popular historical fonts there are free reproductions which quite closely match printed editions. Most are available in your distribution’s
repositories and often shipped with TeX Live.
Some good places to start for non-Latin scripts are:
Amiri, a classical Arabic typeface by Khaled Hosny
The Greek Font Society offers freely licensed (historical) typefaces for polytonic Greek.
The friendly religious fanatics from SIL assemble a wide variety of fonts for non-Latin scripts.
Next we need some text to generate artificial line images from. It should be a typical example of the type of printed works you want to recognize and at least 500-1000 lines in length.
A minimal invocation to the line generation tool will look like this:
$ ketos linegen -f Amiri da1.txt da2.txt
Reading texts ✓
Read 3692 unique lines
Σ (len: 99)
Symbols: !(),-./0123456789:ABEFGHILMNPRS[]_acdefghiklmnoprstuvyz«»،؟ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىيپ
Writing images ✓
The output will be written to a directory called training_data
, although
this may be changed using the -o
option. Each text line is rendered using
the Amiri typeface.
Alphabet and Normalization¶
Let’s take a look at important information in the preamble:
Read 3692 unique lines
Σ (len: 99)
Symbols: !(),-./0123456789:ABEFGHILMNPRS[]_acdefghiklmnoprstuvyz«»،؟ﺀﺁﺃﺅﺈﺋﺎﺑﺔﺘﺜﺠﺤﺧﺩﺫﺭﺰﺴﺸﺼﻀﻄﻈﻌﻐـﻔﻘﻜﻠﻤﻨﻫﻭﻰﻳپ
ketos tells us that it found 3692 unique lines which contained 99 different
symbols
or code points
. We can see the training data contains all of
the Arabic script including accented precomposed characters, but only a subset
of Latin characters, numerals, and punctuation. A trained model will be able to
recognize only these exact symbols, e.g. a C
or j
on the page will
never be recognized. Either accept this limitation or add additional text lines
to the training corpus until the alphabet matches your needs.
We can also force a normalization form using the -u
option; per default
none is applied. For example:
$ ketos linegen -u NFD -f "GFS Philostratos" grc.txt
Reading texts ✓
Read 2860 unique lines
Σ (len: 132)
Symbols: #&'()*,-./0123456789:;ABCDEGHILMNOPQRSTVWXZ]abcdefghiklmnopqrstuvxy §·ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρςστυφχψω—‘’“
Combining Characters: COMBINING GRAVE ACCENT, COMBINING ACUTE ACCENT, COMBINING DIAERESIS, COMBINING COMMA ABOVE, COMBINING REVERSED COMMA ABOVE, COMBINING DOT BELOW, COMBINING GREEK PERISPOMENI, COMBINING GREEK YPOGEGRAMMENI
$ ketos linegen -u NFC -f "GFS Philostratos" grc.txt
Reading texts ✓
Read 2860 unique lines
Σ (len: 231)
Symbols: #&'()*,-./0123456789:;ABCDEGHILMNOPQRSTVWXZ]abcdefghiklmnopqrstuvxy §·ΐΑΒΓΔΕΖΘΙΚΛΜΝΞΟΠΡΣΤΦΧΨΩάέήίαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώἀἁἂἃἄἅἈἌἎἐἑἓἔἕἘἙἜἝἠἡἢἣἤἥἦἧἩἭἮἰἱἳἴἵἶἷἸἹἼὀὁὂὃὄὅὈὉὌὐὑὓὔὕὖὗὙὝὠὡὢὤὥὦὧὨὩὰὲὴὶὸὺὼᾄᾐᾑᾔᾗᾠᾤᾧᾳᾶᾷῃῄῆῇῒῖῥῦῬῳῴῶῷ—‘’“
Combining Characters: COMBINING ACUTE ACCENT, COMBINING DOT BELOW
While there hasn’t been any study on the effect of different normalizations on recognition accuracy there are some benefits to NFD, namely decreased model size and easier validation of the alphabet.
Other Parameters¶
Sometimes it is desirable to draw a certain number of lines randomly from one
or more large texts. The -n
option does just that:
$ ketos linegen -u NFD -n 100 -f Amiri da1.txt da2.txt da3.txt da4.txt
Reading texts ✓
Read 114265 unique lines
Sampling 100 lines ✓
Σ (len: 64)
Symbols: !(),-./0123456789:[]{}«»،؛؟ءابةتثجحخدذرزسشصضطظعغـفقكلمنهوىي–
Combining Characters: ARABIC MADDAH ABOVE, ARABIC HAMZA ABOVE, ARABIC HAMZA BELOW
Writing images ⢿
It is also possible to adjust to amount of degradation/distortion of line
images by using the -s/-r/-d/-ds
switches:
$ ketos linegen -m 0.2 -s 0.002 -r 0.001 -d 3 Downloads/D/A/da1.txt
Reading texts ✓
Read 859 unique lines
Σ (len: 46)
Symbols: !"-.:،؛؟ءآأؤإئابةتثجحخدذرزسشصضطظعغفقكلمنهوىي
Writing images ⣽
Sometimes the shaping engine misbehaves using some fonts (notably GFS
Philostratos
) by rendering texts in certain normalizations incorrectly if the
font does not contain glyphs for decomposed characters. One sign are misplaced
diacritics and glyphs in different fonts. A workaround is renormalizing the
text for rendering purposes (here to NFC):
$ ketos linegen -ur NFC -u NFD -f "GFS Philostratos" grc.txt