Training a kraken model

kraken is an optical character recognition package that can be trained fairly easily for a large number of scripts. In contrast to other system requiring segmentation down to glyph level before classification, it is uniquely suited for the recognition of connected scripts, because the neural network is trained to assign correct character to unsegmented training data.

Training a new model for kraken requires a variable amount of training data manually generated from page images which have to be typographically similar to the target prints that are to be recognized. As the system works on unsegmented inputs for both training and recognition and its base unit is a text line, training data are just transcriptions aligned to line images.

Installing kraken

The easiest way to install and use kraken is through conda. kraken works both on Linux and Mac OS X. After installing conda, download the environment file and create the environment for kraken:

$ wget https://raw.githubusercontent.com/mittagessen/kraken/master/environment.yml
$ conda env create -f environment.yml

Each time you want to use the kraken environment in a shell is has to be activated first:

$ conda activate kraken

Image acquisition and preprocessing

First a number of high quality scans, preferably color or grayscale and at least 300dpi are required. Scans should be in a lossless image format such as TIFF or PNG, images in PDF files have to be extracted beforehand using a tool such as pdftocairo or pdfimages. While each of these requirements can be relaxed to a degree, the final accuracy will suffer to some extent. For example, only slightly compressed JPEG scans are generally suitable for training and recognition.

Depending on the source of the scans some preprocessing such as splitting scans into pages, correcting skew and warp, and removing speckles is usually required. For complex layouts such as newspapers it is advisable to split the page manually into columns as the line extraction algorithm run to create transcription environments does not deal well with non-codex page layouts. A fairly user-friendly software for semi-automatic batch processing of image scans is Scantailor albeit most work can be done using a standard image editor.

The total number of scans required depends on the nature of the script to be recognized. Only features that are found on the page images and training data derived from it can later be recognized, so it is important that the coverage of typographic features is exhaustive. Training a single script model for a fairly small script such as Arabic or Hebrew requires at least 800 lines, while multi-script models, e.g. combined polytonic Greek and Latin, will require significantly more transcriptions.

There is no hard rule for the amount of training data and it may be required to retrain a model after the initial training data proves insufficient. Most western texts contain between 25 and 40 lines per page, therefore upward of 30 pages have to be preprocessed and later transcribed.

Transcription

Transcription is done through local browser based HTML transcription environments. These are created by the ketos transcribe command line util that is part of kraken. Its basic input is just a number of image files and an output path to write the HTML file to:

$ ketos transcribe -o output.html image_1.png image_2.png ...

While it is possible to put multiple images into a single transcription environment splitting into one-image-per-HTML will ease parallel transcription by multiple people.

The above command reads in the image files, converts them to black and white if necessary, tries to split them into line images, and puts an editable text field next to the image in the HTML.

Transcription has to be diplomatic, i.e. contain the exact character sequence in the line image, including original orthography. Some deviations, such as consistently omitting vocalization in Arabic texts, is possible as long as they are systematic and relatively minor.

Note

The page segmentation algorithm extracting lines from images is optimized for western page layouts and may recognize lines erroneously, lumping multiple lines together or cutting them in half. The most efficient way to deal with these errors is just skipping the affected lines by leaving the text box empty.

Tip

Copy-paste transcription can significantly speed up the whole process. Either transcribe scans of a work where a digital edition already exists (but does not for typographically similar prints) or find a sufficiently similar edition as a base.

After transcribing a number of lines the results have to be saved, either using the Download button on the lower left or through the regular Save Page As (CTRL+S) function of the browser. All the work done is contained directly in the saved files and it is possible to save partially transcribed files and continue work later.

Next the contents of the filled transcription environments have to be extracted through the ketos extract command:

$ ketos extract --output output_directory --normalization NFD *.html

with

--output

The output directory where all line image-text pairs (training data) are written, defaulting to training/

--normalization

Unicode has code points to encode most glyphs encountered in the wild. A lesser known feature is that there usually are multiple ways to encode a glyph. Unicode normalization ensures that equal glyphs are encoded in the same way, i.e. that the encoded representation across the training data set is consistent and there is only one way the network can recognize a particular feature on the page. Usually it is sufficient to set the normalization to Normalization Form Decomposed (NFD), as it reduces the the size of the overall script to be recognized slightly.

The result will be a directory filled with line image text pairs NNNNNN.png and NNNNNN.gt.txt and a manifest.txt containing a list of all extracted lines.

Note

At this point it is recommended to review the content of the training data directory before proceeding.

Training

The training data in output_dir may now be used to train a new model by invoking the ketos train command. Just hand a list of images to the command such as:

$ ketos train output_dir/*.png

to start training.

A number of lines will be split off into a separate held-out set that is used to estimate the actual recognition accuracy achieved in the real world. These are never shown to the network during training but will be recognized periodically to evaluate the accuracy of the model. Per default the validation set will comprise of 10% of the training data.

Basic model training is mostly automatic albeit there are multiple parameters that can be adjusted:

--output

Sets the prefix for models generated during training. They will best as prefix_epochs.mlmodel.

--report

How often evaluation passes are run on the validation set. It is an integer equal or larger than 1 with 1 meaning a report is created each time the complete training set has been seen by the network.

--savefreq

How often intermediate models are saved to disk. It is an integer with the same semantics as --report.

--load

Continuing training is possible by loading an existing model file with --load. To continue training from a base model with another training set refer to the full ketos documentation.

--preload

Enables/disables preloading of the training set into memory for accelerated training. The default setting preloads data sets with less than 2500 lines, explicitly adding --preload will preload arbitrary sized sets. --no-preload disables preloading in all circumstances.

Training a network will take some time on a modern computer, even with the default parameters. While the exact time required is unpredictable as training is a somewhat random process a rough guide is that accuracy seldomly improves after 50 epochs reached between 8 and 24 hours of training.

When to stop training is a matter of experience; the default setting employs a fairly reliable approach known as early stopping that stops training as soon as the error rate on the validation set doesn’t improve anymore. This will prevent overfitting, i.e. fitting the model to recognize only the training data properly instead of the general patterns contained therein.

$ ketos train output_dir/*.png
Building training set  [####################################]  100%
Building validation set  [####################################]  100%
[270.2364] alphabet mismatch {'9', '8', '݂', '3', '݀', '4', '1', '7', '5', '\xa0'}
Initializing model ✓
Accuracy report (0) -1.5951 3680 9550
epoch 0/-1  [####################################]  788/788
Accuracy report (1) 0.0245 3504 3418
epoch 1/-1  [####################################]  788/788
Accuracy report (2) 0.8445 3504 545
epoch 2/-1  [####################################]  788/788
Accuracy report (3) 0.9541 3504 161
epoch 3/-1  [------------------------------------]  13/788  0d 00:22:09
...

By now there should be a couple of models model_name-1.mlmodel, model_name-2.mlmodel, … in the directory the script was executed in. Lets take a look at each part of the output.

Building training set  [####################################]  100%
Building validation set  [####################################]  100%

shows the progress of loading the training and validation set into memory. This might take a while as preprocessing the whole set and putting it into memory is computationally intensive. Loading can be made faster without preloading at the cost of performing preprocessing repeatedlyduring the training process.

[270.2364] alphabet mismatch {'9', '8', '݂', '3', '݀', '4', '1', '7', '5', '\xa0'}

is a warning about missing characters in either the validation or training set, i.e. that the alphabets of the sets are not equal. Increasing the size of the validation set will often remedy this warning.

Accuracy report (2) 0.8445 3504 545

this line shows the results of the validation set evaluation. The error after 2 epochs is 545 incorrect characters out of 3504 characters in the validation set for a character accuracy of 84.4%. It should decrease fairly rapidly. If accuracy remains around 0.30 something is amiss, e.g. non-reordered right-to-left or wildly incorrect transcriptions. Abort training, correct the error(s) and start again.

After training is finished the best model is saved as model_name_best.mlmodel. It is highly recommended to also archive the training log and data for later reference.

ketos can also produce more verbose output with training set and network information by appending one or more -v to the command:

$ ketos -vv train syr/*.png
[0.7272] Building ground truth set from 876 line images
[0.7281] Taking 88 lines from training for evaluation
...
[0.8479] Training set 788 lines, validation set 88 lines, alphabet 48 symbols
[0.8481] alphabet mismatch {'\xa0', '0', ':', '݀', '܇', '݂', '5'}
[0.8482] grapheme       count
[0.8484] SPACE  5258
[0.8484]        ܐ       3519
[0.8485]        ܘ       2334
[0.8486]        ܝ       2096
[0.8487]        ܠ       1754
[0.8487]        ܢ       1724
[0.8488]        ܕ       1697
[0.8489]        ܗ       1681
[0.8489]        ܡ       1623
[0.8490]        ܪ       1359
[0.8491]        ܬ       1339
[0.8491]        ܒ       1184
[0.8492]        ܥ       824
[0.8492]        .       811
[0.8493] COMBINING DOT BELOW    646
[0.8493]        ܟ       599
[0.8494]        ܫ       577
[0.8495] COMBINING DIAERESIS    488
[0.8495]        ܚ       431
[0.8496]        ܦ       428
[0.8496]        ܩ       307
[0.8497] COMBINING DOT ABOVE    259
[0.8497]        ܣ       256
[0.8498]        ܛ       204
[0.8498]        ܓ       176
[0.8499]        ܀       132
[0.8499]        ܙ       81
[0.8500]        *       66
[0.8501]        ܨ       59
[0.8501]        ܆       40
[0.8502]        [       40
[0.8503]        ]       40
[0.8503]        1       18
[0.8504]        2       11
[0.8504]        ܇       9
[0.8505]        3       8
[0.8505]                6
[0.8506]        5       5
[0.8506] NO-BREAK SPACE 4
[0.8507]        0       4
[0.8507]        6       4
[0.8508]        :       4
[0.8508]        8       4
[0.8509]        9       3
[0.8510]        7       3
[0.8510]        4       3
[0.8511] SYRIAC FEMININE DOT    1
[0.8511] SYRIAC RUKKAKHA        1
[0.8512] Encoding training set
[0.9315] Creating new model [1,1,0,48 Lbx100 Do] with 49 outputs
[0.9318] layer          type    params
[0.9350] 0              rnn     direction b transposed False summarize False out 100 legacy None
[0.9361] 1              dropout probability 0.5 dims 1
[0.9381] 2              linear  augmented False out 49
[0.9918] Constructing RMSprop optimizer (lr: 0.001, momentum: 0.9)
[0.9920] Set OpenMP threads to 4
[0.9920] Moving model to device cpu
[0.9924] Starting evaluation run

indicates that the training is running on 788 transcribed lines and a validation set of 88 lines. 49 different classes, i.e. Unicode code points, where found in these 788 lines. These affect the output size of the network; obviously only these 49 different classes/code points can later be output by the network. Importantly, we can see that certain characters occur markedly less often than others. Characters like the Syriac feminine dot and numerals that occur less than 10 times will most likely not be recognized well by the trained net.

Evaluation and Validation

While output during training is detailed enough to know when to stop training one usually wants to know the specific kinds of errors to expect. Doing more in-depth error analysis also allows to pinpoint weaknesses in the training data, e.g. above average error rates for numerals indicate either a lack of representation of numerals in the training data or erroneous transcription in the first place.

First the trained model has to be applied to some line transcriptions with the ketos test command:

$ ketos test -m syriac_best.mlmodel lines/*.png
Loading model syriac_best.mlmodel ✓
Evaluating syriac_best.mlmodel
Evaluating  [#-----------------------------------]    3%  00:04:56
...

After all lines have been processed a evaluation report will be printed:

=== report  ===

35619     Characters
336       Errors
99.06%    Accuracy

157       Insertions
81        Deletions
98        Substitutions

Count     Missed  %Right
27046     143     99.47%  Syriac
7015      52      99.26%  Common
1558      60      96.15%  Inherited

Errors    Correct-Generated
25        {  } - { COMBINING DOT BELOW }
25        { COMBINING DOT BELOW } - {  }
15        { . } - {  }
15        { COMBINING DIAERESIS } - {  }
12        { ܢ } - {  }
10        {  } - { . }
8 { COMBINING DOT ABOVE } - {  }
8 { ܝ } - {  }
7 { ZERO WIDTH NO-BREAK SPACE } - {  }
7 { ܆ } - {  }
7 { SPACE } - {  }
7 { ܣ } - {  }
6 {  } - { ܝ }
6 { COMBINING DOT ABOVE } - { COMBINING DIAERESIS }
5 { ܙ } - {  }
5 { ܬ } - {  }
5 {  } - { ܢ }
4 { NO-BREAK SPACE } - {  }
4 { COMBINING DIAERESIS } - { COMBINING DOT ABOVE }
4 {  } - { ܒ }
4 {  } - { COMBINING DIAERESIS }
4 { ܗ } - {  }
4 {  } - { ܬ }
4 {  } - { ܘ }
4 { ܕ } - { ܢ }
3 {  } - { ܕ }
3 { ܐ } - {  }
3 { ܗ } - { ܐ }
3 { ܝ } - { ܢ }
3 { ܀ } - { . }
3 {  } - { ܗ }

  .....

The first section of the report consists of a simple accounting of the number of characters in the ground truth, the errors in the recognition output and the resulting accuracy in per cent.

The next table lists the number of insertions (characters occuring in the ground truth but not in the recognition output), substitutions (misrecognized characters), and deletions (superfluous characters recognized by the model).

Next is a grouping of errors (insertions and substitutions) by Unicode script.

The final part of the report are errors sorted by frequency and a per character accuracy report. Importantly most errors are incorrect recognition of combining marks such as dots and diaereses. These may have several sources: different dot placement in training and validation set, incorrect transcription such as non-systematic transcription, or unclean speckled scans. Depending on the error source, correction most often involves adding more training data and fixing transcriptions. Sometimes it may even be advisable to remove unrepresentative data from the training set.

Recognition

The kraken utility is employed for all non-training related tasks. Optical character recognition is a multi-step process consisting of binarization (conversion of input images to black and white), page segmentation (extracting lines from the image), and recognition (converting line image to character sequences). All of these may be run in a single call like this:

$ kraken -i INPUT_IMAGE OUTPUT_FILE binarize segment ocr -m MODEL_FILE

producing a text file from the input image. There are also hocr and ALTO output formats available through the appropriate switches:

$ kraken -i ... ocr -h
$ kraken -i ... ocr -a

For debugging purposes it is sometimes helpful to run each step manually and inspect intermediate results:

$ kraken -i INPUT_IMAGE BW_IMAGE binarize
$ kraken -i BW_IMAGE LINES segment
$ kraken -i BW_IMAGE OUTPUT_FILE ocr -l LINES ...

It is also possible to recognize more than one file at a time by just chaining -i ... ... clauses like this:

$ kraken -i input_1 output_1 -i input_2 output_2 ...

Finally, there is an central repository containing freely available models. Getting a list of all available models:

$ kraken list

Retrieving model metadata for a particular model:

$ kraken show arabic-alam-al-kutub
name: arabic-alam-al-kutub.mlmodel

An experimental model for Classical Arabic texts.

Network trained on 889 lines of [0] as a test case for a general Classical
Arabic model. Ground truth was prepared by Sarah Savant
<sarah.savant@aku.edu> and Maxim Romanov <maxim.romanov@uni-leipzig.de>.

Vocalization was omitted in the ground truth. Training was stopped at ~35000
iterations with an accuracy of 97%.

[0] Ibn al-Faqīh (d. 365 AH). Kitāb al-buldān. Edited by Yūsuf al-Hādī, 1st
edition. Bayrūt: ʿĀlam al-kutub, 1416 AH/1996 CE.
alphabet:  !()-.0123456789:[] «»،؟ءابةتثجحخدذرزسشصضطظعغفقكلمنهوىي ARABIC
MADDAH ABOVE, ARABIC HAMZA ABOVE, ARABIC HAMZA BELOW

and actually fetching the model:

$ kraken get arabic-alam-al-kutub

The downloaded model can then be used for recognition by the name shown in its metadata, e.g.:

$ kraken -i INPUT_IMAGE OUTPUT_FILE binarize segment ocr -m arabic-alam-al-kutub.mlmodel

For more documentation see the kraken website.