.. _training:

Training a kraken model
=======================

kraken is an optical character recognition package that can be trained fairly
easily for a large number of scripts. In contrast to other systems requiring
segmentation down to glyph level before classification, it is uniquely suited
for the recognition of connected scripts, because the neural network is trained
to assign the correct characters to unsegmented training data.

Training a new model for kraken requires a variable amount of training data
manually generated from page images which have to be typographically similar to
the target prints that are to be recognized. As the system works on unsegmented
inputs for both training and recognition and its base unit is a text line,
training data are just transcriptions aligned to line images.

Installing kraken
-----------------

The easiest way to install and use kraken is through `conda `_. kraken works
both on Linux and Mac OS X. After installing conda, download the environment
file and create the environment for kraken:

.. code-block:: console

    $ wget https://raw.githubusercontent.com/mittagessen/kraken/master/environment.yml
    $ conda env create -f environment.yml

Each time you want to use the kraken environment in a shell it has to be
activated first:

.. code-block:: console

    $ conda activate kraken

Image acquisition and preprocessing
-----------------------------------

First, a number of high quality scans, preferably color or grayscale and at
least 300dpi, are required. Scans should be in a lossless image format such as
TIFF or PNG; images in PDF files have to be extracted beforehand using a tool
such as ``pdftocairo`` or ``pdfimages``. While each of these requirements can
be relaxed to a degree, the final accuracy will suffer to some extent. For
example, only slightly compressed JPEG scans are generally suitable for
training and recognition.

Depending on the source of the scans some preprocessing such as splitting scans
into pages, correcting skew and warp, and removing speckles is usually
required. For complex layouts such as newspapers it is advisable to split the
page manually into columns as the line extraction algorithm run to create
transcription environments does not deal well with non-codex page layouts. A
fairly user-friendly tool for semi-automatic batch processing of image scans is
`Scantailor `_, although most work can be done using a standard image editor.

The total number of scans required depends on the nature of the script to be
recognized. Only features that are found on the page images and in the training
data derived from them can later be recognized, so it is important that the
coverage of typographic features is exhaustive. Training a single script model
for a fairly small script such as Arabic or Hebrew requires at least 800 lines,
while multi-script models, e.g. combined polytonic Greek and Latin, will
require significantly more transcriptions.

There is no hard rule for the amount of training data and it may be required to
retrain a model after the initial training data proves insufficient. Most
``western`` texts contain between 25 and 40 lines per page; therefore upward of
30 pages have to be preprocessed and later transcribed.

Transcription
-------------

Transcription is done through local browser based HTML transcription
environments. These are created by the ``ketos transcribe`` command line
utility that is part of kraken. Its basic input is just a number of image files
and an output path to write the HTML file to:

.. code-block:: console

    $ ketos transcribe -o output.html image_1.png image_2.png ...

While it is possible to put multiple images into a single transcription
environment, splitting into one-image-per-HTML will ease parallel transcription
by multiple people.

The above command reads in the image files, converts them to black and white if
necessary, tries to split them into line images, and puts an editable text
field next to the image in the HTML.

Transcription has to be diplomatic, i.e. contain the exact character sequence
in the line image, including original orthography. Some deviations, such as
consistently omitting vocalization in Arabic texts, are possible as long as
they are systematic and relatively minor.

.. note::

    The page segmentation algorithm extracting lines from images is optimized
    for ``western`` page layouts and may recognize lines erroneously, lumping
    multiple lines together or cutting them in half. The most efficient way to
    deal with these errors is just skipping the affected lines by leaving the
    text box empty.

.. tip::

    Copy-paste transcription can significantly speed up the whole process.
    Either transcribe scans of a work where a digital edition already exists
    (but does not for typographically similar prints) or find a sufficiently
    similar edition as a base.

After transcribing a number of lines the results have to be saved, either using
the ``Download`` button on the lower left or through the regular ``Save Page
As`` (CTRL+S) function of the browser. All the work done is contained directly
in the saved files and it is possible to save partially transcribed files and
continue work later.

Next the contents of the filled transcription environments have to be
extracted through the ``ketos extract`` command:

.. code-block:: console

    $ ketos extract --output output_directory --normalization NFD *.html

with

--output
    The output directory where all line image-text pairs (training data) are
    written, defaulting to ``training/``
--normalization
    Unicode has code points to encode most glyphs encountered in the wild. A
    lesser known feature is that there usually are multiple ways to encode a
    glyph. `Unicode normalization `_ ensures that equal glyphs are encoded in
    the same way, i.e. that the encoded representation across the training
    data set is consistent and there is only one way the network can recognize
    a particular feature on the page. Usually it is sufficient to set the
    normalization to Normalization Form Decomposed (NFD), as it reduces the
    size of the overall script to be recognized slightly.

The result will be a directory filled with line image-text pairs
``NNNNNN.png`` and ``NNNNNN.gt.txt`` and a ``manifest.txt`` containing a list
of all extracted lines.

.. note::

    At this point it is recommended to review the content of the training data
    directory before proceeding.

Training
--------

The training data in ``output_dir`` may now be used to train a new model by
invoking the ``ketos train`` command. Just hand a list of images to the command
such as:

.. code-block:: console

    $ ketos train output_dir/*.png

to start training.

A number of lines will be split off into a separate held-out set that is used
to estimate the actual recognition accuracy achieved in the real world. These
lines are never shown to the network during training but will be recognized
periodically to evaluate the accuracy of the model. By default the validation
set will comprise 10% of the training data.

Basic model training is mostly automatic, although there are multiple
parameters that can be adjusted:

--output
    Sets the prefix for models generated during training. They will be saved
    as ``prefix_epochs.mlmodel``.
--report
    How often evaluation passes are run on the validation set. It is an
    integer equal to or larger than 1, with 1 meaning a report is created each
    time the complete training set has been seen by the network.
--savefreq
    How often intermediate models are saved to disk. It is an integer with the
    same semantics as ``--report``.
--load
    Continuing training is possible by loading an existing model file with
    ``--load``. To continue training from a base model with another training
    set refer to the full :ref:`ketos ` documentation.

--preload
    Enables/disables preloading of the training set into memory for
    accelerated training. The default setting preloads data sets with fewer
    than 2500 lines; explicitly adding ``--preload`` will preload arbitrarily
    sized sets. ``--no-preload`` disables preloading in all circumstances.

Training a network will take some time on a modern computer, even with the
default parameters. While the exact time required is unpredictable as training
is a somewhat random process, a rough guide is that accuracy seldom improves
after 50 epochs, which are reached after between 8 and 24 hours of training.

When to stop training is a matter of experience; the default setting employs a
fairly reliable approach known as `early stopping `_ that stops training as
soon as the error rate on the validation set doesn't improve anymore. This will
prevent `overfitting `_, i.e. fitting the model to recognize only the training
data properly instead of the general patterns contained therein.

.. code-block:: console

    $ ketos train output_dir/*.png
    Building training set  [####################################]  100%
    Building validation set  [####################################]  100%
    [270.2364] alphabet mismatch {'9', '8', '݂', '3', '݀', '4', '1', '7', '5', '\xa0'}
    Initializing model ✓
    Accuracy report (0) -1.5951 3680 9550
    epoch 0/-1  [####################################]  788/788
    Accuracy report (1) 0.0245 3504 3418
    epoch 1/-1  [####################################]  788/788
    Accuracy report (2) 0.8445 3504 545
    epoch 2/-1  [####################################]  788/788
    Accuracy report (3) 0.9541 3504 161
    epoch 3/-1  [------------------------------------]  13/788  0d 00:22:09
    ...

By now there should be a couple of models ``model_name-1.mlmodel``,
``model_name-2.mlmodel``, ... in the directory the script was executed in.
Let's take a look at each part of the output.

.. code-block:: console

    Building training set  [####################################]  100%
    Building validation set  [####################################]  100%

shows the progress of loading the training and validation set into memory. This
might take a while as preprocessing the whole set and putting it into memory is
computationally intensive. Loading can be made faster without preloading at the
cost of performing preprocessing repeatedly during the training process.

.. code-block:: console

    [270.2364] alphabet mismatch {'9', '8', '݂', '3', '݀', '4', '1', '7', '5', '\xa0'}

is a warning about missing characters in either the validation or training set,
i.e. that the alphabets of the sets are not equal. Increasing the size of the
validation set will often remedy this warning.

.. code-block:: console

    Accuracy report (2) 0.8445 3504 545

this line shows the results of the validation set evaluation. The error after 2
epochs is 545 incorrect characters out of 3504 characters in the validation set
for a character accuracy of 84.4%. The error should decrease fairly rapidly. If
accuracy remains around 0.30 something is amiss, e.g. non-reordered
right-to-left text or wildly incorrect transcriptions. Abort training, correct
the error(s) and start again.

After training is finished the best model is saved as
``model_name_best.mlmodel``. It is highly recommended to also archive the
training log and data for later reference.

``ketos`` can also produce more verbose output with training set and network
information by appending one or more ``-v`` to the command:

.. code-block:: console

    $ ketos -vv train syr/*.png
    [0.7272] Building ground truth set from 876 line images
    [0.7281] Taking 88 lines from training for evaluation
    ...
    [0.8479] Training set 788 lines, validation set 88 lines, alphabet 48 symbols
    [0.8481] alphabet mismatch {'\xa0', '0', ':', '݀', '܇', '݂', '5'}
    [0.8482] grapheme count
    [0.8484] SPACE 5258
    [0.8484] ܐ 3519
    [0.8485] ܘ 2334
    [0.8486] ܝ 2096
    [0.8487] ܠ 1754
    [0.8487] ܢ 1724
    [0.8488] ܕ 1697
    [0.8489] ܗ 1681
    [0.8489] ܡ 1623
    [0.8490] ܪ 1359
    [0.8491] ܬ 1339
    [0.8491] ܒ 1184
    [0.8492] ܥ 824
    [0.8492] . 811
    [0.8493] COMBINING DOT BELOW 646
    [0.8493] ܟ 599
    [0.8494] ܫ 577
    [0.8495] COMBINING DIAERESIS 488
    [0.8495] ܚ 431
    [0.8496] ܦ 428
    [0.8496] ܩ 307
    [0.8497] COMBINING DOT ABOVE 259
    [0.8497] ܣ 256
    [0.8498] ܛ 204
    [0.8498] ܓ 176
    [0.8499] ܀ 132
    [0.8499] ܙ 81
    [0.8500] * 66
    [0.8501] ܨ 59
    [0.8501] ܆ 40
    [0.8502] [ 40
    [0.8503] ] 40
    [0.8503] 1 18
    [0.8504] 2 11
    [0.8504] ܇ 9
    [0.8505] 3 8
    [0.8505] 6
    [0.8506] 5 5
    [0.8506] NO-BREAK SPACE 4
    [0.8507] 0 4
    [0.8507] 6 4
    [0.8508] : 4
    [0.8508] 8 4
    [0.8509] 9 3
    [0.8510] 7 3
    [0.8510] 4 3
    [0.8511] SYRIAC FEMININE DOT 1
    [0.8511] SYRIAC RUKKAKHA 1
    [0.8512] Encoding training set
    [0.9315] Creating new model [1,1,0,48 Lbx100 Do] with 49 outputs
    [0.9318] layer  type     params
    [0.9350] 0      rnn      direction b transposed False summarize False out 100 legacy None
    [0.9361] 1      dropout  probability 0.5 dims 1
    [0.9381] 2      linear   augmented False out 49
    [0.9918] Constructing RMSprop optimizer (lr: 0.001, momentum: 0.9)
    [0.9920] Set OpenMP threads to 4
    [0.9920] Moving model to device cpu
    [0.9924] Starting evaluation run

indicates that the training is running on 788 transcribed lines and a
validation set of 88 lines. 49 different classes, i.e. Unicode code points,
were found in these 788 lines. These affect the output size of the network;
obviously only these 49 different classes/code points can later be output by
the network. Importantly, we can see that certain characters occur markedly
less often than others. Characters like the Syriac feminine dot and numerals
that occur fewer than 10 times will most likely not be recognized well by the
trained net.

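
A count like the ``grapheme count`` in the verbose log can be approximated
outside of kraken to spot under-represented characters before training starts.
The following is a minimal sketch and not kraken's own implementation: it
assumes the transcriptions are available as a list of strings, normalizes them
to NFD as in the ``ketos extract`` example, and lists code points that occur
fewer than a given number of times. The sample ``lines`` and the helpers
``count_code_points`` and ``rare_code_points`` are hypothetical names
introduced only for illustration.

.. code-block:: python

    import unicodedata
    from collections import Counter


    def count_code_points(lines, form="NFD"):
        """Count every code point in the transcriptions after normalization."""
        counts = Counter()
        for line in lines:
            counts.update(unicodedata.normalize(form, line))
        return counts


    def rare_code_points(counts, threshold=10):
        """Return (code point, count) pairs seen fewer than `threshold` times."""
        rare = [(char, n) for char, n in counts.items() if n < threshold]
        return sorted(rare, key=lambda item: item[1])


    # Hypothetical sample; real data would be read from the *.gt.txt files.
    lines = ["\u00e9t\u00e9 1", "\u00e9tat", "t\u00eate"]
    counts = count_code_points(lines)
    for char, n in rare_code_points(counts):
        # unicodedata.name() yields labels such as COMBINING ACUTE ACCENT;
        # the repr() fallback covers code points without a name.
        print(unicodedata.name(char, repr(char)), n)

Because the text is decomposed first, an accented letter is counted as a base
letter plus a combining mark, which matches the way combining dots and
diaereses appear as separate entries in the log above.
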
Evaluation and Validation
-------------------------

While output during training is detailed enough to know when to stop training,
one usually wants to know the specific kinds of errors to expect. Doing more
in-depth error analysis also makes it possible to pinpoint weaknesses in the
training data, e.g. above average error rates for numerals indicate either a
lack of representation of numerals in the training data or erroneous
transcription in the first place.

First the trained model has to be applied to some line transcriptions with the
``ketos test`` command:

.. code-block:: console

    $ ketos test -m syriac_best.mlmodel lines/*.png
    Loading model syriac_best.mlmodel ✓
    Evaluating syriac_best.mlmodel
    Evaluating  [#-----------------------------------]    3%  00:04:56
    ...

After all lines have been processed an evaluation report will be printed:

.. code-block:: console

    === report  ===

    35619     Characters
    336       Errors
    99.06%    Accuracy

    157       Insertions
    81        Deletions
    98        Substitutions

    Count     Missed  %Right
    27046     143     99.47%  Syriac
    7015      52      99.26%  Common
    1558      60      96.15%  Inherited

    Errors    Correct-Generated
    25        {  } - { COMBINING DOT BELOW }
    25        { COMBINING DOT BELOW } - {  }
    15        { . } - {  }
    15        { COMBINING DIAERESIS } - {  }
    12        { ܢ } - {  }
    10        {  } - { . }
    8         { COMBINING DOT ABOVE } - {  }
    8         { ܝ } - {  }
    7         { ZERO WIDTH NO-BREAK SPACE } - {  }
    7         { ܆ } - {  }
    7         { SPACE } - {  }
    7         { ܣ } - {  }
    6         {  } - { ܝ }
    6         { COMBINING DOT ABOVE } - { COMBINING DIAERESIS }
    5         { ܙ } - {  }
    5         { ܬ } - {  }
    5         {  } - { ܢ }
    4         { NO-BREAK SPACE } - {  }
    4         { COMBINING DIAERESIS } - { COMBINING DOT ABOVE }
    4         {  } - { ܒ }
    4         {  } - { COMBINING DIAERESIS }
    4         { ܗ } - {  }
    4         {  } - { ܬ }
    4         {  } - { ܘ }
    4         { ܕ } - { ܢ }
    3         {  } - { ܕ }
    3         { ܐ } - {  }
    3         { ܗ } - { ܐ }
    3         { ܝ } - { ܢ }
    3         { ܀ } - { . }
    3         {  } - { ܗ }

    .....

The first section of the report consists of a simple accounting of the number
of characters in the ground truth, the errors in the recognition output and the
resulting accuracy in per cent.

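
The accuracy figure follows directly from the other two numbers. As a quick
check, the arithmetic can be sketched as below. The formula is an assumption
(the share of ground truth characters not counted as errors) that happens to
reproduce the figures printed in the sample outputs; it is not taken from
kraken's source code. ``character_accuracy`` is a hypothetical helper.

.. code-block:: python

    def character_accuracy(characters, errors):
        """Fraction of ground truth characters not counted as errors."""
        if characters <= 0:
            raise ValueError("characters must be positive")
        return (characters - errors) / characters


    # Figures from the sample evaluation report: 35619 characters, 336 errors.
    print(f"{character_accuracy(35619, 336):.2%}")

    # Figures from the training log line "Accuracy report (2) 0.8445 3504 545".
    print(f"{character_accuracy(3504, 545):.4f}")

The same formula also explains the negative value in the first accuracy report
of the training log: with 9550 errors against 3680 characters the result drops
below zero, since an untrained network can emit more wrong characters than the
ground truth contains.
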
The next table lists the number of insertions (characters occurring in the
ground truth but not in the recognition output), substitutions (misrecognized
characters), and deletions (superfluous characters recognized by the model).

Next is a grouping of errors (insertions and substitutions) by Unicode script.

The final part of the report lists errors sorted by frequency and a per
character accuracy report. Importantly, most errors are incorrect recognition
of combining marks such as dots and diaereses. These may have several sources:
different dot placement in training and validation set, incorrect transcription
such as non-systematic transcription, or unclean speckled scans. Depending on
the error source, correction most often involves adding more training data and
fixing transcriptions. Sometimes it may even be advisable to remove
unrepresentative data from the training set.

Recognition
-----------

The ``kraken`` utility is employed for all non-training related tasks. Optical
character recognition is a multi-step process consisting of binarization
(conversion of input images to black and white), page segmentation (extracting
lines from the image), and recognition (converting line images to character
sequences). All of these may be run in a single call like this:

.. code-block:: console

    $ kraken -i INPUT_IMAGE OUTPUT_FILE binarize segment ocr -m MODEL_FILE

producing a text file from the input image. There are also `hocr `_ and
`ALTO `_ output formats available through the appropriate switches:

.. code-block:: console

    $ kraken -i ... ocr -h
    $ kraken -i ... ocr -a

For debugging purposes it is sometimes helpful to run each step manually and
inspect intermediate results:

.. code-block:: console

    $ kraken -i INPUT_IMAGE BW_IMAGE binarize
    $ kraken -i BW_IMAGE LINES segment
    $ kraken -i BW_IMAGE OUTPUT_FILE ocr -l LINES ...

It is also possible to recognize more than one file at a time by just chaining
``-i ... ...`` clauses like this:

.. code-block:: console

    $ kraken -i input_1 output_1 -i input_2 output_2 ...

Finally, there is a central repository containing freely available models.
Getting a list of all available models:

.. code-block:: console

    $ kraken list

Retrieving model metadata for a particular model:

.. code-block:: console

    $ kraken show arabic-alam-al-kutub
    name: arabic-alam-al-kutub.mlmodel

    An experimental model for Classical Arabic texts.

    Network trained on 889 lines of [0] as a test case for a general Classical
    Arabic model. Ground truth was prepared by Sarah Savant and Maxim Romanov.

    Vocalization was omitted in the ground truth. Training was stopped at
    ~35000 iterations with an accuracy of 97%.

    [0] Ibn al-Faqīh (d. 365 AH). Kitāb al-buldān. Edited by Yūsuf al-Hādī,
    1st edition. Bayrūt: ʿĀlam al-kutub, 1416 AH/1996 CE.
    alphabet: !()-.0123456789:[] «»،؟ءابةتثجحخدذرزسشصضطظعغفقكلمنهوىي ARABIC
    MADDAH ABOVE, ARABIC HAMZA ABOVE, ARABIC HAMZA BELOW

and actually fetching the model:

.. code-block:: console

    $ kraken get arabic-alam-al-kutub

The downloaded model can then be used for recognition by the name shown in its
metadata, e.g.:

.. code-block:: console

    $ kraken -i INPUT_IMAGE OUTPUT_FILE binarize segment ocr -m arabic-alam-al-kutub.mlmodel

For more documentation see the kraken `website `_.