.. _ketos: Training ======== This page describes the training utilities available through the ``ketos`` command line utility in depth. For a gentle introduction on model training please refer to the :ref:`tutorial `. There are currently three trainable components in the kraken processing pipeline: * :ref:`Segmentation `: finding lines and regions in images * :ref:`Reading Order `: ordering lines found in the previous segmentation step. Reading order models are closely linked to segmentation models and both are usually trained on the same dataset. * :ref:`Recognition `: recognition models transform images of lines into text. Depending on the use case it is not necessary to manually train new models for each material. The default segmentation model works well on quite a variety of handwritten and printed documents, a reading order model might not perform better than the default heuristic for simple text flows, and there are recognition models for some types of material available in the repository. The main ``ketos`` command has some common options that configure how most of its sub-commands use computational resources: ======================================================= ====== option action ======================================================= ====== -d, \--device Select device to use (cpu, cuda:0, cuda:1,...). GPU acceleration requires in addition to hardware the requisite libraries (CUDA, ROCM, ...). It can be set to `auto` to select the best device available. \--precision The precision to run neural network operations in. On GPU, bfloat16 (`bf16-true` or `bf16-mixed`) can lead to considerable speedup. \--workers Number of workers processes to spawn for data loading, transformation, or compilation. \--threads Number of threads to use for intra-op parallelization like OpenMP, BLAS, ... ======================================================= ====== Training data formats --------------------- .. _ketos_format: The training tools accept a variety of training data formats, usually some kind of custom low level format, the XML-based formats that are commony used for archival of annotation and transcription data, and in the case of recognizer training a precompiled binary format. It is recommended to use the XML formats for segmentation and reading order training and the binary format for recognition training. ALTO ~~~~ Kraken parses ALTO files using schema version 4.2 or upwards and produces files according to ALTO 4.4. An example showing the attributes necessary for segmentation, recognition, and reading order training follows: .. literalinclude:: /_static/alto.xml :language: xml :force: Importantly, the parser only works with measurements in the pixel domain, i.e. an unset `MeasurementUnit` or one with an element value of `pixel`. For a more in-depth description of how kraken parses and writes ALTO files see :ref:`here `. PAGE XML ~~~~~~~~ PAGE XML is parsed and produced according to the 2019-07-15 version of the schema, although the parser is not strict and works with non-conformant output from a variety of tools. As with ALTO, PAGE XML files can be used to train segmentation, reading order, and recognition models. .. literalinclude:: /_static/pagexml.xml :language: xml :force: For a more in-depth description of how kraken parses and writes ALTO files see :ref:`here `. Binary Datasets ~~~~~~~~~~~~~~~ .. _binary_datasets: In addition to training recognition models directly from XML and image files, a binary dataset format offering a couple of advantages is supported for recognition training. Binary datasets drastically improve loading performance allowing the saturation of most GPUs with minimal computational overhead while also allowing training with datasets that are larger than the systems main memory. A minor drawback is a ~30% increase in dataset size in comparison to the raw images + XML approach. To realize this speedup the dataset has to be compiled first: .. code-block:: console $ ketos compile -f xml -o dataset.arrow file_1.xml file_2.xml ... if there are a lot of individual lines containing many lines this process can take a long time. It can easily be parallelized by specifying the number of separate parsing workers with the ``--workers`` option: .. code-block:: console $ ketos --workers 8 compile -f xml ... In addition, binary datasets can contain fixed splits which allow reproducibility and comparability between training and evaluation runs. Training, validation, and test splits can be pre-defined from multiple sources. Per default they are sourced from tags defined in the source XML files unless the option telling kraken to ignore them is set: .. code-block:: console $ ketos compile --ignore-splits -f xml ... Alternatively fixed-proportion random splits can be created ad-hoc during compile time: .. code-block:: console $ ketos compile --random-split 0.8 0.1 0.1 ... The above line splits assigns 80% of the source lines to the training set, 10% to the validation set, and 10% to the test set. The training and validation sets in the dataset file are used automatically by `ketos train` (unless told otherwise) while the remaining 10% of the test set is selected by `ketos test`. .. warning:: Fixed splits in datasets are ignored during training and testing per default as they require loading the entire dataset into main memory at once, drastically increasing memory consumption and causing initial delays. Use the `\-\-fixed-splits` option in `ketos train` and `ketos test` to respect fixed splits.