DoMY Glossary
This glossary includes common terms used in statistical machine translation (SMT), Moses Decoder and Do Moses Yourself open source projects. Source code "SMT" means the term is applicable to all statistical machine translation tools. Source code "DoMY" means the term is specific to Do Moses Yourself applications.
Term | Source |
Description |
SMT |
Aligned data are the elements of a parallel corpus consisting of two or more languages. Each element in one language matches the corresponding element in the other language(s). The elements, sometimes called segments, can be block-aligned, paragraph-aligned, sentence-aligned, phrase-aligned or token-aligned. |
|
aligned data | DoMY |
CorpusFiltergraph saves aligned data as block-aligned, paragraph-aligned, sentence-aligned and phrase-aligned training corpus in individual aligned files with one element per line. Under the CORPORA root type, aligned data is saved in separate data files in tm folder trees. Under the BUILDS root type, aligned training data is consolidated into one aligned file per language under the tm folder tree. |
DoMY |
Aligner class plug-ins manipulate, change or otherwise access parallel data in the CorpusFiltergraph data buffers. Aligner class plug-ins can re-align data from one type of alignment to another, such as transforming file-aligned data to paragraph- or sentence-aligned data. Aligner class plug-ins can also compare source and target language data and apply processes based on the relationship between the pairs. |
|
SMT |
There are two alignment processes. In corpus preparation, the alignment process creates aligned data. During training, the alignment process uses a program such as MGIZA++ to create word alignment files in the alignments folder. |
|
DoMY |
The “alignments” folder tree under the TRAININGS root type hosts temporary output from the MGIZA++ word alignment process. Word alignment is a vital step in training tm build sets into a phrase table and reordering table. The “alignments” subfolder names are encoded with identifying information and organized to enable the maximum reuse of the supporting files during subsequent trainings and incremental trainings. If they are deleted, the files will be recreated during the next training for the same tm build set. |
|
DoMY |
The “bitext.*” files are pairs of sentence-aligned source and target language data, sometimes containing multiple millions of sentence pairs. They are the parts of the tm build set that are used in the word alignment process. |
|
SMT |
BLEU stands for “Bi-Lingual Evaluation Understudy”. A BLEU score indicates how closely the token sequences in one set of data, such as machine translation output, correlate with (match) the token sequences in another set of data, such as a reference human translation. See: evaluation process |
|
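The token-sequence matching behind a BLEU score can be illustrated with a clipped n-gram precision, which is one ingredient of the full metric. This is only a minimal sketch, not the mteval implementation: the function names `ngrams` and `clipped_precision` are made up for this example, and a real BLEU score also combines several n-gram sizes and applies a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference ("clipping").
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


# Hypothetical machine output and reference translation, already tokenized.
machine = "the cat sat on the mat".split()
human = "the cat is on the mat".split()

unigram_precision = clipped_precision(machine, human, 1)
bigram_precision = clipped_precision(machine, human, 2)
```

Higher n-gram sizes reward longer matching token sequences, which is why the bigram precision here is lower than the unigram precision.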
DoMY |
A “build name” is a name assigned to a build set by the user. This name becomes a folder name under the lm folder or tm folder in the BUILDS root type. |
|
DoMY |
The “build process” copies and consolidates data from the “ready” stage in the CORPORA root type to a “build set” in the BUILDS root type. The build process is the final step to prepare a training corpus. The process finds data that match the user's configuration, applies final tokenization for each language, and consolidates multiple data files into the files necessary to start the training process. |
|
DoMY |
A “build set” is a group of files that are the output of a build process. An lm build set becomes the input data to create a language model. A tm build set becomes the input data to train a phrase table and a reordering table. |
|
DoMY |
A “build type” folder is a folder under the BUILDS root type that hosts an “lm” folder for lm build sets and a “tm” folder for tm build sets. |
|
DoMY |
The BUILDS tree is a root type that hosts lm folder and tm folder trees that hold build sets created in the build process. |
|
DoMY |
The CORPORA tree is a root type that hosts five pairs of “stage” folder trees. These stage trees store the output of various corpus cleaning and preparation processes. The CORPORA.demo root type is a temporary folder with output from the demo script that can be deleted. |
|
DoMY |
“CorpusFiltergraph” is Precision Translation Tools' configuration-driven filter graph designed with parallel toolchains that operate on various streams of text data in different languages. Users edit configuration files that define the sequence of filter plug-ins that process data. |
|
SMT |
Corpus preparation is the general process to extract, transform and categorize various documents from their original purpose, and to align the resulting data into a parallel corpus for training a translation model. |
|
DoMY |
The tm folders and lm folder under a stage folder in the CORPORA root type are “corpus type” folders. |
|
DoMY |
A domain is a sub-attribute of the superdomain. Folders under a domain tree in the file system hierarchy hold data sharing the attribute. Domain depth folders are optional but strongly recommended. When present, the domain depth is one folder depth under a superdomain folder. |
|
DoMY |
The train-mert and train-eval scripts encode their output folders with the command line options used for tuning and evaluation. The folder names of the mert and eval processes are referred to as the “engine name.” The engine name includes the source language and target language 2-letter codes, the tm build name, the word aligner, the n-gram size of the tables, the lm build name, the lm type, and the n-gram size of the lm. |
|
DoMY |
A root type. The ENGINES tree hosts four subfolder trees with various completed components that constitute translation models. The four trees are: lms, tables, evals and recasers. |
|
DoMY |
The “eval.*” files are a pair of source and target language data, typically consisting of several thousand pairs of sentences. As part of the tm build set, they are used in the evaluation process. The eval data is also saved in eval*.sgm files used by the mteval-v12.pl program to generate a final BLEU score evaluation report for the tuned translation model. |
|
DoMY |
The “evals” subfolder hosts translation model configurations and the final BLEU score evaluation report. The evals subfolder names use a convention described in the “engine name” entry. |
|
SMT |
The evaluation process uses a translation model of components created in the training process and configured with the tuning process to translate several thousand source language sentences in the eval files. This process then compares the resulting machine translations to reference translations, also in the eval files. The final BLEU score evaluation report shows how well the machine translations match the reference translations. |
|
DoMY |
A subfolder tree pair under the CORPORA root type that hosts “file-aligned” data. See: CORPORA, parallel data, aligned data |
|
DoMY |
A “filter graph” is a data processing framework that divides tasks into sequences of fundamental programming tools called filters. Like a toolchain, the output of one filter connects to the input of another filter to perform more complex tasks. In addition, a filter graph operates on parallel streams of different data types while maintaining the alignment or synchronization between the different streams. The term “filter graph” originally referred to a framework for processing parallel streams of multimedia data types, such as audio, video and subtitles. Examples of multimedia filter graphs include Microsoft's DirectShow, Apple's QuickTime, and the Linux GNU Gstreamer. CorpusFiltergraph is a filter graph created by Precision Translation Tools for processing parallel linguistic corpora. |
|
DoMY |
A filter class plug-in cleans, changes or otherwise accesses the monolingual data in the filter graph's data buffers. Filters apply language-specific or language-independent processes, but the filter graph always applies them on a language-specific basis. CorpusFiltergraph tracks two different toolchains of filters per language. It applies the “filter” toolchain before processing the aligner plug-ins. It applies the “postfilter” toolchain after processing the aligner plug-ins. |
|
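The processing order described above (per-language “filter” toolchain, then the aligner plug-ins, then the per-language “postfilter” toolchain) can be sketched as follows. This is an illustration under stated assumptions, not CorpusFiltergraph's actual plug-in API: `run_chain`, `process`, `drop_empty_pairs`, and the dictionary-of-lists buffer layout are all hypothetical.

```python
def run_chain(filters, lines):
    """Apply each filter in order; the output of one feeds the next."""
    for apply_filter in filters:
        lines = [apply_filter(line) for line in lines]
    return lines


def drop_empty_pairs(buffers):
    """Hypothetical aligner: keep only rows where every language has text."""
    langs = list(buffers)
    kept = [row for row in zip(*(buffers[lang] for lang in langs)) if all(row)]
    return {lang: [row[i] for row in kept] for i, lang in enumerate(langs)}


def process(buffers, filter_chains, aligner, postfilter_chains):
    """Run filters per language, then the aligner, then postfilters."""
    filtered = {lang: run_chain(filter_chains.get(lang, []), lines)
                for lang, lines in buffers.items()}
    aligned = aligner(filtered)  # sees all languages at once
    return {lang: run_chain(postfilter_chains.get(lang, []), lines)
            for lang, lines in aligned.items()}


# Hypothetical two-language buffers; the second English line is blank.
buffers = {
    "en": ["  Hello  ", "   ", "World"],
    "fr": ["Bonjour", "Vide", " Monde "],
}
result = process(
    buffers,
    filter_chains={"en": [str.strip], "fr": [str.strip]},
    aligner=drop_empty_pairs,
    postfilter_chains={"en": [str.lower], "fr": [str.lower]},
)
```

The filters only ever see one language's lines, while the aligner receives every language's buffer together, which is what lets it remove a row from both sides and keep the streams synchronized.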
DoMY |
A “graph” is a CorpusFiltergraph configuration defining a sequence of fundamental processing steps that perform a specific data processing task such as extraction, cleaning, re-alignment or translation. The graph's name is one of the required domy command line options. |
|
SMT |
An SMT translation model that uses a hierarchical training corpus. |
|
SMT |
A training corpus with each phrase annotated with the hierarchical structure of the language, such as parts of speech, word function, etc. |
|
SMT |
A “language model” or “lm” is a statistical description of one language that includes the frequencies of token-based n-gram occurrences in a corpus. The “lm” is trained from a large monolingual corpus and saved as a file in the lms folder under the ENGINES root type. The language model file is a required component of every translation model. Moses uses the language model to select the most “probable” target language sentence from a large set of “possible” translations it generated using the phrase table and reordering table. |
|
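As a rough illustration of how n-gram frequencies become a statistical description of a language, the sketch below estimates unsmoothed bigram probabilities from a tiny monolingual corpus. It is a simplification under stated assumptions: `train_bigram_lm` is a made-up name, and real toolkits such as SRILM or IRSTLM add smoothing, sentence boundary markers and higher n-gram orders.

```python
from collections import Counter


def train_bigram_lm(sentences):
    """Estimate P(next_token | token) by relative frequency (no smoothing)."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        history_counts.update(tokens[:-1])             # tokens with a successor
        bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return {pair: count / history_counts[pair[0]]
            for pair, count in bigram_counts.items()}


# Hypothetical tokenized corpus, one sentence per line.
corpus = [
    "the house is big",
    "the house is small",
    "the garden is big",
]
lm = train_bigram_lm(corpus)
```

With this corpus, “house” follows “the” in two of three cases, so a decoder scoring candidate sentences with this model would prefer “the house” over “the garden”.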
SMT |
Language model files contain statistical data generated by one of various programs. Moses Decoder can use language model file types including KenLM, SRILM, RandLM and IRSTLM. The SRILM, RandLM and IRSTLM toolkits include tools that train new language model files. KenLM, however, only reads ARPA standard language model files, which can be created by SRILM or IRSTLM. DoMY can create language models in all these formats and configure Moses Decoder to use any of them. |
|
DoMY |
An “lm build set” is a build set in the lm subfolder of the BUILDS root type created by the build process. The set has one file named “monotext.tt” with consolidated monolingual data, one sentence per line, for training a language model. Another file named “recasetext.tt”, with consolidated data, one sentence per line, is for training a recaser model. |
|
DoMY |
CorpusFiltergraph supports two different lm folder types. Both lm folder types host un-aligned language model data. An lm folder under the BUILDS root type hosts lm build sets. An lm folder under a corpus type tree in the CORPORA root type hosts language model data in individual files in source and target language folders. |
|
DoMY |
The train-lm script encodes its output folder with the command line options used for training the lm. The folder name is referred to as the “lm name.” The lm name includes the target language 2-letter code, the lm build name, the lm type, and the n-gram size of the lm. |
|
DoMY |
The “lms” folder hosts trained language model files under the ENGINES root type. The lms subfolder names use a convention described in the “lm name” entry. |
|
DoMY |
The “mert.*” files are a pair of source and target language data, typically containing several thousand pairs. As part of the tm build set created by the build process, they are used in the tuning process. |
|
DoMY |
The “merts” folder tree in the TRAININGS root type hosts subfolders with temporary files used during the “minimum error rate training” (MERT) tuning process. These temporary files are reused to resume a tuning if tuning was interrupted before completion. After tuning is complete, these temporary files are not reused and can be deleted without concern. The merts subfolder names use a convention described in the “engine name” entry. |
|
SMT |
The moses configuration file is a text file created during the tuning process. The file contains the paths to the phrase table(s), reordering table, language model(s) with other codes and numeric values that control how the Moses Decoder works. |
|
SMT |
An n-gram is a subsequence of n items (n = 1, 2, 3, etc.) in a larger sequence. In an lm, n-grams are sequences of tokens. In phrase tables and reordering tables, n-grams are sequences of pairs of source and target language tokens. |
|
DoMY |
A subfolder tree pair under the CORPORA root type that hosts “paragraph-aligned” data. |
|
SMT |
A linguistic corpus of two or more languages where each element in one language corresponds to an element with the same meaning in the other language(s). The original, authored language is identified as the “source” language. Non-source languages are referred to as “target” languages. For Moses SMT, parallel data takes the form of one source and one target language text file, where line N of one file is the translation of line N of the other. |
|
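The line-by-line correspondence between a source file and a target file can be sketched as below. This is an illustration only: the function name `read_sentence_pairs`, the file names `bitext.en` and `bitext.fr`, and the sample sentences are assumptions made for the example, not DoMY's actual file naming.

```python
import tempfile
from pathlib import Path


def read_sentence_pairs(source_path, target_path):
    """Pair line N of the source file with line N of the target file."""
    source_lines = Path(source_path).read_text(encoding="utf-8").splitlines()
    target_lines = Path(target_path).read_text(encoding="utf-8").splitlines()
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target files are not line-aligned")
    return list(zip(source_lines, target_lines))


# Write two tiny hypothetical files, then read them back as pairs.
with tempfile.TemporaryDirectory() as folder:
    source_file = Path(folder) / "bitext.en"
    target_file = Path(folder) / "bitext.fr"
    source_file.write_text("the house is big\nthe garden is small\n",
                           encoding="utf-8")
    target_file.write_text("la maison est grande\nle jardin est petit\n",
                           encoding="utf-8")
    pairs = read_sentence_pairs(source_file, target_file)
```

The length check matters because a single missing line shifts every later pair out of alignment, which silently corrupts training data.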
SMT |
See “parallel data” |
|
SMT |
A “phrase table” is a statistical description of a parallel corpus of source-target language sentence pairs. The frequencies that n-grams in a source language text co-occur with n-grams in a parallel target language text represent the probability that those source-target paired n-grams will occur again in other texts similar to the parallel corpus. In practical terms, the phrase table is a file created during the training process and saved as a file in the tables subfolder of the ENGINES root type. It functions as a sophisticated dictionary between the source and target languages. Phrase tables and reordering tables are translation model components. |
|
SMT |
A “pipeline” is a toolchain of processes connected by standard streams, so that the output of each process (stdout) feeds directly as input (stdin) to the next one. See: CorpusFiltergraph |
|
DoMY |
Plug-ins are programming modules that can be assembled into parallel toolchains within the filter graph. The config.ini file for a graph defines which plug-ins are loaded and in what order. CorpusFiltergraph supports four plug-in classes: readers, writers, filters and aligners. |
|
DoMY |
A subfolder tree pair under the CORPORA root type that hosts data for a “quality-check” human review. |
|
DoMY |
The RAW tree is a root type that contains unstructured or user-defined folder hierarchies that serve as a temporary holding area to “import” data into the CORPORA root type tree. Also, data exported from the CORPORA root type tree is saved in the RAW tree. The RAW.demo root type is a temporary folder with output from the demo script that can be deleted. |
|
DoMY |
Reader class plug-ins serve as the input to CorpusFiltergraph because they read data from data storage into data buffers for downstream filters. Current reader class plug-ins include “reader-file.py” that reads text data files from a hierarchical file system, and “reader-tmx.py” that reads <tuv> segments from a TMX XML file. Future reader class plug-ins could include a “reader-sql.py” plug-in to read data directly from a SQL database. Reader plug-ins are language-independent. |
|
DoMY |
A subfolder tree pair under the CORPORA root type that hosts data after human QC that is ready for consolidation into build sets. |
|
SMT |
A recaser model is a special translation model that translates lowercased data to “natural” cased text (upper and lower casing). |
|
DoMY |
The train-recaser script encodes its output folder with the command line options used for training the recaser. The folder name is referred to as the “recaser name.” The recaser name includes the target language 2-letter code, the recaser build name, the lm type, and the n-gram size of the lm. |
|
DoMY |
The “recasers” subfolder in the ENGINES root type hosts complete recaser models. The recasers subfolder names use a convention described in the “recaser name” entry. |
|
DoMY |
The render language is the language of the text, regardless of whether it is the source or target language. The render language is the same as the TMX specification “lang” attribute of the <tuv> tag. In CorpusFiltergraph, the render language is the deepest folder level in the CORPORA root type hierarchy. |
|
SMT |
A “reordering table” contains the statistical frequencies that describe the changes in word order between source and target languages, such as “big house” versus “house big”. In practical terms, a “reordering table” is a file created during the training process and saved as a file in the tables subfolder of the ENGINES root type. The reordering table is a translation model component. |
|
DoMY |
The root folder is the top-level folder tree that hosts all DoMY data. The default system root folder is “/opt/domy”. Users can override the default value in the ~/domy/domy-ce.ini file. Each graph's config.ini can also override the default. The root folder hosts six root types. |
|
DoMY |
“Root type” folders are the folder trees under the root folder. Each root type tree has a different folder hierarchy. All subfolders within a root type share the same folder hierarchy. Valid root type folders include BUILDS, CORPORA, ENGINES, RAW, TRAININGS, and TRANSLATIONS. |
|
DoMY |
A subfolder tree pair under the CORPORA root type that hosts “sentence-aligned” parallel data. |
|
SMT |
The source language is the language of the text that is to be translated. Typically, this is the authored language of the text. The source language is the same as the TMX specification “srclang” attribute of the <tu> tag. In CorpusFiltergraph, the source language is the second-deepest folder level in the CORPORA root tree type hierarchy. |
|
DoMY |
“Stage” folders are pairs of trees under the CORPORA root type. The main halves of the stage folder pairs are “fa”, “pa”, “sa”, “qc” and “ready.” Data segments on each text file line in the “fa”, “pa” and “sa” folders correspond to the TMX segtype attributes “block”, “paragraph” and “sentence” respectively. The paired “workbench” folder hosts data that failed any automated data quality checks. Descending depths into the hierarchy of stages include superdomain, domain, subdomain, corpus type, source language, and target language folders. |
|
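The descending folder depths listed above can be pictured as a path built from its parts. The sketch below is an assumption-laden illustration, not DoMY code: `corpus_folder` is a hypothetical helper, and the superdomain and domain names “legal” and “contracts” are invented for the example. Only the default root folder “/opt/domy”, the CORPORA root type, the stage names, and the order of depths come from this glossary.

```python
from pathlib import PurePosixPath


def corpus_folder(root, stage, corpus_type, source_lang, target_lang,
                  superdomain=None, domain=None, subdomain=None):
    """Build a folder path following the stage hierarchy, top to bottom."""
    parts = [root, "CORPORA", stage]
    # Superdomain, domain and subdomain depths are optional.
    parts += [name for name in (superdomain, domain, subdomain) if name]
    parts += [corpus_type, source_lang, target_lang]
    return PurePosixPath(*parts)


# Sentence-aligned ("sa") parallel data, English source, French target.
path = corpus_folder("/opt/domy", "sa", "tm", "en", "fr",
                     superdomain="legal", domain="contracts")
minimal = corpus_folder("/opt/domy", "ready", "lm", "en", "en")
```

Keeping the optional depths in a fixed order means that two corpora with the same attributes always land in the same folder, which is what allows the build process to find data by configuration.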
DoMY |
A subdomain is a sub-attribute of the domain. Folders under a subdomain tree in the file system hierarchy hold data sharing the attribute. Subdomain depth folders are optional but strongly recommended. When present, the subdomain depth is one folder depth under a domain folder. |
|
DoMY |
A superdomain is the most encompassing attribute that describes corpora characteristics. Folders under a superdomain tree in the file system hierarchy hold data sharing the attribute. Superdomain depth folders are optional but strongly recommended. When present, the superdomain depth is one folder depth under a stage folder. |
|
DoMY |
The train-tables script encodes its output folder with the command line options used to train the tables. This folder name is referred to as the “table name.” The table name includes the source language and target language 2-letter codes, the tm build name, the word aligner, and the n-gram size of the tables. |
|
DoMY |
The “tables” folder under the ENGINES root type holds trained phrase tables and reordering tables. The tables subfolder names use a convention described in the “table name” entry. |
|
SMT |
The target language is the language the source language text should be translated to. |
|
DoMY |
A “tm build set” is a build set in the tm subfolder of the BUILDS root type. Tm build sets consist of three pairs of parallel data files plus six supporting sgm/xml files needed to complete a training and tuning process. |
|
DoMY |
CorpusFiltergraph supports two different tm folder types. Both tm folder types host aligned parallel data. A tm folder under the BUILDS root type hosts tm build sets. A tm folder under a corpus type tree in the CORPORA root type hosts parallel data in source and target language folders. |
|
SMT |
Tokenization is the process of separating words from punctuation and symbols into tokens. |
|
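A minimal sketch of the idea, assuming a simple regular-expression rule: runs of word characters become one token, and every other non-space character becomes its own token. The `tokenize` function is hypothetical and is not the Moses tokenizer, which also handles abbreviations, numbers and language-specific rules.

```python
import re

# A token is either a run of word characters or one punctuation/symbol character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return the text with tokens separated by single spaces."""
    return " ".join(TOKEN_PATTERN.findall(text))


tokenized = tokenize("Hello, world! It costs $5.")
# tokenized == "Hello , world ! It costs $ 5 ."
```

Separating punctuation this way lets “world” and “world!” count as the same word during training instead of as two unrelated tokens.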
SMT |
Tokens are the basic unit in a machine translation process. Tokens are a sequence of characters, such as words, punctuation or symbols, separated by a space. See: BLEU score |
|
SMT |
A “toolchain” is a series of linked or “chained” programming tools where the output of an “upstream” tool becomes the input for a “downstream” tool. See: CorpusFiltergraph |
|
SMT |
A linguistic corpus with parallel data prepared for training into the phrase table and reordering table components of a translation model. |
|
SMT |
See: training corpus |
|
SMT |
Training is a process in the machine learning branch of the artificial intelligence field. In the training process, a system “learns” the relationships between parallel data. In SMT, the source language texts are stimuli that generate the target language text as a response. In practical terms, training starts with the bitext.* files and creates the phrase table and reordering table that are components of a translation model. |
|
DoMY |
The TRAININGS tree is a root type that hosts the “alignments” and “merts” subfolder trees. |
|
DoMY |
A “translation engine” consists of a translation model, a supporting “translate” graph and a recaser model. The translate graph prepares the source language documents before translation and post-processes the target language data into usable text with the recaser model. |
|
SMT |
A translation memory (tm) is parallel data that was collected for the purpose of aiding future translations. |
|
SMT |
A “translation model” consists of one or more phrase tables, zero or more reordering tables, one or more language models and one moses configuration file that were created during the training and tuning processes. |
|
DoMY |
The TRANSLATIONS tree is a root type that hosts subfolders that support translation and alignment of translated target language output with source language. Source language documents should be placed in the “lm” subfolder tree. Translated target language documents are saved in the “tm” subfolder tree with a copy of the source language. |
|
SMT |
Tuning is a process that finds the optimized configuration file settings for a translation model when used for a specific purpose. The tuning process translates thousands of source language phrases in the mert* files with a translation model, compares the model's output to a set of reference human translations, and adjusts the settings with the intention of improving the translation quality. The tuning process repeats these steps through numerous iterations until it reaches an optimized translation quality. |
|
SMT |
A word is the smallest unit of meaning in a language that will stand on its own. In SMT, a word is a token created in the tokenization process that is not a punctuation or symbol. |
|
SMT |
A word aligner is a program that creates word alignment files during the word alignment process. Moses currently supports these word aligners: GIZA++, MGIZA++, and BerkeleyAligner. DoMY uses MGIZA++ by default. |
|
SMT |
During the training process, the word alignment process uses a word aligner to create a word alignment file, which is saved under the alignments folder in the TRAININGS root type. |
|
DoMY |
Folders with the “workbench” suffix under the CORPORA root type are paired with a stage of processing, such as “sa” and “sa-workbench”. Workbench folders contain data that failed an automated quality control process for its stage pair and should be reviewed by an editor. |
|
DoMY |
Writer class plug-ins serve as the output from CorpusFiltergraph because they write data from upstream data buffers to a storage mechanism outside CorpusFiltergraph. Current writer class plug-ins include “writer-file.py” that writes text data files to a hierarchical file system, and “writer-tmx.py” that writes XML files compliant with the TMX version 1.4 specification with sentence segment types. Future writer class plug-ins could include a “writer-sql.py” plug-in to save data directly to a SQL database. Writer plug-ins are language-independent. |
2011-10-22 21:15
Copyright © 2011 Precision Translation Tools Co., Ltd.