DoMY Glossary

This glossary includes common terms used in statistical machine translation (SMT), Moses Decoder and Do Moses Yourself open source projects. Source code "SMT" means the term is applicable to all statistical machine translation tools. Source code "DoMY" means the term is specific to Do Moses Yourself applications.

Term

Source

Description

aligned data

SMT

Aligned data are the elements of a parallel corpus consisting of two or more languages. Each element in one language matches the corresponding element in the other language(s). The elements, sometimes called segments, can be block-aligned, paragraph-aligned, sentence-aligned, phrase-aligned or token-aligned.

aligned data

DoMY

CorpusFiltergraph saves aligned data as block-aligned, paragraph-aligned, sentence-aligned and phrase-aligned training corpora in individual aligned files with one element per line. Under the CORPORA root type, aligned data is saved in separate data files in tm folder trees. Under the BUILDS root type, aligned training data is consolidated into one aligned file per language under the tm folder tree.

aligner plug-in

DoMY

Aligner class plug-ins manipulate, change or otherwise access parallel data in the CorpusFiltergraph data buffers. Aligner class plug-ins can re-align data from one type of alignment to another, such as transforming file-aligned data to paragraph- or sentence-aligned data. Aligner class plug-ins can also compare source and target language data and apply processes based on the relationship between the pairs.

alignment process

SMT

There are two alignment processes. In corpus preparation, the alignment process creates aligned data. During training, the alignment process uses a program such as MGIZA++ to create word alignment files in the alignments folder.

alignments folder

DoMY

The “alignments” folder tree under the TRAININGS root type hosts temporary output from the MGIZA++ word alignment process. Word alignment is a vital step in training tm build sets into a phrase table and reordering table. The “alignments” subfolder names are encoded with identifying information and organized to enable the maximum reuse of the supporting files during subsequent trainings and incremental trainings. If the files are deleted, they will be recreated during the next training for the same tm build set.

bitext.* files

DoMY

The “bitext.*” files are pairs of sentence-aligned source and target language data, sometimes containing several million sentence pairs. They are the parts of the tm build set that are used in the word alignment process.

BLEU score

SMT

BLEU stands for “Bilingual Evaluation Understudy”. A BLEU score indicates how closely the token sequences in one set of data, such as machine translation output, correlate with (match) the token sequences in another set of data, such as a reference human translation. See: evaluation process
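
The matching of token sequences can be illustrated with a minimal Python sketch. It computes only the clipped n-gram precision for a single n-gram size, which is one ingredient of BLEU. The full score combines several n-gram sizes and applies a brevity penalty, and the evaluation report in DoMY comes from mteval-v12.pl, not from code like this. The function name `clipped_precision` is an assumption made for illustration.

```python
from collections import Counter
from typing import List


def clipped_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference ("clipping").
    """
    cand_counts = Counter(
        tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)
    )
    ref_counts = Counter(
        tuple(reference[i:i + n]) for i in range(len(reference) - n + 1)
    )
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


machine_output = "the cat sat on the mat".split()
human_reference = "the cat is on the mat".split()
unigram_precision = clipped_precision(machine_output, human_reference, 1)
bigram_precision = clipped_precision(machine_output, human_reference, 2)
```

Clipping keeps a degenerate output such as “the the the the” from scoring well just because “the” appears in the reference.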

build name

DoMY

A “build name” is a name assigned to a build set by the user. This name becomes a folder name under the lm folder or tm folder in the BUILDS root type.

build process

DoMY

The “build process” copies and consolidates data from the “ready” stage in the CORPORA root type to a “build set” in the BUILDS root type. The build process is the final step to prepare a training corpus. The process finds data that match the user's configuration, applies final tokenization for each language, and consolidates multiple data files into the files necessary to start the training process.

build set

DoMY

A “build set” is a group of files that are the output of a build process. An lm build set becomes the input data to create a language model. A tm build set becomes the input data to train a phrase table and a reordering table.

build type

DoMY

A “build type” folder is a folder under the BUILDS root type that hosts an “lm” folder for lm build sets and a “tm” folder for tm build sets.

BUILDS folder

DoMY

The BUILDS tree is a root type that hosts lm folder and tm folder trees that hold build sets created in the build process.

CORPORA folder

DoMY

The CORPORA tree is a root type that hosts five pairs of “stage” folder trees. These stage trees store the output of various corpus cleaning and preparation processes. The CORPORA.demo root type is a temporary folder with output from the demo script that can be deleted.

CorpusFiltergraph

DoMY

“CorpusFiltergraph” is Precision Translation Tools' configuration-driven filter graph, designed with parallel toolchains that operate on various streams of text data in different languages. Users edit configuration files that define the sequence of filter plug-ins that process data.

corpus preparation

SMT

Corpus preparation is the general process of extracting, transforming and categorizing various documents from their original purpose, and aligning the resulting data into a parallel corpus for training a translation model.

corpus type folder

DoMY

The tm folders and lm folder under a stage folder in the CORPORA root type are “corpus type” folders.

domain folder

DoMY

A domain is a sub-attribute of the superdomain. Folders under a domain tree in the file system hierarchy hold data sharing the attribute. Domain depth folders are optional but strongly recommended. When present, the domain depth is one folder depth under a superdomain folder.

engine name

DoMY

The train-mert and train-eval scripts encode their output folders with the command line options used for tuning and evaluation. The folder names of the mert and eval processes are referred to as the “engine name.” The engine name includes the source language and target language 2-letter codes, the tm build name, the word aligner, the n-gram size of the tables, the lm build name, the lm type, and the n-gram size of the lm.

ENGINES folder

DoMY

A root type. The ENGINES tree hosts four subfolder trees with various completed components that constitute translation models. The four trees are: lms, tables, evals and recasers.

eval.* files

DoMY

The “eval.*” files are a pair of source and target language data, typically consisting of several thousand pairs of sentences. As part of the tm build set, they are used in the evaluation process. The eval data is also saved in eval*.sgm files used by the mteval-v12.pl program to generate a final BLEU score evaluation report for the tuned translation model.

evals folder

DoMY

The “evals” subfolder hosts translation model configurations and the final BLEU score evaluation report. The evals subfolder names use a convention described in the “engine name” entry.

evaluation process

SMT

The evaluation process uses a translation model of components created in the training process and configured with the tuning process to translate several thousand source language sentences in the eval files. This process then compares the resulting machine translations to reference translations, also in the eval files. The final BLEU score evaluation report shows how well the machine translations match the reference translations.

fa, fa-workbench folders

DoMY

A subfolder tree pair under the CORPORA root type that hosts “file-aligned” data. See: CORPORA, parallel data, aligned data

filter graph

DoMY

A “filter graph” is a data processing framework that divides tasks into sequences of fundamental programming tools called filters. Like a toolchain, the output of one filter connects to the input of another filter to perform more complex tasks. In addition, a filter graph operates on parallel streams of different data types while maintaining the alignment or synchronization between the different streams. The term “filter graph” originally referred to a framework for processing parallel streams of multimedia data types, such as audio, video and subtitles. Examples of multimedia filter graphs include Microsoft's DirectShow, Apple's QuickTime, and GStreamer on Linux. CorpusFiltergraph is a filter graph created by Precision Translation Tools for processing parallel linguistic corpora.
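
The idea of parallel streams that stay aligned can be sketched in a few lines of Python. This is a minimal illustration and not CorpusFiltergraph's actual API. The function `run_graph`, the `buffers` and `chains` dictionaries, and the language codes are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# A filter takes one line of text and returns the processed line.
Filter = Callable[[str], str]


def run_graph(
    buffers: Dict[str, List[str]],
    chains: Dict[str, List[Filter]],
) -> Dict[str, List[str]]:
    """Apply each language's filter chain to that language's stream.

    Filters here map one line to one line, so line N in every output
    stream still corresponds to line N in every other stream.
    """
    lengths = {len(lines) for lines in buffers.values()}
    if len(lengths) > 1:
        raise ValueError("input streams are not aligned line by line")

    result: Dict[str, List[str]] = {}
    for language, lines in buffers.items():
        processed = lines
        # The output of each filter becomes the input of the next one.
        for apply_filter in chains.get(language, []):
            processed = [apply_filter(line) for line in processed]
        result[language] = processed
    return result


cleaned = run_graph(
    {"en": [" Hello ", "World"], "fr": ["Bonjour ", " Monde"]},
    {"en": [str.strip, str.lower], "fr": [str.strip]},
)
```

Restricting filters to line-in, line-out functions is what preserves alignment in this sketch. A filter that drops or merges lines would have to act on all streams at once.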

filter plug-in

DoMY

A filter class plug-in cleans, changes or otherwise accesses the monolingual data in the filter graph's data buffers. Filters apply language-specific or language-independent processes, but the filter graph always applies them on a language-specific basis. CorpusFiltergraph tracks two different toolchains of filters per language. It applies the “filter” toolchain before processing the aligner plug-ins. It applies the “postfilter” toolchain after processing the aligner plug-ins.

graph

DoMY

A “graph” is a CorpusFiltergraph configuration defining a sequence of fundamental processing steps that perform a specific data processing task such as extraction, cleaning, re-alignment or translation. The graph's name is one of the required domy command line options.

hierarchical model

SMT

An SMT translation model that uses a hierarchical training corpus.

hierarchical training data

SMT

A training corpus with each phrase annotated with the hierarchical structure of the language, such as parts of speech, word function, etc.

language model

SMT

A “language model” or “lm” is a statistical description of one language that includes the frequencies of token-based n-gram occurrences in a corpus. The “lm” is trained from a large monolingual corpus and saved as a file in the lms folder under the ENGINES root type. The language model file is a required component of every translation model. Moses uses the language model to select the most “probable” target language sentence from a large set of “possible” translations it generated using the phrase table and reordering table.
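
A minimal Python sketch of the statistical idea follows. It estimates bigram probabilities from raw counts with no smoothing, which real toolkits such as SRILM or IRSTLM do not do, so it should be read as an illustration only. The function name and the `<s>` and `</s>` boundary markers are assumptions of this example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def train_bigram_model(
    sentences: Iterable[List[str]],
) -> Dict[Tuple[str, str], float]:
    """Estimate P(next token | previous token) from tokenized sentences."""
    history_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    for tokens in sentences:
        # Pad with boundary markers so sentence starts and ends are modeled.
        padded = ["<s>"] + tokens + ["</s>"]
        history_counts.update(padded[:-1])
        bigram_counts.update(zip(padded, padded[1:]))
    return {
        (previous, current): count / history_counts[previous]
        for (previous, current), count in bigram_counts.items()
    }


model = train_bigram_model([["the", "house"], ["the", "car"]])
probability = model[("the", "house")]
```

Unseen bigrams are simply absent from the dictionary here. Handling them sensibly is the job of the smoothing methods that production toolkits implement.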

language model types

SMT

Language model files contain statistical data generated by one of various programs. Moses Decoder can use language model file types including KenLM, SRILM, RandLM and IRSTLM. The SRILM, RandLM and IRSTLM toolkits include tools that train new language model files. KenLM, however, only reads ARPA standard language model files, which can be created by SRILM or IRSTLM. DoMY can create language models in all these formats and configure Moses Decoder to use any of them.

lm build set

DoMY

An “lm build set” is a build set in the lm subfolder of the BUILDS root type created by the build process. The set has one file named “monotext.tt” with consolidated monolingual data, one sentence per line, for training a language model. Another file named “recasetext.tt” with consolidated data, one sentence per line, is for training a recaser model.

lm folder

DoMY

CorpusFiltergraph supports two different lm folder types. Both lm folder types host un-aligned language model data. An lm folder under the BUILDS root type hosts lm build sets. An lm folder under a corpus type tree in the CORPORA root type hosts language model data in individual files in source and target language folders.

lm name

DoMY

The train-lm script encodes its output folder with the command line options used for training the lm. The folder name is referred to as the “lm name.” The lm name includes the target language 2-letter code, the lm build name, the lm type, and the n-gram size of the lm.

lms folder

DoMY

The “lms” folder hosts trained language model files under the ENGINES root type. The lms subfolder names use a convention described in the “lm name” entry.

mert.* files

DoMY

The “mert.*” files are a pair of source and target language data, typically containing several thousand sentence pairs. As part of the tm build set created by the build process, they are used in the tuning process.

merts folder

DoMY

The “merts” folder tree in the TRAININGS root type hosts subfolders with temporary files used during the “minimum error rate training” (MERT) tuning process. These temporary files are reused to resume a tuning if tuning was interrupted before completion. After tuning is complete, these temporary files are not reused and can be deleted without concern. The merts subfolder names use a convention described in the “engine name” entry.

moses configuration file

SMT

The moses configuration file is a text file created during the tuning process. The file contains the paths to the phrase table(s), reordering table, language model(s) with other codes and numeric values that control how the Moses Decoder works.

n-grams

SMT

An n-gram is a subsequence of n items (n = 1, 2, 3, etc.) in a larger sequence. In an lm, n-grams are sequences of tokens. In phrase tables and reordering tables, n-grams are sequences of pairs of source and target language tokens.
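
As a small Python illustration, the following sketch lists the token n-grams of a sentence. The function name `ngrams` is chosen for this example and is not part of DoMY or Moses.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n tokens, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the big house is red".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A sequence shorter than n yields an empty list, so short sentences contribute no higher-order n-grams.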

pa, pa-workbench folders

DoMY

A subfolder tree pair under the CORPORA root type that hosts “paragraph-aligned” data.

parallel data

SMT

A linguistic corpus of two or more languages where each element in one language corresponds to an element with the same meaning in the other language(s). The original, authored language is identified as the “source” language. Non-source languages are referred to as “target” languages. For Moses SMT, parallel data takes the form of one source language text file and one target language text file, where line N of one file is the translation of line N of the other.
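
The line-by-line correspondence can be sketched in Python as follows. This is a minimal example under the assumption that each input yields one sentence per line. The function name `read_sentence_pairs` is hypothetical, and the inputs can be open text files or, as here, plain lists of lines.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def read_sentence_pairs(
    source_lines: Iterable[str],
    target_lines: Iterable[str],
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs, failing if the inputs differ in length."""
    missing = object()  # marks the shorter input running out of lines
    paired = zip_longest(source_lines, target_lines, fillvalue=missing)
    for line_number, (source, target) in enumerate(paired, start=1):
        if source is missing or target is missing:
            raise ValueError(f"inputs differ in length at line {line_number}")
        yield source.rstrip("\n"), target.rstrip("\n")


pairs = list(
    read_sentence_pairs(
        ["the house is big\n", "the car is red\n"],
        ["la maison est grande\n", "la voiture est rouge\n"],
    )
)
```

Raising an error on a length mismatch is a design choice of the sketch. A missing line would otherwise shift every later pair out of alignment without any visible symptom.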

parallel corpus

SMT

See: parallel data

phrase table

SMT

A “phrase table” is a statistical description of a parallel corpus of source-target language sentence pairs. The frequencies that n-grams in a source language text co-occur with n-grams in a parallel target language text represent the probability that those source-target paired n-grams will occur again in other texts similar to the parallel corpus. In practical terms, the phrase table is a file created during the training process and saved as a file in the tables subfolder of the ENGINES root type. It functions as a sophisticated dictionary between the source and target languages. Phrase tables and reordering tables are translation model components.

pipeline

SMT

A “pipeline” is a toolchain of processes connected by standard streams, so that the output of each process (stdout) feeds directly as input (stdin) to the next one. See: CorpusFiltergraph

plug-in

DoMY

Plug-ins are programming modules that can be assembled into parallel toolchains within the filter graph. The config.ini file for a graph defines which plug-ins are loaded and in what order. CorpusFiltergraph supports four plug-in classes: readers, writers, filters and aligners.

qc, qc-workbench folders

DoMY

A subfolder tree pair under the CORPORA root type that hosts data for a “quality-check” human review.

RAW folder

DoMY

The RAW tree is a root type that contains unstructured or user-defined folder hierarchies that serve as a temporary holding area to “import” data into the CORPORA root type tree. Also, data exported from the CORPORA root type tree is saved in the RAW tree. The RAW.demo root type is a temporary folder with output from the demo script that can be deleted.

reader plug-ins

DoMY

Reader class plug-ins serve as the input to CorpusFiltergraph because they read data from a data store into data buffers for downstream filters. Current reader class plug-ins include “reader-file.py” that reads text data files from a hierarchical file system, and “reader-tmx.py” that reads <tuv> segments from a TMX xml file. Future reader class plug-ins could include a “reader-sql.py” plug-in to read data directly from a SQL database. Reader plug-ins are language-independent.

ready, ready-workbench

DoMY

A subfolder tree pair under the CORPORA root type that hosts data after human QC that is ready for consolidation into build sets.

recaser model

SMT

A recaser model is a special translation model that translates lower cased data to “natural” cased text (upper and lower casing).
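
The task can be illustrated with a deliberately simplified Python sketch. It restores each token to the casing seen most often in cased training text. A real recaser model is a translation model, as described above, so this lookup table is a stand-in chosen for brevity, and both function names are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def train_recaser(cased_sentences: Iterable[List[str]]) -> Dict[str, str]:
    """Map each lowercased token to its most frequent cased form."""
    forms: Dict[str, Counter] = defaultdict(Counter)
    for tokens in cased_sentences:
        for token in tokens:
            forms[token.lower()][token] += 1
    return {lower: seen.most_common(1)[0][0] for lower, seen in forms.items()}


def recase(tokens: List[str], table: Dict[str, str]) -> List[str]:
    """Replace known lowercased tokens with their learned casing."""
    return [table.get(token, token) for token in tokens]


table = train_recaser([["Paris", "is", "big"], ["I", "love", "Paris"]])
restored = recase(["i", "love", "paris"], table)
```

The sketch ignores context, so it cannot capitalize the first word of a sentence or distinguish words whose casing depends on usage.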

recaser name

DoMY

The train-recaser script encodes its output folder with the command line options used for training the recaser. The folder name is referred to as the “recaser name.” The recaser name includes the target language 2-letter code, the recaser build name, the lm type, and the n-gram size of the lm.

recasers folder

DoMY

The “recasers” subfolder in the ENGINES root type hosts complete recaser models. The recasers subfolder names use a convention described in the “recaser name” entry.

render language

DoMY

The render language is the language of the text, regardless of whether it is the source or target language. The render language is the same as the TMX specification “lang” attribute of the <tuv> tag. In CorpusFiltergraph, the render language is the deepest folder level in the CORPORA root type hierarchy.

reordering table

SMT

A “reordering table” contains the statistical frequencies that describe the changes in word order between source and target languages, such as “big house” versus “house big”. In practical terms, a “reordering table” is a file created during the training process and saved as a file in the tables subfolder of the ENGINES root type. The reordering table is a translation model component.

root folder

DoMY

The root folder is the top-level folder tree that hosts all DoMY data. The default system root folder is “/opt/domy”. Users can override the default value in the ~/domy/domy-ce.ini file. Each graph's config.ini can also override the default. The root folder hosts six root types.

root type

DoMY

“Root type” folders are the folder trees under the root folder. Each root type tree has a different folder hierarchy. All subfolders within a root type share the same folder hierarchy. Valid root type folders include BUILDS, CORPORA, ENGINES, RAW, TRAININGS, and TRANSLATIONS.

sa, sa-workbench folders

DoMY

A subfolder tree pair under the CORPORA root type that hosts “sentence-aligned” parallel data.

source language

SMT

The source language is the language of the text that is to be translated. Typically, this is the authored language of the text. The source language is the same as the TMX specification “srclang” attribute of the <tu> tag. In CorpusFiltergraph, the source language is the second-deepest folder level in the CORPORA root type hierarchy.

stage folders

DoMY

“Stage” folders are pairs of trees under the CORPORA root type. The main halves of the stage folder pairs are “fa”, “pa”, “sa”, “qc” and “ready.” Data segments on each text file line in the “fa”, “pa” and “sa” folders correspond to the TMX segtype attributes “block”, “paragraph” and “sentence” respectively. The paired “workbench” folder hosts data that failed any automated data quality checks. Descending depths into the hierarchy of stages include superdomain, domain, subdomain, corpus type, source language, and target language folders.

subdomain folder

DoMY

A subdomain is a sub-attribute of the domain. Folders under a subdomain tree in the file system hierarchy hold data sharing the attribute. Subdomain depth folders are optional but strongly recommended. When present, the subdomain depth is one folder depth under a domain folder.

superdomain folder

DoMY

A superdomain is the most encompassing attribute that describes corpora characteristics. Folders under a superdomain tree in the file system hierarchy hold data sharing the attribute. Superdomain depth folders are optional but strongly recommended. When present, the superdomain depth is one folder depth under a stage folder.

table name

DoMY

The train-tables script encodes its output folder with the command line options used to train the tables. This folder name is referred to as the “table name.” The table name includes the source language and target language 2-letter codes, the tm build name, the word aligner, and the n-gram size of the tables.

tables folder

DoMY

The “tables” folder under the ENGINES root type holds trained phrase tables and reordering tables. The tables subfolder names use a convention described in the “table name” entry.

target language

SMT

The target language is the language the source language text should be translated to.

tm build set

DoMY

A “tm build set” is a build set in the tm subfolder of the BUILDS root type. Tm build sets consist of three pairs of parallel data files plus six supporting sgm/xml files needed to complete a training and tuning process.

tm folder

DoMY

CorpusFiltergraph supports two different tm folder types. Both tm folder types host aligned parallel data. A tm folder under the BUILDS root type hosts tm build sets. A tm folder under a corpus type tree in the CORPORA root type hosts parallel data in source and target language folders.

tokenization

SMT

Tokenization is the process of separating words from punctuation and symbols into tokens.
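
A minimal Python sketch of the idea follows. The regular expression is an assumption chosen for illustration and is far simpler than the language-specific rules a production tokenizer applies, for example for abbreviations, numbers or apostrophes.

```python
import re
from typing import List

# Hypothetical rule: a token is either a run of word characters or one
# single character that is neither a word character nor whitespace.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation/symbol tokens."""
    return _TOKEN_PATTERN.findall(text)


def to_token_line(text: str) -> str:
    """Return the tokens joined by single spaces, one sentence per line."""
    return " ".join(tokenize(text))


tokens = tokenize("Hello, world!")
line = to_token_line("It costs $5.")
```

Joining the tokens with spaces produces the space-separated form in which downstream tools can recover tokens with a plain split.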

tokens

SMT

Tokens are the basic units in a machine translation process. A token is a sequence of characters, such as a word, punctuation mark or symbol, separated from other tokens by a space. See: BLEU score

toolchain

SMT

A “toolchain” is a series of linked or “chained” programming tools where the output of an “upstream” tool becomes the input for a “downstream” tool. See: CorpusFiltergraph

training corpus

SMT

A linguistic corpus with parallel data prepared for training into the phrase table and reordering table components of a translation model.

training data

SMT

See: training corpus

training process

SMT

Training is a process in the machine learning branch of the artificial intelligence field. In the training process, a system “learns” the relationships between parallel data. In SMT, the source language texts are stimuli that generate the target language text as a response. In practical terms, training starts with the bitext.* files and creates the phrase table and reordering table that are components of a translation model.

TRAININGS folder

DoMY

The TRAININGS tree is a root type that hosts the “alignments” and “merts” subfolder trees.

translation engine

DoMY

A “translation engine” consists of a translation model, a supporting “translate” graph and a recaser model. The translate graph prepares the source language documents before translation and post-processes the target language data into usable text with the recaser model.

translation memory

SMT

A translation memory (tm) is parallel data that was collected for the purpose of aiding future translations.

translation model

SMT

A “translation model” consists of one or more phrase tables, zero or more reordering tables, one or more language models and one moses configuration file that were created during the training and tuning processes.

TRANSLATIONS folder

DoMY

The TRANSLATIONS tree is a root type that hosts subfolders that support translation and alignment of translated target language output with source language. Source language documents should be placed in the “lm” subfolder tree. Translated target language documents are saved in the “tm” subfolder tree with a copy of the source language.

tuning process

SMT

Tuning is a process that finds the optimized configuration file settings for a translation model when used for a specific purpose. The tuning process translates thousands of source language sentences in the mert.* files with a translation model, compares the model's output to a set of reference human translations, and adjusts the settings with the intention of improving the translation quality. The tuning process repeats these steps through numerous iterations until it reaches an optimized translation quality.

words

SMT

A word is the smallest unit of meaning in a language that will stand on its own. In SMT, a word is a token created in the tokenization process that is not punctuation or a symbol.

word aligner

SMT

A word aligner is a program that creates word alignment files during the word alignment process. Moses currently supports these word aligners: GIZA++, MGIZA++, and BerkeleyAligner. DoMY uses MGIZA++ by default.

word alignment

SMT

The word alignment process uses a word aligner to create a word alignment file during the training process. The file is saved under the alignments folder in the TRAININGS root type.

workbench folder

DoMY

Folders with the “workbench” suffix under the CORPORA root type are paired with a stage of processing, such as “sa” and “sa-workbench”. Workbench folders contain data that failed an automated quality control process for its stage pair and should be reviewed by an editor.
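
The routing of failed data can be sketched in Python. The length-ratio test below is a hypothetical example of an automated quality check, not a description of the checks DoMY actually applies, and the function name and `max_ratio` default are assumptions.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def split_by_length_ratio(
    pairs: Iterable[Pair],
    max_ratio: float = 3.0,
) -> Tuple[List[Pair], List[Pair]]:
    """Separate sentence pairs into (passed, workbench) lists.

    A pair fails when either side is empty or when the longer side has
    more than max_ratio times as many tokens as the shorter side.
    """
    passed: List[Pair] = []
    workbench: List[Pair] = []
    for source, target in pairs:
        shorter, longer = sorted((len(source.split()), len(target.split())))
        if shorter == 0 or longer / shorter > max_ratio:
            workbench.append((source, target))
        else:
            passed.append((source, target))
    return passed, workbench


passed, workbench = split_by_length_ratio(
    [("a b c", "x y z"), ("a", "x y z w v"), ("", "x")]
)
```

Failed pairs are kept rather than discarded, which matches the purpose of a workbench: an editor can review and repair them.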

writer plug-in

DoMY

Writer class plug-ins serve as the output from CorpusFiltergraph because they write data from upstream data buffers to a storage mechanism outside CorpusFiltergraph. Current writer class plug-ins include “writer-file.py” that writes text data files to a hierarchical file system, and “writer-tmx.py” that writes XML files compliant with the TMX version 1.4 specification with sentence segment types. Future writer class plug-ins could include a “writer-sql.py” plug-in to save data directly to an SQL database. Writer plug-ins are language-independent.

DoMY Glossary v 1.0
2011-10-22 21:15
Copyright © 2011 Precision Translation Tools Co., Ltd.