Web Align Toolkit

WAT aligns parallel texts using one of several alignment engines:

LF Aligner: a fast and reliable state-of-the-art aligner based on Hunalign. It integrates various bilingual lexicons and can also align more than two texts at once.

YASA: a very fast and reliable state-of-the-art aligner

JAM: a multi-aligner that can align more than two texts at once (beta version; some bugs still need fixing)

Alinea Lite: a simple aligner written in Prolog that uses cognates and sentence lengths (robust but slow)

Note: when several versions of the same text, all in one language, are provided and LF Aligner is used, a BLEU score is computed. The first text is treated as the hypothesis and the others as the reference translations. The BLEU score is written to a separate file in the output directory.
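To make the hypothesis/reference roles concrete, here is a minimal Python sketch of a single-segment BLEU score with several references. It assumes whitespace tokenization, uniform weights over 1- to 4-grams and no smoothing; the tokenization and exact BLEU variant that WAT uses are not described on this page, so the function name and these choices are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=4):
    """Segment-level BLEU with uniform weights and no smoothing (illustrative)."""
    hyp = hypothesis.split()  # assumption: whitespace tokenization
    refs = [r.split() for r in references]
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each hypothesis n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the hypothesis length.
    ref_len = min((len(r) for r in refs), key=lambda l: (abs(l - len(hyp)), l))
    bp = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


score = bleu("the cat sat on the mat",
             ["a cat sat on the mat", "the cat sat on a mat"])
```

The first argument plays the role of the first uploaded text (the hypothesis); the list holds the remaining versions (the references).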

Paste the source and target texts into the text area below. To align more than two texts, use the "Upload files and run" tab.

Drag and drop files below. Corresponding files must share the same name and differ only in the language extension. Example: balzac.fr.txt, balzac.en.txt
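The naming rule can be checked locally before uploading. The following Python sketch groups file names of the form base.lang.ext by their base name; the function name is hypothetical and the grouping logic is only an illustration of the convention, not the tool's own code.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_base_name(file_names):
    """Map each base name to a {language: file name} dictionary."""
    groups = defaultdict(dict)
    for name in file_names:
        parts = PurePath(name).name.split(".")
        if len(parts) < 3:
            continue  # not of the form base.lang.ext, so it cannot be paired
        base, lang = ".".join(parts[:-2]), parts[-2]
        groups[base][lang] = name
    return dict(groups)


groups = group_by_base_name(["balzac.fr.txt", "balzac.en.txt", "notes.txt"])
```

A base name that ends up with a single language has no counterpart to be aligned with.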

Please use the following extensions: txt for plain text, ttg for TreeTagger output, ces for cesAna or cesAlign format, xml for XML format. A zip archive may be uploaded to process many files at once.

Notes

The maximum file size for uploads in this version is 999 KB.
Only TXT, TTG (TreeTagger), XML (segmented and/or tokenized text), TMX and CES files are allowed.
Uploaded files will be deleted automatically after 1 day.
You can drag and drop files from your desktop onto this webpage (see Browser support).
Please contact [olivier kraif At univ-grenoble-alpes fr] for more information.
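The upload limits in these notes can be pre-checked with a short Python sketch. The function name is hypothetical, and treating 1 KB as 1024 bytes is an assumption, since the page does not say which convention the server applies.

```python
from pathlib import PurePath

# Extensions listed in the notes, plus zip for the archive option.
ALLOWED_EXTENSIONS = {".txt", ".ttg", ".xml", ".tmx", ".ces", ".zip"}
# Assumption: 1 KB = 1024 bytes; the server may count 1000 bytes instead.
MAX_UPLOAD_BYTES = 999 * 1024


def upload_problems(file_name, size_bytes):
    """Return a list of reasons why a file would likely be rejected."""
    problems = []
    suffix = PurePath(file_name).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension: {suffix or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

An empty list means the file passes both checks under the stated assumptions.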

The aligner outputs are listed below.

Click on a file to open it in the browser. To download it, right-click and choose "Save link target". The input files appear in the root directory.
The "seg" subdirectory contains the segmented input files (when the option "Split paragraphs into sentences" is selected).

The following parameters have to be adjusted if you use a different naming convention, file encoding, etc.

Split paragraphs into sentences (for txt format only)
Paragraphs are split at [.?!] followed by an uppercase letter and at [;:] marks (using an abbreviation dictionary).
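The splitting rule can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the tool's implementation: the abbreviation set below is hypothetical (the dictionary WAT uses is not shown), and a boundary is only considered where the punctuation mark is followed by whitespace.

```python
import re

# Hypothetical abbreviation list; the dictionary WAT actually uses is not shown.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "etc.", "e.g.", "i.e."}

# Candidate boundary: whitespace right after one of . ? ! ; :
BOUNDARY = re.compile(r"(?<=[.?!;:])\s+")


def split_sentences(paragraph):
    """Split at [.?!] + uppercase letter and at [;:], skipping abbreviations."""
    paragraph = paragraph.strip()
    sentences, start = [], 0
    for match in BOUNDARY.finditer(paragraph):
        punct = paragraph[match.start() - 1]
        next_char = paragraph[match.end():match.end() + 1]
        if punct in ".?!" and not next_char.isupper():
            continue  # [.?!] only ends a sentence before an uppercase letter
        last_word = paragraph[start:match.start()].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation
        sentences.append(paragraph[start:match.start()])
        start = match.end()
    if paragraph[start:]:
        sentences.append(paragraph[start:])
    return sentences


parts = split_sentences("Dr. Smith arrived. He sat down; then he spoke.")
```

Here the period after "Dr." is skipped because of the abbreviation list, while the semicolon triggers a split regardless of the case of the next word.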

File encoding: the encoding of the input files.

File name pattern: the string pattern that describes how file names are built. The default value corresponds to names such as alice.en.txt, alice.fr.txt, alice.es.txt.

Language pattern (when using YASA or Alinea): in addition to the file name pattern, this regex pattern describes where the language is defined in the file names. The capturing parentheses delimit the language code in the name. The default value corresponds to names such as alice.en.txt, alice.fr.txt, alice.es.txt.
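As an illustration of how capturing parentheses isolate the language code, here is a small Python sketch. The regular expression is hypothetical: it matches names such as alice.en.txt, but the tool's actual default value is not shown on this page, and its regex dialect may differ from Python's.

```python
import re

# Hypothetical pattern: a two-letter code between two dots, before the extension.
LANGUAGE_PATTERN = re.compile(r"\.([a-z]{2})\.[^.]+$")


def language_of(file_name):
    """Return the language code captured by the first group, or None."""
    match = LANGUAGE_PATTERN.search(file_name)
    return match.group(1) if match else None


codes = [language_of(n) for n in ("alice.en.txt", "alice.fr.txt", "alice.es.txt")]
```

If your files follow another convention (for example a language suffix after an underscore), the capturing group has to be moved accordingly.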

Align file name: the string that describes how to build the name of the output aligned file. The default value corresponds to names such as alice.en-fr.tmx.

Radius of the search space around anchor points (only for YASA): for texts that are not closely parallel, with large gaps, the search space has to be enlarged by raising the radius.

Online parallel text aligner and format converter