Web Align Toolkit

Online parallel texts aligner and format converter


WAT aligns parallel texts using one of several alignment engines:

  • LF Aligner: a fast and reliable state-of-the-art aligner based on Hunalign, which integrates various bilingual lexicons and can also align more than two texts at once
  • YASA: a reliable, very fast state-of-the-art aligner
  • JAM: a multi-aligner that can align more than two texts at once (beta version that still needs some bug fixes)
  • Alinea Lite: a simple aligner written in Prolog that uses cognates and sentence lengths (robust but slow)
  • Note: when several versions of the same text are provided in a single language and LF Aligner is used, a BLEU score is computed. The first text is treated as the hypothesis and the others as the reference translations. The BLEU score is written to a separate file in the output directory.
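
To make the hypothesis/reference roles in the note concrete, here is a minimal sketch of a BLEU computation in Python. It is not the code WAT or LF Aligner runs: whitespace tokenization, treating each text as a single segment, the absence of smoothing, and the function names `ngrams` and `bleu` are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=4):
    """Unsmoothed BLEU of one hypothesis text against several reference texts."""
    hyp = hypothesis.split()                      # assumption: whitespace tokens
    refs = [ref.split() for ref in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # For each n-gram, the highest count seen in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip hypothesis counts so repeated n-grams are not over-rewarded.
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0                            # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty, using the reference length closest to the hypothesis.
    closest = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    penalty = 1.0 if len(hyp) > closest else math.exp(1 - closest / len(hyp))
    return penalty * math.exp(sum(log_precisions) / max_n)


# The first text is the hypothesis, the remaining ones are references.
texts = ["the cat sat on the mat", "the cat sat on the mat", "a cat was on the mat"]
score = bleu(texts[0], texts[1:])
```

Because the hypothesis here is identical to one of the references, every n-gram precision is 1 and the score is 1.0; a hypothesis sharing no words with any reference scores 0.0.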

Just paste the source and target texts into the text area below. To align more than two texts, use the "Upload files and run" tab.


Just drag and drop files below. Corresponding files must share the same name and differ only in the language extension. Example: balzac.fr.txt, balzac.en.txt
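
The naming rule can be illustrated with a short Python sketch that groups files sharing a base name. The helper `group_by_base_name` is hypothetical and only assumes the `base.lang.ext` convention from the example; it is not part of WAT.

```python
from collections import defaultdict


def group_by_base_name(file_names):
    """Group names such as 'balzac.fr.txt' by base name, keyed by language."""
    groups = defaultdict(dict)
    for name in file_names:
        parts = name.rsplit(".", 2)       # base, language, extension
        if len(parts) != 3:
            continue                      # does not follow base.lang.ext
        base, lang, _extension = parts
        groups[base][lang] = name
    return dict(groups)


groups = group_by_base_name(["balzac.fr.txt", "balzac.en.txt", "notes.txt"])
print(groups)
```

Here `balzac.fr.txt` and `balzac.en.txt` end up in the same group, while `notes.txt` is skipped because it carries no language extension.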

Please use the following extensions: txt for plain text, ttg for TreeTagger output, ces for cesAna or cesAlign format, xml for XML format. A zip archive may be uploaded to process many files at once.


Notes

  • The maximum file size for uploads in this version is 999 KB.
  • Only TXT, TTG (TreeTagger), XML (segmented and/or tokenized text), TMX, and CES files are allowed.
  • Uploaded files will be deleted automatically after 1 day.
  • You can drag & drop files from your desktop onto this web page (see Browser support).
  • Please refer to [olivier kraif At univ-grenoble-alpes fr] for more information.

The aligner outputs are listed below.

Click on a file to open it in the browser. To download it, right-click and choose "save link target". The input files appear in the root directory.
The "seg" subdirectory contains the segmented input files (when the option "Split paragraphs into sentences" is selected).

The following parameters have to be adjusted if you use a different naming convention, file encoding, etc.
 
The paragraphs are split at [.?!] followed by an uppercase letter and at [;:] marks (using an abbreviation dictionary).
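
As a rough illustration of this splitting rule, the Python sketch below cuts after [.?!] when the next character is uppercase and after [;:], skipping tokens found in an abbreviation list. The `ABBREVIATIONS` set and the function `split_paragraph` are hypothetical; the dictionary and exact rules used by WAT may differ.

```python
import re

# Hypothetical abbreviation dictionary; the real one is not shown on this page.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "etc.", "e.g.", "i.e."}

# Candidate boundary: whitespace that follows one of . ? ! ; :
BOUNDARY = re.compile(r"(?<=[.?!;:])\s+")


def split_paragraph(paragraph):
    """Split a paragraph into segments following the rule described above."""
    segments = []
    start = 0
    for match in BOUNDARY.finditer(paragraph):
        punctuation = paragraph[match.start() - 1]
        next_char = paragraph[match.end():match.end() + 1]
        # [.?!] only ends a segment when an uppercase letter follows.
        if punctuation in ".?!" and not next_char.isupper():
            continue
        candidate = paragraph[start:match.start()]
        tokens = candidate.split()
        if tokens and tokens[-1] in ABBREVIATIONS:
            continue                      # e.g. "Dr." does not end a sentence
        segments.append(candidate)
        start = match.end()
    tail = paragraph[start:].strip()
    if tail:
        segments.append(tail)
    return segments


segments = split_paragraph("Dr. Smith arrived late. He apologized; then he sat down.")
print(segments)
```

The sample paragraph yields three segments: "Dr." is protected by the abbreviation list, the full stop before "He" ends a sentence, and the semicolon ends a segment regardless of the case of the next word.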
The encoding of the input files.
The string pattern that describes how file names are built. The default value corresponds to names such as: alice.en.txt, alice.fr.txt, alice.es.txt
When using YASA or Alinea: in addition to the file name pattern, this regex pattern describes where the language is defined in the file names. The capturing parentheses delimit the language code in the name. The default value corresponds to names such as: alice.en.txt, alice.fr.txt, alice.es.txt.
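
A sketch of how such a regex might work, in Python: the capturing group isolates the language code. The pattern below is an assumption chosen to match names like alice.en.txt; it is not the tool's documented default, which is not shown here.

```python
import re

# Hypothetical pattern: a two-letter code between the last two dots.
LANG_PATTERN = re.compile(r"\.([a-z]{2})\.\w+$")


def language_of(file_name):
    """Return the language code captured from a file name, or None."""
    match = LANG_PATTERN.search(file_name)
    return match.group(1) if match else None


codes = [language_of(name) for name in ["alice.en.txt", "alice.fr.txt", "alice.es.txt"]]
print(codes)
```

A file name without a language segment, such as alice.txt, does not match and gives `None`.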
The string that describes how the name of the aligned output file is built. The default value corresponds to names such as: alice.en-fr.tmx
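
The output naming can be pictured as filling a template, as in this Python sketch. The template syntax and the function `output_name` are hypothetical and only reproduce the example name; the actual placeholder syntax used by WAT is not shown here.

```python
def output_name(base, source_lang, target_lang, extension="tmx",
                template="{base}.{src}-{tgt}.{ext}"):
    """Build an aligned-file name from a (hypothetical) template string."""
    return template.format(base=base, src=source_lang, tgt=target_lang, ext=extension)


name = output_name("alice", "en", "fr")
print(name)  # alice.en-fr.tmx
```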
For texts that are not closely parallel and contain large gaps, the search space has to be enlarged and the radius has to be raised.