[Guest Post] The Need For MT Evaluation

After I published my review of Slate Desktop a few weeks ago, I received some interesting comments. Specifically, my statement

If you’re a translator who works with texts that are not confidential, your best bet would be to choose Google Translate (of course, it also depends on your language pairs). If you translate mostly confidential texts and your clients require maximum privacy, then SD could help you leverage your language data.

seems to have intrigued some readers. Is GNMT really better than a Slate engine? Why build an SMT engine when NMT is already available? These and other questions came up. I plan to publish a post about all this as soon as possible.

I kept my test of Slate Desktop as basic and as quick as possible to see if it was really simple to build an SMT engine. But of course, the devil is in the details. So, I invited Tom Hoar, Slate's CEO, to give a more technical overview of what needs to be taken into account when building and evaluating your own engine. We also agreed on a test to see which was best: Slate or GNMT. You can read about it in the guest post below.

By the way, if you’re a blogging translator and are willing to invest some time in testing Slate, you might want to join Slate’s Blogging Translator Review Program.


The Need For MT Evaluation – by Tom Hoar

Subjective observations of machine translation (MT) linguistic quality are simple and easy for 35-40 words in a few example segments, but they reveal nothing about long-term translation quality or the translator’s experience across several projects of 10,000 words each.

A truly objective, accurate and automated evaluation of MT linguistic quality is beyond today's state of the art. In fact, this deficit is what leads to the poor quality of MT output in the first place. This doesn't mean MT is useless, though: translators use MT every day.

What are MT evaluations good for if they can’t accurately report a translation’s quality?

Slate's evaluation scores do not tell you about the quality of an engine's translations. Instead, Slate focuses on describing engine criteria that can be measured objectively. Here, I generically refer to these criteria as an engine's "linguistic performance." The scores indicate how an engine might reduce or increase a translator's workload compared to another engine. With objective evaluation scores, you can better predict how an engine might affect your work efficiency in the long term.

So, let’s look at the best practices of MT evaluation. Then, I’ll review Isabella’s engine scores with a focus on how they relate to her client’s work. Finally, I’ll compare Google’s output from the same evaluation segments with Isabella’s engine results.

Evaluation Best Practices

Current MT evaluation best practices require an evaluation set with 2,000-3,000 source-target segment pairs. The source segments represent the variety of work that the translator is likely to encounter. The target segments represent the desired reference translations.

The evaluation process uses the MT engine you're evaluating to create "test" segments from the evaluation set's source segments. It then measures each "test" segment against its respective "reference" and assigns a "closeness" score. These are like fuzzy match scores, but measured between reference and test segments rather than between source and TM segments. The process accumulates the individual scores, for example as an average, to describe how the engine performed with that evaluation set. A performance description for one engine has some value, but it's much more valuable to compare descriptions from different engines on the same evaluation set to tell us which engine performs better.
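To make the idea concrete, here is a minimal sketch (not Slate's actual code; the helper names `edit_distance` and `closeness` are hypothetical) of a fuzzy-match-style closeness score built on a word-level edit distance, accumulated as an average over an evaluation set:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n]

def closeness(test, reference):
    """Similarity in [0, 1]: 1.0 is an exact match, like a 100% fuzzy match."""
    t, r = test.split(), reference.split()
    if not t and not r:
        return 1.0
    return 1.0 - edit_distance(t, r) / max(len(t), len(r))

# Toy evaluation: each MT "test" segment scored against its "reference".
tests      = ["the quick brown fox", "hello there world"]
references = ["the quick brown fox", "hello world"]
scores = [closeness(t, r) for t, r in zip(tests, references)]
average = sum(scores) / len(scores)
```

Slate's real scores (BLEU, edit distance, quality quotient) are defined differently, but they all follow this same pattern: score each test/reference pair, then aggregate.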

Measuring Isabella’s Engine

Isabella reported she started with three .tmx files and 250,768 segment pairs from the same client since 2003. Her Engine Summary (image below) shows Slate built Isabella’s engine from 119,053 segments after it removed 131,715 segment pairs (53%) for technical reasons. You can learn more about translation memory preparation on our support site.

Slate randomly removed and set aside 2,353 segment pairs that represent Isabella's 14 years of work as the evaluation set, leaving only 116,700 pairs to create the engine's statistical models. During the evaluation process, the source segments are like a new project from the engine's viewpoint. That is, the engine is not recalling segments that were used to build it. This evaluation strategy gives a 95% confidence that the engine will perform similarly when Isabella gets a new project from this client.
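The held-out evaluation set is a standard random split. A sketch of the idea (illustrative only; `split_evaluation_set` is a hypothetical name, not a Slate API):

```python
import random

def split_evaluation_set(pairs, eval_size, seed=42):
    """Randomly set aside eval_size pairs; the rest train the engine's models."""
    rng = random.Random(seed)
    eval_idx = set(rng.sample(range(len(pairs)), eval_size))
    eval_set = [p for i, p in enumerate(pairs) if i in eval_idx]
    train_set = [p for i, p in enumerate(pairs) if i not in eval_idx]
    return train_set, eval_set

# Toy corpus standing in for Isabella's 119,053 prepared segment pairs.
corpus = [(f"src {i}", f"tgt {i}") for i in range(1000)]
train, held_out = split_evaluation_set(corpus, eval_size=20)
```

Because the engine never sees the held-out pairs, its scores on them estimate how it will behave on genuinely new work from the same client.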

Isabella’s Engine vs Google

Before I could compare the performance of Isabella’s engine to Google, Isabella graciously granted me permission to translate her evaluation set’s 2,353 source segments using Google Translate. Here are Google’s evaluation scores side-by-side with Isabella’s.


Evaluation Set

    Segment count                                  2,353
    Average segment length (words per segment)     16.5

Evaluation Scores                          Google Translate    Isabella's engine
Date                                       2017-08-11          2017-07-29
Evaluation BLEU score (all)                33.07               69.33
Evaluation BLEU score (1.0 filtered)       32.47               61.82
Quality quotient                           4.33%               29.75%
Edit Distance per line (non-zero)          42                  32
Exact matches count                        102                 700
Edit Distance entire project               93,605              52,856
Average segment length (exact matches)     4.7                 11.4


This table includes a variety of scores, but these are the three that I rely on the most: the average segment length, the Quality quotient, and the Evaluation BLEU score (1.0 filtered).

The average segment length of source segments in the evaluation set tells us if Isabella's translation memories are heavily weighted with terms, such as from a termbase. Isabella's 16.5 average above is normal, and the translation memories likely include a good balance of short and long segments. If the average were very small (for example 5 words), the engine would work poorly with long sentences.

The quality quotient (QQ) score means it's likely that Isabella will simply review up to 30% of segments as exact matches when she works with her engine on her client's future projects. Exact matches with this engine are about 7 times more likely than if she did the same work with Google.
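The QQ numbers follow directly from the exact-match counts in the table above; a quick arithmetic check (plain illustration, not Slate code):

```python
# Quality quotient (QQ) = exact matches / evaluation segments,
# using the counts from the comparison table.
segments = 2353
qq_engine = 700 / segments   # Isabella's engine: ~29.75%
qq_google = 102 / segments   # Google Translate: ~4.33%
ratio = qq_engine / qq_google  # roughly 7x more exact matches
print(f"engine QQ: {qq_engine:.2%}, Google QQ: {qq_google:.2%}, ratio: {ratio:.1f}x")
```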

The evaluation BLEU score (filtered) represents the amount of typing and/or dictation work Isabella will need to do when her engine fails to suggest an exact match. Her engine's score of 61.8 indicates that its segments are likely to require less work than segments from Google, which scored 32.5. It's important to note that this evaluation set's Google BLEU score is comparable to Google's scores in other published evaluations.
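For readers curious about what a BLEU score actually measures, here is a minimal, unsmoothed corpus BLEU sketch (clipped n-gram precisions up to 4-grams, geometric mean, brevity penalty). This is the standard textbook formulation, not Slate's implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(tests, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale: geometric mean of clipped
    n-gram precisions, times a brevity penalty for short output."""
    matches = [0] * max_n
    totals = [0] * max_n
    test_len = ref_len = 0
    for test, ref in zip(tests, references):
        t, r = test.split(), ref.split()
        test_len += len(t)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            t_ng, r_ng = ngrams(t, n), ngrams(r, n)
            matches[n - 1] += sum((t_ng & r_ng).values())  # clipped counts
            totals[n - 1] += sum(t_ng.values())
    if 0 in totals or 0 in matches:
        return 0.0  # unsmoothed BLEU is zero if any precision is zero
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if test_len > ref_len else math.exp(1 - ref_len / test_len)
    return 100 * bp * math.exp(log_prec)

identical = corpus_bleu(["the quick brown fox jumps"],
                        ["the quick brown fox jumps"])
truncated = corpus_bleu(["the quick brown fox jumps over"],
                        ["the quick brown fox jumps over the lazy dog"])
```

An exact match scores 100; the truncated output keeps perfect n-gram precision but is penalized by the brevity penalty, which is why BLEU tracks "how much remains to type" better than a simple match count.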

Putting It All Together

Isabella described her translation memories as client-specific, containing mostly her own translations, those of a trusted colleague, and some from unknown colleagues. She said, "All in all, a great mess" because they contain some terminological discrepancies, long convoluted segments, and other segments that are only one word long. She created her engine on her 4-year-old laptop computer in less than a day without any specialized training.

Isabella's evaluation set is a representative subset of the corpus that Slate created to build the engine. The evaluation set's scores show that her engine significantly outperforms Google Translate in every measured category. Furthermore, because of how Slate created the evaluation set, and because her translation memories are primarily her own work for this client, she has a 95% likelihood of experiencing similar performance with future work from that client.

When Isabella works on projects with Slate, her engine is likely to give her 7 of 10 segments that require changes (the converse of the QQ). Like many users, she might find these suggestions overwhelming because she’s accustomed to the CAT hiding the suggestions from poor fuzzy matches. Still, 70% represents much less work than the 96% she would likely receive from Google. With a little practice, it’s easy and fast to trash segments that require radical changes and start from scratch.

There's no way to predict how her engine will perform with work from other clients or other subject matter. The nature of statistical machine translation technology tells us that performance will degrade as a project's linguistic contents diverge from the engine corpus' contents. Isabella's engine's performance could drop significantly for projects with disparate linguistic content. Fortunately, Isabella controls her engine, and Slate gives her tools to clean up the "great mess," for example forced terminology files to resolve the terminological discrepancies.

This was her first engine and she can experiment to her heart’s content. She can create as many engines as she likes. She can mix various translation memories and compare their performance, much like I compared her engine to Google in this article. Furthermore, she can experiment without any additional cost. If she has translation memories for five clients, she can create one engine for each of them or one that combines all. I look forward to hearing about her experiments.

When using Google Translate, Isabella needs to wait for Google to update and improve their engine. For example, her Google results reflect Google's recent update of its en-it engine to NMT. To Google's credit, it handles variations across different subjects better than Isabella's engine likely will. As Isabella pointed out, Google "has been constantly improving since inception." So, across many different subjects, Google will continue to deliver 4% to 5% exact matches.

Fortunately, Isabella doesn't face an either-or decision. Isabella's first Slate Desktop engine performs well with her client's projects, but we don't know how it will perform with other projects. It costs her nothing to try it or improve it. Finally, she can also use Google whenever she feels it might be beneficial.