In the land of a low-data language


This article was originally published on the TAUS blog.

A republic of some 26,000 sq mi (68,000 sq km) in the central part of European Russia, with a population of 3.8 million people, Tatarstan is the third largest oil-producing region in Russia (over 30 million tons of crude oil per year). In 2011, the per-capita gross regional product amounted to approx. US$11,206.

The Tatar model

In addition to oil and other natural resources, the republic has a relatively high credit rating, reflecting a low investment risk: in 2011, Forbes and Ernst & Young ranked Tatarstan as the best region for doing business in Russia. Tatarstan is also developing a strong technological edge, for example with the opening of a privately funded Techno Park in Kazan in 2014.

This growth potential is partly due to the “Tatar model”, the bilateral treaty that Tatarstan and Russia signed in 1994, confirming a substantial amount of financial and juridical autonomy for the Tatar republic. The treaty between the two parties is also aimed at promoting collaboration and maintaining the Tatar identity.

Linguistic details

Tatar is a Turkic language, with influences coming from Arabic, Chinese and Russian. According to Wikipedia, the number of speakers is in the region of 5.3 million. Tatar’s closest relative is Chulym, one of the languages “saved” by the linguists turned movie stars, K. David Harrison and Gregory Anderson. Unlike Chulym, Tatar is protected by a state program and is recognized as one of the official languages of the Republic of Tatarstan, but, as the title of an article on machine translation published on the Russian website Business News proclaims, “A language that is not included in iPad and iPhone is doomed”.

Promoting growth

In this setting, how do you promote economic and technological growth? The answer is – you guessed it – with machine translation.

At the end of January 2015, a few days after the opening of an office in Kazan, ABBYY Language Services (ABBYY LS), a division of the ABBYY Group, won a tender to provide the technological infrastructure for a “republic-scale” project aimed at the development of a Moses-based Russian/Tatar machine translation system. Anna Sidorova (Head of Marketing at ABBYY Language Services) explains: “For our standards, the Tatar translation market is huge, with a value of approx. 1 billion rubles [approx. US$ 3,000,000]. But most of all, it is a great chance for us to promote the development of the region and the innovative wave of the Tatar Republic, as well as to create an efficient translation infrastructure.”

The project will run from 2015 through 2020 with the participation of 40 linguists from the Tatarstan Academy of Sciences, which will manage the project. ABBYY LS estimates that, thanks to the new technology, the linguists’ productivity will double.

The first phase of this project is the collection of a parallel corpus of at least 100 million word phrases, mainly based on bilingual governmental documents.

The estimated funding amounts to 7 million rubles (approx. US$ 100,000), although more will probably be necessary as the many variables of the project come to light, for example after a rule-based machine translation system with semantic and syntactic parsing is integrated in the second year.

The machine translation system developed by ABBYY LS will provide two interfaces, one for professional and one for non-professional users. The non-professional interface is intended for government employees who need to have a document machine-translated on the fly. This interface will also include the option of adding translation memories and glossaries. Non-professional users will be asked to provide feedback and, if possible, to post-edit the raw MT output.

The essence of the professional interface will be SmartCAT, the CAT tool developed by ABBYY. Among other things, this tool allows updating translation memories in real time and processing PDF files. Recently SmartCAT has been expanded with new features (more about this coming soon).

During the second phase of the project, the National Corpus of the Tatar Language – a monolingual corpus with grammatical annotation of words – will be added to provide the morphological model. The corpus – consisting mainly of fiction and nonfiction texts – is currently implemented on the EANC technological platform, originally created for the Armenian language.

Meanwhile in the academic world…

The academic world has focused its attention on machine translation both from the linguistic point of view and from the point of view of artificial intelligence and machine learning.

Take a look at the Apertium website and you will find that, in the framework of the Google Summer of Code 2014, an M.Sc. student at the University of Stuttgart, Ilnar Salimzyanov, worked for three months on a rule-based machine translation system for Russian/Tatar. Currently Salimzyanov is simply maintaining the system, which means avoiding regressions caused by changes in the morphological analyzers for Tatar and Russian, two separate modules of the system that are used by other translators as well.

Tech bits

In a brief e-mail interview, Salimzyanov kindly included some links that might interest more tech-inclined readers. For example, here you can find the current state of the translation system. In the “Current state” table you can see that the bilingual dictionary contains 6,000 stems, which means that the system is still in a prototype phase.

The page on the regression tests (tests aimed at ensuring that any changes made to the decoder do not break what has already been determined to be correct) indicates the types of phrases and clauses Salimzyanov wrote rules for. For the given Tatar phrases/clauses on the left, the current output of the translator is exactly as specified on the right-hand side of the “equations”.
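To make the idea of a regression test concrete, here is a minimal sketch in Python. It is not the Apertium tooling or Salimzyanov's actual test setup: the function names, the toy translator and the two Tatar/Russian pairs are hypothetical and serve only to illustrate the "left side must yield right side" check described above.

```python
from typing import Callable, Dict, List, Tuple


def run_regression_tests(
    translate: Callable[[str], str],
    expected: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (source, expected, actual) for every pair that no longer matches."""
    failures = []
    for source, target in expected.items():
        actual = translate(source)
        if actual != target:
            failures.append((source, target, actual))
    return failures


# Hypothetical stand-in for the real translator: a plain lookup table.
# Unknown input is echoed back with a leading asterisk (a toy convention).
def toy_translate(text: str) -> str:
    lookup = {"китап": "книга", "зур йорт": "большой дом"}
    return lookup.get(text, "*" + text)


# Hypothetical test pairs: Tatar source on the left, expected Russian on the right.
EXPECTED = {"китап": "книга", "зур йорт": "большой дом"}

failures = run_regression_tests(toy_translate, EXPECTED)
print("all tests pass" if not failures else failures)
```

An empty list of failures means that everything previously judged correct still comes out the same way; any entry in the list points to a phrase whose translation changed after the system was modified.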

Finally, this is the small corpus used for developing rules. The file tat-rus-nova.txt contains the current output of the translator when the Tatar corpus is run through it.

Technology put to good use

We agree with Evgeny Morozov: technological solutionism is neither always the best nor the only way to solve the world’s problems. Nevertheless, the scenario presented here might reassure most linguists: those who deplore the death of minor languages (like David Crystal) and those who fear that economic growth could contribute to the extinction of minority languages. At the same time, the Russian/Tatar machine translation project confirms the theory formulated by Nicholas Ostler, who declared machine translation the new lingua franca.

Also, cooperation between the academic world and the business sector is not completely missing in this case, and it can contribute to development in both the linguistic and the economic fields. To make it happen more often and more fruitfully, TAUS bridges the gap between the two sides by acting as a meeting point between universities and technology companies.

