Data Augmentation with Translation Memories for Desktop Machine Translation Fine-tuning in 3 Language Pairs

This study investigates the effect of data augmentation through translation memories for desktop machine translation (MT) fine-tuning in OPUS-CAT. It also assesses the usefulness of desktop MT for professional translators. Engines in three language pairs (English → Turkish, English → Spanish, and English → Catalan) are fine-tuned with corpora of two different sizes. The translation quality of each engine is measured through automatic evaluation metrics (BLEU, chrF2, TER and COMET) and human evaluation metrics (ranking, adequacy and fluency). Overall evaluation results indicate promising quality improvements in all three language pairs and imply that desktop MT applications such as OPUS-CAT, with MT engines fine-tuned with custom data on a translator’s desktop, can potentially provide high-quality translations in addition to advantages such as privacy, confidentiality and low use of computation power.


Introduction
The increased quality of machine translation (MT) output since the advent of neural MT (NMT) has led to its integration into many translation and localization workflows, to the extent that MT post-editing "has now become the rule rather than exception in localization" (Esselink 2022: 90). As with other historic changes in translation production, this change has been mostly top-down, and Esselink feels that MT has been 'reluctantly' accepted. This chimes with discourse about the loss of agency (Abdallah 2012) on the part of translators when MT is unilaterally imposed rather than introduced using a participatory approach, which in turn has repercussions for translator morale and industrial sustainability. In this scenario, translators receive pre-populated MT output to post-edit, having had little or no input into the appropriateness of MT for their task and the training data used when preparing the system (Cadwell et al. 2018).
This article sets out an alternative scenario, in which translators themselves build their own free and open-source custom desktop NMT system to work within their familiar translation editing environment; thus, NMT becomes an empowering tool under their own control. We provide guidelines for system fine-tuning by professional translators and build on this by investigating the effects of data augmentation to ascertain the effectiveness of different amounts of data on the quality of a local NMT system. Since these NMT systems run locally, they can keep translated data secure to avoid it leaking externally, and being customizable, engines can be fine-tuned to potentially improve translation quality and consistency by adapting to the translation memories (TMs) of the translators. We use OPUS-CAT (Nieminen 2021), a software collection that provides these capabilities through pretrained NMT models (Tiedemann & Thottingal 2020) and a fine-tuning feature, with plugins to integrate with many CAT tools including OmegaT 1 , Trados 2 and memoQ 3 .
The evaluation section of this article aims to measure the quality improvements, if any, in MT engines in the localization domain, fine-tuned with differently sized custom corpora. The engines are trained in English → Turkish, English → Spanish, and English → Catalan using the OPUS-CAT MT application running on the Windows operating system. While quality improvements through data augmentation are foreseeable, our aim is to show that these improvements are feasible not only in a supercomputer environment but also on a personal computer, and to explore the effect of different fine-tuning corpus sizes with the objective of guiding professional translators across various language pairs. The translation quality of each engine is measured using four automatic evaluation metrics, namely BLEU (Papineni et al. 2002), chrF2 (Popović 2015), TER (Snover et al. 2006) and COMET (Rei et al. 2020), along with human evaluation metrics (ranking, adequacy and fluency).
Pretrained NMT models from OPUS-MT are mostly trained on mixed-domain corpora; therefore, specific TMs need to be added to adapt the style and terminology to a specific domain. However, empirical studies are needed on how large domain-specific TMs should be to provide significant quality improvements in the relevant domain. Our study concentrates on the localization domain in three language pairs and measures translation quality in three scenarios: i) no fine-tuning, ii) fine-tuning with a bilingual localization corpus of 500,000 source words and iii) fine-tuning with a bilingual localization corpus of more than 2,000,000 source words. Evaluating the quality of each engine with both automatic and human evaluation metrics allows us to observe how adding custom parallel corpora affects MT translation quality.
Translation corpora are obtained from Microsoft Visual Studio's Translation and UI Strings 4 for the English → Turkish and English → Spanish language pairs, and from Softcatalà for the English → Catalan language pair. These are compiled as TMs to be used as fine-tuning corpora. A total of 210 sentences in the localization domain are selected for automatic and human evaluation tasks. Human evaluation is conducted using three metrics (adequacy, fluency and ranking) within the KantanLQR 5 platform by three reviewers per language pair. KantanLQR allows the quality evaluation metrics to be customized and provides an interface that streamlines evaluation tasks, together with a dashboard for a quick overview of the results.
While we expect to observe quality improvements with each additional localization corpus, fine-tuning does not necessarily guarantee such an improvement. Our findings will provide insights for translators who would like to build and manage their own secure MT systems, effectively augmenting MT with their domain-appropriate data. It should be noted that usefulness in the context of this study is taken from a broader perspective, seeing MT as a resource in the workflow of the translator, not necessarily concerned with productivity gains through higher quality MT engines, but also highlighting tertiary issues such as control over data, transparency, and confidentiality. Nonetheless, improvements in MT performance following fine-tuning steps by professional translators may also imply more usefulness.

Related Work
Research on translator interaction with NMT has tended to focus on productivity or quality rather than its "usefulness… as a tool for professionals", focusing instead on improving the NMT systems themselves (Ragni & Nunes Vieira 2022: 153). Research on human factors in MT, for example, tends to focus on post-editing effort and productivity, although measurement of keystrokes or their approximation using the Human-targeted Translation Edit Rate (HTER; Snover et al. 2006) metric gives an indication of the usefulness of MT. Studies on user interfaces (UIs) for translator interaction with MT aim to make MT more useful so that interactions become more user friendly with reduced cognitive friction (e.g. Moorkens and O'Brien 2017; Herbig et al. 2020). However, the usefulness of MT is again not the focus of such research.
Studies such as those of Kenny and Doherty (2014), Martín Mor (2017), Ramírez-Sánchez et al. (2021) and Kenny (2022) have highlighted the didactics of teaching MT to translators. Free and open-source platforms such as MTradumàtica 6 (for statistical MT) and MutNMT 7 (for NMT) have allowed translators to experiment with all steps of MT training in an experimental environment. The availability of these platforms helps professional translators understand the capabilities and limitations of MT, and make informed decisions about their uses of MT. These platforms are built for educational purposes, for students and professional translators who would like to integrate MT into their workflow using a stable, easy-to-use and flexible tool.
The convergence of different projects within the OPUS platform (Tiedemann et al. 2022) such as OPUS Corpus (Tiedemann 2012), OPUS-MT (Tiedemann and Thottingal 2020) and OPUS-CAT (Nieminen 2021) has, among other things, paved the way for translators to use MT in different ways in their workflow. The release of OPUS-CAT has particularly bridged the gap between MT research and professional use of MT by translators. OPUS-CAT is a software collection with a graphical UI that runs on Windows; it allows translators to use pretrained NMT models from OPUS-MT, fine-tune them with their TMs (or TMs from their clients or other sources of free and open-source corpora) and connect them to their CAT tool environment. Such a setup has many advantages. For example, it lets the translator assume control of the MT system and regularly update MT engines with TMs without allowing client data to leak to third parties. Furthermore, the presence of pretrained NMT models decreases computational costs and allows for reuse of these models, which reduces environmental and energy costs. This setup helps to address some concerns related to transparency, confidentiality, unethical data use, and privacy as highlighted in Moorkens and Lewis (2019) and Moorkens (2022).
Finally, localization (Esselink 2003) has been one of the fastest-growing domains in the language industry. However, there are few academic studies on the domain in general (Jiménez-Crespo 2020; Ramos et al. 2022). It is particularly hard to find studies that focus on the use of MT in localization scenarios. The study herein aims to provide baseline results from different language pairs on MT and localization, a domain which is characterized by inline format tags, variables, adaptation aspects, short strings, and context dependencies, all of which are known to cause problems for MT.

Methodology and Research Design
Three types of MT engines were used or created per language pair. The first type of engine is a pretrained model from OPUS-MT (Tiedemann and Thottingal 2020). It was downloaded to OPUS-CAT (Nieminen 2021) through the built-in feature "Install OPUS Model from Web". Once the download was complete, the engine was ready for translation. This type of engine is referred to as the "baseline model" throughout the present study. The pretrained models have the advantage of not requiring the end user to train an engine from scratch. This means that translators do not need to spend huge amounts of money on expensive hardware for NMT training or on electricity for resource-intensive computation during training. Once trained, pretrained NMT engines can be used and shared without the need to repeat this process, making them more environmentally friendly and sustainable (Tiedemann et al. 2022: 1).
English → Turkish 8 , English → Spanish 9 and English → Catalan 10 baseline pretrained NMT models are hosted in the GitHub repository of the Language Technology Research Group at the University of Helsinki. The second type of engine was created by fine-tuning these baseline models with a localization corpus of approximately 500,000 source words extracted randomly from the larger versions of the corpora. Finally, the third type of engine was created by fine-tuning the baseline model with a localization corpus that has between 2,300,000 and 2,700,000 source words. The exact source and size of each corpus is described in Section 3.1.
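The random extraction of the smaller corpus can be illustrated with a minimal sketch. It assumes that whole translation units are drawn at random until a source-word budget is reached and that words are counted by whitespace splitting; the function name `sample_by_word_budget`, the fixed seed, and the toy corpus are hypothetical and are not taken from the study's actual procedure.

```python
import random


def sample_by_word_budget(pairs, budget, seed=0):
    """Randomly pick (source, target) pairs until the source-word budget is reached.

    Assumption: words are counted by splitting the source on whitespace.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    subset, words = [], 0
    for source, target in shuffled:
        if words >= budget:
            break
        subset.append((source, target))
        words += len(source.split())
    return subset, words


# Toy corpus standing in for a large TM (hypothetical data, 4 source words per unit).
corpus = [(f"Open file number {i}", f"target {i}") for i in range(1000)]
subset, words = sample_by_word_budget(corpus, budget=500)
```

Sampling whole translation units keeps each source segment paired with its target, so the subset can be saved as a smaller TM without further alignment.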
Fine-tuning was conducted within OPUS-CAT by selecting the baseline model ("Fine-tune selected model"), importing the relevant TMX file, giving the prospective fine-tuned engine a specific name in the next window, and clicking the "Fine-tune" button. With the default fine-tuning settings (a single thread and a workspace of 2048 MB; stopping after one epoch; learning rate: 0.00002), the training time varies and can be long, depending on the size of the fine-tuning corpus and the computational power used. In our study, we used a laptop with 16 GB RAM, a GeForce MX150 graphics card (total available graphics memory: 10183 MB), and an Intel Core i7-8550 CPU. With these specifications and the default fine-tuning settings in OPUS-CAT, it takes approximately 4 hours to fine-tune with 500,000 source words and approximately 10 hours with 2,000,000 source words. It is possible to change fine-tuning parameters such as epochs and learning rate. For this study, we kept the default settings, assuming that a translator using OPUS-CAT would not make any change to these parameters.

Corpus Statistics
Aside from the baseline model that does not include additional fine-tuning, the study involves fine-tuning pre-trained engines with two different localization corpus sizes: 500,000 source words and more than 2,000,000 source words. Parallel corpora in English → Turkish, English → Spanish and English → Catalan were compiled from resources available on the web and used in the TMX format. English → Turkish and English → Spanish corpora were obtained from Microsoft Visual Studio's Translation and UI Strings. These corpora are available as sets of various CSV (Comma Separated Values) files. The files were consolidated into a single TMX file using memoQ's multilingual delimited text filter, which allows conversion of a bilingual spreadsheet into TMX in a few steps. The English → Turkish corpus includes 2,300,000 source words while English → Spanish includes 2,700,000 source words. Corpus sizes across language pairs differ since the original source files in Visual Studio are of different sizes depending on the language pair. The study aimed to use all available corpora to the extent possible. While we tried to keep corpus sizes similar, it was not necessary for them to be the same, since the main objective of the study is not to make comparisons across language pairs but to examine quality improvements through data augmentation. Hence, different sizes in the large corpus scenario may provide different insights for professional translators.
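For readers without access to memoQ, the CSV-to-TMX consolidation can be approximated with a short script. The following is a minimal sketch, not the procedure used in the study: it assumes a two-column CSV (source, target) with a header row, and the function name `rows_to_tmx`, the tool name written to the header, and the sample strings are all hypothetical. The real Microsoft files may use a different column layout.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def rows_to_tmx(rows, src_lang, tgt_lang):
    """Build a minimal TMX 1.4 document from (source, target) rows."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "csv2tmx-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "csv",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for row in rows:
        # Skip malformed rows and rows with an empty source or target.
        if len(row) < 2 or not row[0].strip() or not row[1].strip():
            continue
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, row[0]), (tgt_lang, row[1])):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Toy CSV content (hypothetical); real files would be opened from disk.
csv_text = "Source,Target\nSave,Kaydet\nCancel,İptal\n,\n"
reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
tmx_xml = rows_to_tmx(reader, "en", "tr")
```

Several CSV files can be consolidated by chaining their readers into one call, which yields a single TMX document as in the workflow described above.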
The English → Catalan corpus in Microsoft Visual Studio, containing fewer than 800,000 source words, was deemed too small for fine-tuning in this language pair and was therefore not utilized. Instead, TMs from Softcatalà 11 , a nonprofit association that localizes free and open-source applications into Catalan, were compiled as a single TMX file to yield a larger corpus. Localization projects realized by this initiative include Mozilla, Bitcoin, LibreOffice, Ubuntu and WordPress, among others. The resulting TMX file used in the present study has 2,300,000 source words.
Out of these three large corpora, approximately 500,000 source words were copied and saved as separate, smaller corpora. These smaller corpora were used for fine-tuning the pretrained engines. Subsequently, larger versions of the TMX files were used to fine-tune the baseline model. Table 1 provides the detailed statistics of each engine, corpora sources, and engine names. One file from the Microsoft corpus was not used for fine-tuning and was instead allocated for automatic and human evaluation. This file was present in the three target languages; hence we were able to use the same file for the evaluation tasks. In each target file, segments that also occur in the large fine-tuning corpus and repetitions within the file were omitted, leaving only unique segments, from which 210 segments were selected for evaluation in each language. While most of the source segments were the same across the three language pairs, the omission steps led to slight differences; hence, the 210 segments are not exactly the same across language pairs. Corpora for all language pairs are available on GitHub 12 (together with the evaluation test set and evaluation results).
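The filtering of the evaluation file can be sketched as follows. This is an illustration under stated assumptions rather than the study's actual script: segments are compared by exact, case-sensitive match after whitespace normalization, and the first occurrence of a repeated segment is kept. The function names and sample segments are hypothetical.

```python
def normalize(text):
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return " ".join(text.split())


def build_test_set(candidates, training_sources):
    """Drop candidates whose source is in the training corpus or repeated in the file."""
    seen = {normalize(source) for source in training_sources}
    kept = []
    for source, target in candidates:
        key = normalize(source)
        if key in seen:
            continue  # already in the fine-tuning corpus or an earlier repetition
        seen.add(key)
        kept.append((source, target))
    return kept


# Toy data (hypothetical): one repetition and one overlap with the training corpus.
candidates = [
    ("Open file", "A"),
    ("Open  file", "B"),  # repetition once whitespace is normalized
    ("Save", "C"),        # also present in the training corpus
    ("Close", "D"),
]
test_set = build_test_set(candidates, training_sources=["Save"])
```

Removing overlap with the fine-tuning corpus matters because segments seen during fine-tuning would inflate the scores of the fine-tuned engines relative to the baseline.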

MT Evaluation
Both automatic and human evaluation were employed to evaluate quality. BLEU, chrF2, TER and COMET were the automatic evaluation metrics used with human reference translations. The automatic evaluation was completed using the MATEO 13 platform (Vanroy et al. 2023) by uploading sample MT outputs and human translations one by one. MATEO has the advantage of providing confidence intervals and p-values for detecting significant differences between baseline engines and fine-tuned engines. Once the automatic evaluation was complete, human evaluation by professional translators was initiated.
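To give an intuition for how confidence intervals and p-values of this kind can be obtained, the sketch below shows a generic paired bootstrap over segment-level scores. It is a simplified illustration only: MATEO's own procedure may differ (for instance, by recomputing corpus-level metrics on each resample), and the function name, the one-sided p-value definition, and the toy scores are assumptions.

```python
import random


def paired_bootstrap(baseline_scores, system_scores, n_resamples=1000, seed=0):
    """Estimate a 95% interval and a one-sided p-value for the mean score difference.

    Assumption: higher scores are better and both lists cover the same segments.
    """
    if len(baseline_scores) != len(system_scores):
        raise ValueError("score lists must be aligned segment by segment")
    rng = random.Random(seed)
    n = len(baseline_scores)
    diffs = []
    for _ in range(n_resamples):
        # Resample segment indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(system_scores[i] - baseline_scores[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    # Share of resamples in which the system does not beat the baseline.
    p_value = sum(d <= 0 for d in diffs) / n_resamples
    return {"ci95": (lower, upper), "p_value": p_value}


# Toy segment-level scores for 210 segments (hypothetical values).
baseline = [0.5 + 0.001 * (i % 50) for i in range(210)]
system = [b + 0.1 if i % 3 else b - 0.05 for i, b in enumerate(baseline)]
result = paired_bootstrap(baseline, system)
```

Resampling the same segment indices for both systems is what makes the test paired: it controls for segment difficulty, which would otherwise dominate the variance on a test set of this size.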
Three professional translators per language pair participated in the evaluation tasks. All reviewers are native speakers of the target language and have extensive experience in the translation industry. Four reviewers reported more than 10 years of experience, two of them have 5-10 years of experience, two of them have 3-5 years, while one reviewer has 1-2 years of experience. All instructions (see Annex I) for the evaluation task were sent to the reviewers via email and a complete list of instructions about the evaluation platform was provided. They completed the three evaluations (ranking, adequacy, and fluency) together, one segment at a time.
The translators were asked to rank the three MT outputs from the best to the worst by assigning three points to the highest-performing engine and one point to the worst-performing. The order in which MT outputs were shown was randomized to avoid biases towards any engine. Once a rating was completed for a segment, the translators moved to the next window to rate the following segment. If MT outputs were considered to be of identical quality for two or more engines, equal scores were permitted. Then, translators rated the adequacy and fluency of the output using a scale of five, where five was the highest score and one the lowest. The tasks were completed within the interface of the KantanLQR platform. The interface showed one source segment and three MT outputs as well as ranking, adequacy, and fluency rating options, as may be seen in Figure 1. Table 2 shows the definitions and rating scales for adequacy and fluency according to KantanLQR. Reviewers had access to this information each time their mouse hovered over the "i" icon next to adequacy and fluency.

Table 2. Definitions and rating scales for adequacy and fluency according to KantanLQR.
Adequacy: Adequacy measures how much meaning is expressed in the machine translation segment. It is measuring whether the machine translation segment contains as much of the information as a human translation.
Fluency: Fluency is checking that the translation follows common grammatical rules and contains expected word collocation. This category scores whether the machine translation segment is formed in the same way a human translation would be.
Adequacy scale, 1: None of the meaning expressed in the source fragment is expressed in the translation fragment.

Results
This section includes automatic and human evaluation results. Firstly, overall evaluation results are presented, and then a breakdown is reported per language pair.

Automatic Evaluation Results
Automatic evaluation results show how the performance of each engine differs according to BLEU, chrF2, TER and COMET metrics when either a small or large corpus is used for fine-tuning. Table 3 provides the results of the automatic evaluation for each language pair.
System 1 is the baseline, System 2 is the baseline fine-tuned with the small corpus, and System 3 is the baseline fine-tuned with the larger corpus. As indicated in MATEO, p-values show the significance of the difference between a system and the baseline. The platform adds an asterisk (*) to indicate that a system differs significantly from the baseline model (p<0.05), and the best system per metric in the language pair is highlighted in bold. We use the same format. In the following three subsections, we report the results for each language pair.

English → Turkish MT Engines
The baseline English → Turkish MT engine has the lowest BLEU score of the nine engines in the study. However, when it was fine-tuned with the smaller localization corpus (en-tr-2), the BLEU score improved considerably from 23 to 49.6. When the large corpus was used for fine-tuning (en-tr-3), the score increased further. However, as may be observed from Table 3, although the size of the custom corpus is larger, the improvement from en-tr-2 to en-tr-3 remains modest. Similarly, in the case of chrF2, the score improves considerably when either a small or large corpus is introduced for fine-tuning. However, en-tr-2 has only a slightly higher (less than one point) score than en-tr-3 (68.1 vs 67.8, respectively). TER measures the fewest possible editing steps from the MT output to the human reference, so a lower score implies better-quality output; here, too, both the small and the large corpus improved scores considerably. Akin to the BLEU scenario, fine-tuning with the large corpus improved the score slightly compared to fine-tuning with the small corpus (44.2 vs. 42.8, respectively). COMET scores also imply a gradual improvement with the addition of in-domain corpora. Nevertheless, as can be inferred from Table 4, the difference between en-tr-2 and en-tr-3 is only significant in the COMET score and no significant change is observed in the other three metrics.
Table 4. A comparison of the en-tr-2 to en-tr-3 MT systems in terms of automatic evaluation metrics. This comparison shows the impact of increasing the fine-tuning corpus from approx. 500,000 source words to approx. 2,000,000 source words in this language pair.
These overall scores imply that for English → Turkish engines, even a small fine-tuning corpus can improve translation quality considerably, while a much larger fine-tuning corpus may improve quality only marginally when compared to the small corpus. Figure 2 provides a graphical depiction using all four evaluation metrics.

English → Spanish MT Engines
All automatic evaluation metric scores improved when localization corpora were introduced for fine-tuning in the English → Spanish language pair. However, unlike the English → Turkish engines, the addition of the small corpus did not lead to a significantly improved BLEU score. Yet, when the large corpus was used for fine-tuning, the score improved considerably in all four metrics. The BLEU score is 37.3 using the baseline engine while it is 38.5 for en-es-2 and 48.1 for en-es-3. Using chrF2, the baseline engine scored 66.6, en-es-2 70.0 and en-es-3 74.6. Using TER, the baseline engine has a score of 46.64 while it improves to 43.10 for the small corpus and to 38.12 in the large corpus scenario. COMET scores also suggest significant improvements over the baseline when further corpora are added.
Table 6. A comparison of the en-es-2 to en-es-3 MT systems in terms of automatic evaluation metrics. This comparison shows the impact of increasing the fine-tuning corpus from approx. 500,000 source words to approx. 2,000,000 source words in this language pair.
As may be seen in Table 6 and Figure 3, automatic comparison of the en-es-2 engine to the en-es-3 engine shows that fine-tuning with the large localization corpus has brought significant and considerable improvements across all metrics. As with the English → Turkish case, we filter the 52 sentences with tags and placeholders to analyze the behavior of the engines. Average COMET scores for each engine for these sentences are as follows: 71, 80 and 83. The baseline en-es-1 engine does not seem to keep the tags or placeholders correctly, with full or partial omissions or the introduction of different symbols such as "#". While en-es-2 and en-es-3 are more consistent in their treatment of tags and placeholders, they do not conserve the form of the tags or placeholders, often converting the opening "{" into "\". Even fine-tuning with a large localization corpus does not help to solve this issue in this language pair. The fact that the errors from Table 7 are consistent across the en-es-3 engine suggests that if this engine is used in a professional translation scenario, the error could be solved by a batch search-and-replace operation and the engine could still be useful.
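Such a batch search-and-replace could take the form of a small regular-expression repair followed by a consistency check. The sketch below is illustrative only and rests on assumptions: it treats placeholders as brace-delimited tokens such as {0}, and it assumes the error pattern is a backslash in place of the opening brace. The pattern names, function names, and sample strings are hypothetical, and the repair pattern could produce false positives on text that legitimately contains backslashes.

```python
import re

# Assumed placeholder form: a brace-delimited token such as {0} or {name}.
PLACEHOLDER = re.compile(r"\{[^{}]*\}")
# Assumed error form: a backslash where the opening brace should be, e.g. \0}
BROKEN_OPENING = re.compile(r"\\([^{}\\\s]*)\}")


def repair_placeholders(mt_output):
    """Restore the opening brace of placeholders that were output as \\name}."""
    return BROKEN_OPENING.sub(r"{\1}", mt_output)


def placeholders_match(source, target):
    """Check that source and target contain the same multiset of placeholders."""
    return sorted(PLACEHOLDER.findall(source)) == sorted(PLACEHOLDER.findall(target))


# Made-up example strings, not taken from the study's test set.
source = "Cannot open {0}"
mt_output = "No se puede abrir \\0}"
repaired = repair_placeholders(mt_output)
```

Running the consistency check after the repair flags any segment that still needs manual attention, so the automatic fix does not have to be trusted blindly.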

English → Catalan MT Engines
The English → Catalan baseline engine was fine-tuned with a different type of corpus than the others, as described in Section 3.1. Similarly to the English → Spanish engines, the large localization corpus leads to considerable improvement in all four metrics, while the small corpus brought a considerable improvement in BLEU (from 38 to 42.5) and TER (from 57.9 to 49.0) and did not lead to any considerable change in chrF2 (63.2 and 63.3 respectively) and COMET (84.3 to 84.6), as may be seen in Figure 4.
Table 8. A comparison of the en-ca-2 to en-ca-3 MT systems in terms of automatic evaluation metrics. This comparison shows the impact of increasing the fine-tuning corpus from approx. 500,000 source words to approx. 2,000,000 source words in this language pair.
The change from the en-ca-2 engine to en-ca-3 also offers significant improvement across all four metrics. This leads to the conclusion that, similar to the previous engines, English → Catalan performance improves with the addition of further localization data. Finally, we overview how English → Catalan engines perform in terms of tags and placeholders and filter the 32 sentences that include tags or placeholders. These segments' average COMET scores are as follows: 81.3, 83.4 and 85.6, as may be seen in Table 8. The en-ca-1 engine tends to translate placeholders and leave space between "\" and the tags. The en-ca-2 engine is inconsistent in terms of translating or retaining the placeholders or tags and, finally, the en-ca-3 engine outputs the tags and placeholders correctly but sometimes continues to translate the placeholders as in the second example in Table 9 ("secció").

Human Evaluation
For each language pair, the test set of 210 segments translated by each of the three engines was evaluated by three reviewers using the KantanLQR interface. In total, each reviewer rated 630 output sentences. The ranking task used a scale of 3 while the adequacy and fluency tasks used a scale of 5. It was possible to calculate a percentage score for each metric by comparing the total score obtained by an engine against the total possible score. For instance, in the case of ranking, the total possible score is 1890 (reviewer count: 3 × highest score: 3 × segment count: 210). These are represented in percentages in Table 10.
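The percentage and average scores reported in the following subsections follow directly from this arithmetic. As a minimal sketch (the function name `score_summary` and its parameter names are ours, not part of KantanLQR), the calculation can be reproduced as follows, using the en-tr-1 ranking total and the en-es-1 fluency total as examples:

```python
def score_summary(total_points, max_score, reviewers=3, segments=210):
    """Return the percentage of the maximum score and the average score per rating."""
    ratings = reviewers * segments    # 630 ratings per engine
    max_points = ratings * max_score  # 1890 for ranking, 3150 for adequacy/fluency
    return {
        "percentage": round(100 * total_points / max_points, 2),
        "average": round(total_points / ratings, 2),
    }


# en-tr-1 ranking: 1277 points out of a possible 1890.
ranking_en_tr_1 = score_summary(1277, max_score=3)
# {'percentage': 67.57, 'average': 2.03}

# en-es-1 fluency: 2582 points out of a possible 3150.
fluency_en_es_1 = score_summary(2582, max_score=5)
```

The percentage normalizes by the scale maximum, so ranking (scale of 3) and adequacy or fluency (scale of 5) can be read side by side, whereas the averages remain on their original scales.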

English → Turkish engines
When the three Turkish reviewers evaluated the output from the three engines, in the ranking task the en-tr-1 engine attained 67.57% (1277/1890 14 ); en-tr-2 attained 77.94% (1473/1890); and en-tr-3 attained 80.63% (1524/1890) of the overall score. Average ranking scores for each engine are as follows: 2.03, 2.34 and 2.42, respectively. According to these scores, the three reviewers rated en-tr-3 as the best performing engine while en-tr-1 obtained the lowest average score.
Table 11 shows the score averages for each evaluation task and enables investigating the impact of fine-tuning by data augmentation. These average scores suggest that when fine-tuning is conducted by the addition of a bilingual localization corpus, the overall quality improves considerably in this language pair. However, while the improvement from en-tr-1 to en-tr-2 or from en-tr-1 to en-tr-3 appears dramatic, the quality increase from en-tr-2 to en-tr-3 seems to be minimal. The en-tr-3 engine is fine-tuned with a bigger corpus than en-tr-2; yet this augmentation does not result in an engine with markedly higher quality. In the specific case of this language pair and domain, we can infer two implications. Firstly, continued data augmentation does not necessarily increase quality consistently and there may be a plateau after a certain amount of fine-tuning data. Secondly, even a parallel corpus as small as 500,000 source words can provide enough quality improvement to justify the use of the desktop OPUS-CAT MT application with fine-tuning. In the KantanLQR evaluation framework, an adequacy score of 4.06 and a fluency score of 4.17 (on a scale of 5) may be considered good enough to justify the use of MT for a particular use case.

English → Spanish engines
The three Spanish evaluators preferred the en-es-3 engine in all three evaluation tasks. In the ranking task, en-es-1 attained 75.03% (1418/1890), en-es-2 77.25% (1460/1890) and en-es-3 80.21% (1516/1890) of the overall score. The average score for each engine is as follows: 2.25, 2.32 and 2.41. The en-es-3 engine ranks as the best engine while en-es-1 ranks as the worst.
The en-es-3 engine has the highest adequacy score while en-es-1 has the lowest one.
Finally, the fluency scores of each engine are 81.97% (2582/3150), 83.02% (2615/3150) and 84.38% (2658/3150). The average fluency scores are 4.10, 4.15 and 4.22 respectively. The en-es-3 engine obtained the highest score while en-es-1 obtained the lowest score. Table 12 summarizes the average scores for this language pair. The en-es-3 engine obtained the highest score in all metrics. The human evaluation scores seem to improve gradually from en-es-1 to en-es-3 through data augmentation, as also evidenced by automatic metrics in Figure 3. The improvement from en-es-1 to en-es-2 in all metrics seems to be modest when compared to the en-tr engines. Nonetheless, the evolution of the improvements suggests that there is still room for further improvement through the addition of further corpora. Moreover, the average adequacy and fluency scores of the baseline engine are above four, which indicates that this engine can already provide reasonably good quality results.

English → Catalan engines
Lastly, the fluency percentages of each engine are 73.90% (2328/3150), 76.79% (2419/3150), and 80.25% (2528/3150). Average fluency scores were 3.70, 3.84 and 4.01. According to these results, en-ca-3 is the most fluent engine while en-ca-1 is the least fluent. The average scores may be seen in Table 13. Considering the overall results for this language pair, en-ca-3 has the highest scores in all three metrics while en-ca-1 has the lowest except for the adequacy metric. In this metric, en-ca-1 and en-ca-2 share the same score, which indicates that the addition of 500,000 source words did not help improve adequacy. However, data augmentation with 2,000,000+ source words seems to improve quality, since the fluency and adequacy scores rose above four after fine-tuning with the larger corpus.

Agreement between Reviewers
In this subsection, we first focus on the evaluation results per reviewer. Then we share the agreement percentages per engine and aggregated inter-annotator agreement scores per language pair based on Fleiss' Kappa (Fleiss 1971).
For individual reviewer ratings, we report both the percentage scores and the average scores per evaluation metrics.The tables for each language pair are presented in Annex II.Highest scores are highlighted in bold.
For the English→Turkish language pair, the scores given by all three reviewers are compatible with the overall average scores.All of them gave the highest scores to en-tr-3 and the lowest one to en-tr-1 in all three metrics for each engine (see Annex II).
In the English→Spanish language pair, fluency scores agree with the overall results insofar as all three reviewers rank en-es-3 as the most fluent engine. However, in the case of ranking and adequacy, Reviewer 2 gave the same scores for en-es-2 and en-es-3 while the others ranked en-es-3 as the best performing engine.
In the English→Catalan language pair, all three reviewers gave the highest ranking, adequacy and fluency scores to en-ca-3. This implies that there is an overall agreement between the reviewers on the performance of the three engines.
After the individual ratings by the reviewers, we focus on the agreement rates between the reviewers to check the consistency among them. Table 14 shows the percentage of agreement in the ratings in the three evaluation tasks across three MT engines per language pair. Table 15 provides the aggregated Fleiss' Kappa scores for each language pair and evaluation type. In the following paragraphs, we share the key findings from these two tables per language pair.
English→Catalan: This pair has the highest agreement scores for ranking (47.14%, 44.76%, 46.67%) when compared to the other two language pairs. However, the agreement scores for adequacy (10.95%, 14.29%, 20.00%) and fluency (9.05%, 10.48%, 16.19%) are much lower. The Fleiss' Kappa scores reflect a moderate agreement for ranking (0.276) but lower for adequacy (0.211) and fluency (0.155).
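For reference, both kinds of agreement figure can be computed from the raw ratings with a few lines of code. The sketch below implements the standard Fleiss' Kappa formula (Fleiss 1971); the interpretation of the agreement percentage as the share of segments on which all three reviewers gave an identical score is our assumption, and the function names and toy ratings are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' Kappa for a list of items, each a list with one label per rater.

    Assumes every item has the same number of raters and that the raters
    use more than one category overall (otherwise the formula divides by zero).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall share of each category.
    p_cats = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p ** 2 for p in p_cats)
    return (p_bar - p_e) / (1 - p_e)


def full_agreement_rate(ratings):
    """Percentage of items on which all raters gave the same label."""
    return 100 * sum(len(set(item)) == 1 for item in ratings) / len(ratings)


# Toy adequacy ratings from three reviewers on four segments (hypothetical).
toy = [[1, 1, 1], [1, 1, 2], [2, 2, 2], [1, 2, 2]]
kappa = fleiss_kappa(toy, categories=range(1, 6))
agreement = full_agreement_rate(toy)
```

Kappa corrects the observed agreement for the agreement expected by chance, which is why a five-point scale with ratings concentrated on a few values can show low kappa even when reviewers rarely differ by more than one point.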

Discussion and Limitations
The automatic and human evaluations seem to be compatible in terms of the best engines in each language pair when pre-trained NMT models are fine-tuned with localization corpora. Both types of evaluation also suggest that data augmentation has led to a logarithmic-like improvement in English→Turkish output. However, in the case of English→Spanish and English→Catalan, the improvements from each addition were incremental.
The fact that there were improvements suggests that there is still room for more data augmentation in all language pairs and fits with the report by Schwartz et al. (2020) that as data sizes increase, added tranches of data become less effective. In the case of English→Catalan, the impact of adding a large corpus has been bigger than in the case of English→Spanish.
If we assume that a score of over 4 in fluency and adequacy ratings justifies the use of fine-tuning in OPUS-CAT for professional translators (according to the definitions for each score in KantanLQR; see Table 2), we can make the following arguments based on our localization domain: i) In an English→Turkish localization project, an engine fine-tuned with a small corpus of 500,000 source words may already provide mostly adequate and quite fluent translation results. A further augmentation of the fine-tuning data to 2,000,000+ source words yields a slight improvement in translation quality; ii) In an English→Spanish localization project, the baseline, a pretrained NMT engine from OPUS-MT, already provides scores over 4 in adequacy and fluency; yet, the quality can be further increased incrementally with the inclusion of 500,000 or 2,000,000 source words of fine-tuning corpora; and iii) In an English→Catalan localization project, fine-tuning with 500,000 source words will not be enough to reach a score of 4 in adequacy and fluency. When fine-tuning is performed with 2,000,000+ source words, the quality surpasses the score of 4 very slightly, but this result hints that a further quality improvement can be achieved through the addition of more fine-tuning data. However, the results in this language pair may be influenced by the quality of the fine-tuning data that was used, since the data comes from multiple sources and is therefore expected to be of a more diverse nature. One limitation of the fine-tuning carried out in English→Catalan is that the fine-tuning corpus consisted of localization strings from different open-source projects and the evaluation test data was from a Microsoft project, unlike the other language pairs, which were fine-tuned and evaluated with Microsoft corpora.
Aside from the aforementioned limitation, our study has other limitations regarding fine-tuning in OPUS-CAT as well as evaluation, resulting from methodological preferences. Nieminen (2021: 214) states that it is possible to continue fine-tuning for multiple epochs and modify learning rates. However, these modifications can increase fine-tuning durations and may also lead to overfitting, as observed by the author. Long durations of fine-tuning may not be optimal for professional translators who need to create an engine rapidly for a translation project with a tight deadline. Overfitting leads to engines that memorise the training data at the expense of losing generalisation capabilities, resulting in low-quality translations. Further studies can be made using the same corpus and changing fine-tuning settings in each iteration. When it comes to evaluation limitations, localization strings are usually short and context-dependent; one string may have multiple translations depending on the context. We utilized one single test file to maximise overall consistency, but we removed in-file and cross-file repetitions, which constrained the context of the file. In our evaluation design, reviewers only saw the source string with its three possible translations within KantanLQR. We increased the number of test strings to compensate for this limitation.
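The removal of in-file and cross-file repetitions mentioned above can be sketched as follows. This is a hypothetical illustration rather than the procedure actually applied in the study: the function names are invented, "cross-file" is assumed to mean strings that also occur in other files (for example, the fine-tuning corpus), and matching is assumed to be exact after whitespace normalisation.

```python
def normalize(segment):
    """Collapse runs of whitespace so trivially different strings compare equal."""
    return " ".join(segment.split())


def remove_repetitions(test_segments, other_file_segments):
    """Return test segments with in-file and cross-file repetitions removed.

    The first occurrence of each string in the test file is kept (in original
    order); strings that also appear in `other_file_segments` are dropped.
    """
    seen = {normalize(s) for s in other_file_segments}  # cross-file strings
    unique = []
    for segment in test_segments:
        key = normalize(segment)
        if key and key not in seen:
            seen.add(key)  # later copies in the same file count as repetitions
            unique.append(segment)
    return unique


# Hypothetical interface strings.
test_file = ["Save", "Cancel", "Save", "  Open  file ", "Open file", "Delete"]
other_files = ["Cancel"]
print(remove_repetitions(test_file, other_files))
```

Filtering of this kind keeps the test set free of strings the engine may have seen during fine-tuning, at the cost of breaking up the running context of the original file.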

Conclusion
Our study showed that it is possible to achieve significant translation quality improvements over pretrained NMT models in three language pairs fine-tuned with specific domain corpora. These results were achieved in a desktop Windows environment without the need to connect to an external server. While it should be noted that fine-tuning does not necessarily guarantee this outcome in every corpus size and type, it can be argued that when confidentiality and privacy are of high concern, fine-tuned, desktop-based engines can be a viable alternative to commercial systems for professional translators. Furthermore, the use of pretrained NMT engines removes the need for costly MT training and provides a more environmentally friendly alternative since less energy is consumed.
Applications such as OPUS-CAT and pretrained models from the OPUS-MT project lay the foundation for a future where the translator is not only a passive user of MT systems but also an empowered professional who is able to make informed decisions about how and when to use MT in their workflow. This removes an element of control from the client or translation employer, but also removes the client-side cost of MT training and preparation.
The availability of free and open desktop MT applications as well as pretrained models can potentially empower translators. Moreover, when and if a translator can combine these technologies with high-quality, domain-specific TMs, productivity gains can be increased. Hence, having free and open domain-specific parallel corpora in a wide range of language pairs is essential.
Extending the capabilities of OPUS-CAT (and other similar future applications) to include operating systems other than Windows will be important as well. Finally, adding more features to OPUS-CAT and other similar toolkits can lead to other productive ways of using desktop MT.
In the future, we would like to compare our best engines with commercial systems such as Google Translate to study their relative quality. Another line of study could be to train professional translators to use OPUS-CAT and collect data about their perceptions about the usability of the application within their professional workflows.
to participate in the evaluation: Muhammed Baydere, Emre Canbaz, Şevval Mutlu, Eden Y. Bravo Montenegro, Mer Moreno, Ariana López Pereira, Xènia Amorós, Eduard Simón and Marc Riera. We are also grateful to KantanAI for providing us access to their platform for research purposes.

Annex I. Task Instructions for Reviewers
Guidelines for performing the Machine 11.2022 - 14.11.2022
Task Guidelines
1. Please fill out the short survey aiming at collecting professional details of the participants. It should take less than 3 minutes to complete. Form link is here: https://forms.gle/kiA51ehucXzcPvtb9
2. Once you complete the survey, we will send you a link to your email address to connect to KantanLQR platform. You will need to enter with your email and create a password (if you haven't done so before).
3. Once you login to the platform, you will see the dashboard with the task. You should first click on the "?" to accept the task. Once you accept the task, you can click on the pen symbol and begin the evaluation task. The strings are from the interface of Skype web application. In case anything is not clear, you can use the comment section to write your comment. But it is not obligatory.
4. In the upper left corner, you will see the source sentence, and below it there will be its 3 machine translations by different MT engines. The order of these translations is randomized in each step to avoid bias.
5. There are 2 evaluation criteria: adequacy and fluency. In a nutshell, adequacy measures the accuracy of the translation compared to the source sentence while fluency measures how grammatically correct the translation. You will have a scale of 5 stars. More stars mean better adequacy or fluency. The "i" symbol near each title gives hints about the meaning of each star and the respective definitions. See the image (i) below.
6. Finally, you are expected to rank each engine from the best to the worst. Again, more stars mean better quality. Hence, the best translation result should get 3 stars while the worst one should get 1 star. Note that if you think two engines are equal, you can assign the same number of stars to them.
7. You can press "Finish" to pause and leave the task before finishing and return later to complete it. There are 210 mostly very short source sentences to be evaluated.
8. If you need more help about the task, please send an email to gokhan.dogru@uab.cat.
9. A simulation of the steps from the reviewer's perspective are also available in a video: https://drive.google.com/file/d/1bNbTVUhdvJDvenVryHuHIyFjoUFkgBh6/view?usp=sharing
(i) Image: Tips for understanding Adequacy and Fluency available in KantanLQR

Figure 1. A snapshot of the evaluation screen.
2-Little of the source fragment meaning is expressed in the translation fragment.
3-Much of the source fragment meaning is expressed in the translation fragment.
4-Most of the source fragment meaning is expressed in the translation fragment.
5-All meaning expressed in the source fragment appears in the translation fragment.

1-No fluency. Absolutely ungrammatical and for the most part doesn't make any sense. Translation has to be rewritten from scratch.
2-Little fluency. Wrong word choice, poor grammar and syntactic structure. A lot of post-editing required.
3-Quite fluent. About half of the translation contains errors and requires post-editing.
4-Near native fluency. Few terminology or grammar errors which don't impact the overall understanding of the meaning. Little post-editing required.
5-Native language fluency. No grammar errors, good word choice and syntactic structure. No post-editing required.

Figure 2. Comparison of the automatic evaluation scores in English → Turkish.

Figure 3. Comparison of the automatic evaluation scores in English → Spanish.

Figure 4. Comparison of the automatic evaluation scores in English → Catalan.

Table 5. A selection of the sentences with tags or placeholders translated by three different English → Turkish systems.
Table 5 for a few examples. This analysis shows that the further addition of a large corpus in the localization domain can provide better handling of tags and placeholders.

Table 11. English → Turkish human evaluation results from three reviewers. Best score per metric shown in bold.
Scores for En → Tr Baseline Small Corpus Large Corpus

Table 15. Fleiss' Kappa scores per human evaluation type and language pair.
Translation Evaluation Task in the Localization Domain
The task consists of evaluating the translation quality of 3 Machine Translation Engines in 3 language pairs by human reviewers. The ultimate objective of the study is to measure if a desktop MT application (OpusCAT) finetuned with different sizes of custom localization corpus can provide significant translation quality to make it useful for translators.
Facultat de Traducció i d'Interpretació, Universitat Autònoma de Barcelona - School of Applied Language & Intercultural Studies, Dublin City University