Publications
Peer-reviewed Papers
Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Nikita Moghe, Amanda Myntti, Dayyán O'Brien, Stephan Oepen, Proyag Pal, Jousia Piha, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Tereza Vojtěchová, and Jaume Zaragoza-Bernabeu.
Published at ACL 2025.
[PDF] [ACL Anthology] [arXiv] [BibTeX]
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending the prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
Proyag Pal, Alexandra Birch, and Kenneth Heafield.
Published at ACL 2024.
[PDF] [ACL Anthology] [Poster] [BibTeX]
Despite the fact that document-level machine translation has inherent advantages over sentence-level machine translation due to additional information available to a model from document context, most translation systems continue to operate at a sentence level. This is primarily due to the severe lack of publicly available large-scale parallel corpora at the document level. We release a large-scale open parallel corpus with document context extracted from ParaCrawl in five language pairs, along with code to compile document-level datasets for any language pair supported by ParaCrawl. We train context-aware models on these datasets and find improvements in terms of overall translation quality and targeted document-level phenomena. We also analyse how much long-range information is useful to model some of these discourse phenomena and find models are able to utilise context from several preceding sentences.
Proyag Pal, Brian Thompson, Yogesh Virkar, Prashant Mathur, Alexandra Chronopoulou, and Marcello Federico.
Published at INTERSPEECH 2023.
[PDF] [ISCA Archive] [arXiv] [BibTeX]
To translate speech for automatic dubbing, machine translation needs to be isochronous, i.e. translated speech needs to be aligned with the source in terms of speech durations. We introduce target factors in a transformer model to predict durations jointly with target language phoneme sequences. We also introduce auxiliary counters to help the decoder to keep track of the timing information while generating target phonemes. We show that our model improves translation quality and isochrony compared to previous work where the translation model is instead trained to predict interleaved sequences of phonemes and durations.
Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Luisa Bentivogli, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Kevin Duh, Yannick Estève, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Kr. Ojha, John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, and Rodolfo Zevallos.
Published at IWSLT 2023.
[PDF] [ACL Anthology] [BibTeX]
This paper reports on the shared tasks organized by the 20th IWSLT Conference. The shared tasks address 9 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, multilingual, dialect and low-resource speech translation, and formality control. The shared tasks attracted a total of 38 submissions by 31 teams. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.
Proyag Pal and Kenneth Heafield.
Published at EACL (Findings) 2023.
[PDF] [ACL Anthology] [BibTeX]
We identify hard problems for neural machine translation models by analyzing progressively higher-scoring translations generated by letting models cheat to various degrees. If a system cheats and still gets something wrong, that suggests it is a hard problem. We experiment with two forms of cheating: providing the model a compressed representation of the target as an additional input, and fine-tuning on the test set. Contrary to popular belief, we find that the most frequent tokens are not necessarily the most accurately translated due to these often being function words and punctuation that can be used more flexibly in translation, or content words which can easily be paraphrased. We systematically analyze system outputs to identify categories of tokens which are particularly hard for the model to translate, and find that this includes certain types of named entities, subordinating conjunctions, and unknown and foreign words. We also encounter a phenomenon where words, often names, which were not infrequent in the training data are still repeatedly mistranslated by the models — we dub this the Fleetwood Mac problem.
Proyag Pal and Kenneth Heafield.
Published at NAACL 2022.
[PDF] [ACL Anthology] [Poster] [BibTeX]
This paper describes a method to quantify the amount of information H(t|s) added by the target sentence t that is not present in the source s in a neural machine translation system. We do this by providing the model the target sentence in a highly compressed form (a "cheat code"), and exploring the effect of the size of the cheat code. We find that the model is able to capture extra information from just a single float representation of the target and nearly reproduces the target with two 32-bit floats per target token.
Proyag Pal, Alham Fikri Aji, Pinzhen Chen, and Sukanta Sen.
Published at WMT21 at EMNLP 2021.
[PDF] [ACL Anthology] [Poster] [BibTeX]
We describe the University of Edinburgh’s Bengali↔Hindi constrained systems submitted to the WMT21 News Translation task. We submitted ensembles of Transformer models built with large-scale back-translation and fine-tuned on subsets of training data retrieved based on similarity to the target domain. For both translation directions, our submissions are among the best-performing constrained systems according to human evaluation.
Theses
Proyag Pal
PhD Thesis (University of Edinburgh) 2024.
[PDF]
Neural Machine Translation (MT) has long been established as a successful paradigm to produce high-quality MT across many languages and domains. However, it suffers from one significant limitation – it is too often formulated as a task of translating isolated sentences in a source language into sentences in the target language. This renders standard MT models unable to capture any information that is not in the sentence, such as document context, speaker information, the domain of the text, external constraints etc. This thesis aims to study this limitation, analyse the shortcomings of sentence-level MT, and present some approaches to enrich MT models to overcome this limitation. The first part of this thesis introduces a method to quantify the amount of information missing from source sentences that is needed to translate them perfectly. This method is called “cheat codes” and it allows us to establish an upper bound on the amount of additional information that the model needs to be provided to be able to exactly reproduce reference translations. We find that a surprisingly small amount of leaked information about the target in addition to the source is enough to achieve this. We also use this method to study what parts of translation are difficult for these models to learn correctly, even in the presence of extra information. This analysis allows us to signpost some hard problems for neural MT for further research to focus on. The second part of the thesis presents two examples of how MT can be augmented with extra information to improve translation quality or overall user experience in specific applications. The first example is using document context, which is always used by human translators when translating text, but is rarely present in parallel corpora. We extract and publish a large-scale dataset of parallel sentences with corresponding contexts from existing publicly available resources, and show that this data helps improve translation performance in terms of overall quality as well as specific document-level phenomena. The second example is providing timing constraints to an isochronous MT model for use in automatic dubbing. By incorporating duration information and keeping track of it while translating, the model can produce translations that better match the source audio, which eventually results in a better user experience when viewing the automatically dubbed content. On the whole, we find that even though a relatively small amount of information is missing from sentence-level MT, enriching the models with these small pieces of information can have a significant positive impact on the quality and usefulness of MT systems in a wide variety of situations. We provide detailed analyses, datasets, and methods to build better MT systems and encourage future research in this direction.
Proyag Pal
MSc Thesis (University of Edinburgh) 2017.
[PDF]
Neural Machine Translation in its current form suffers from some problems such as loss-evaluation mismatch and exposure bias. Reward Augmented Maximum Likelihood is a technique that directly incorporates task evaluation metrics such as the BLEU score into the traditional maximum likelihood training framework. This is done by augmenting the training output targets with outputs that are sampled proportional to their exponentiated scaled rewards, on which cross-entropy is optimised. This is a more computationally efficient method than reinforcement learning-based methods and can be trained effectively from a cold start, without bootstrapping with a cross-entropy trained model. This project implements Reward Augmented Maximum Likelihood in the Nematus neural machine translation framework, and observes significant improvements in BLEU score over a model trained to optimise perplexity.
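The sampling step at the heart of Reward Augmented Maximum Likelihood can be sketched as follows. This is a minimal illustration, not the thesis's Nematus implementation: in practice the candidates would be, for example, edits of the reference sentence scored by a reward such as sentence-level BLEU, and the sampled outputs replace the reference as cross-entropy targets.

```python
import math
import random

def raml_sample(candidates, rewards, tau=0.8, rng=random):
    """Sample one augmented training target with probability
    proportional to exp(reward / tau), where tau is the RAML
    temperature controlling how peaked the distribution is."""
    # Shift by the max reward for numerical stability; this does
    # not change the normalised distribution.
    m = max(rewards)
    weights = [math.exp((r - m) / tau) for r in rewards]
    # Draw one candidate in proportion to its exponentiated,
    # scaled reward; cross-entropy is then optimised against the
    # sampled target rather than the reference alone.
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With a low temperature the sampling concentrates on the highest-reward candidate (recovering ordinary maximum likelihood on the reference); higher temperatures spread probability mass over lower-reward outputs.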