Oana Guzman: A Low-Resource Language Expert
Oana Guzman is a computer scientist and a researcher in the field of natural language processing (NLP). She is one of the authors of a recent paper titled “OCR Improves Machine Translation for Low-Resource Languages”, which was accepted at ACL Findings 2022.
In this paper, Guzman and her co-authors investigate how well current optical character recognition (OCR) systems perform on low-resource languages and low-resource scripts. They introduce and publicly release a novel benchmark, OCR4MT, consisting of real and synthetic data, enriched with noise, for 60 low-resource languages in low-resource scripts. They evaluate state-of-the-art OCR systems on the benchmark and analyse the most common errors. They also show that OCR-extracted monolingual data is a valuable resource that can improve the performance of machine translation models when used for backtranslation.
Guzman’s research interests include machine translation, low-resource languages, multilingual NLP, and OCR. She is also active on social media platforms such as Facebook, where she shares her work and connects with other researchers and enthusiasts.
Guzman’s co-authors are also experts in NLP and machine translation. Jean Maillard is a research scientist at Meta AI, where he works on multilingual NLP and low-resource languages. Vishrav Chaudhary is a senior applied scientist at Microsoft Turing, where he leads the development of machine translation models for low-resource languages and domains. Francisco Guzmán is a research scientist at Meta AI, where he focuses on machine translation, multilingual NLP, and data mining.
Backtranslation is a technique that uses a machine translation model to generate synthetic parallel data from monolingual data. For example, if we want to improve a Nepali-English translation model, we can use an existing English-Nepali model to translate monolingual English data into Nepali, and then use the resulting Nepali-English pairs as additional training data for the original model. Backtranslation has been shown to be effective for improving machine translation quality, especially for low-resource languages (Sennrich et al., 2016; Edunov et al., 2018; Zhang et al., 2020).
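The backtranslation loop described above can be sketched in a few lines of Python. The dictionary-based `toy_model` below is a hypothetical stand-in for a real English-Nepali translation system, and the function names are illustrative, not taken from any of the cited papers:

```python
def back_translate(monolingual_english, english_to_nepali):
    """Generate synthetic parallel data from monolingual target-side text.

    Each real English sentence is translated into Nepali by the reverse
    model; the (synthetic Nepali, real English) pairs can then be mixed
    into the training data of the Nepali->English model.
    """
    pairs = []
    for english_sentence in monolingual_english:
        synthetic_nepali = english_to_nepali(english_sentence)
        pairs.append((synthetic_nepali, english_sentence))
    return pairs

# Toy stand-in for an existing English->Nepali model: a dictionary lookup.
# A real setup would call a trained MT system here.
toy_model = {"hello": "नमस्ते", "thank you": "धन्यवाद"}.get
synthetic_pairs = back_translate(
    ["hello", "thank you"],
    lambda s: toy_model(s, "<unk>"),
)
```

Note that the target side of each synthetic pair is genuine English text, which is why backtranslation tends to help: the model learns to produce fluent output even when the synthetic source side is noisy.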
OCR for low-resource scripts poses many challenges, such as the lack of standardization, the diversity of writing systems, the presence of diacritics and ligatures, the variation in fonts and layouts, and the scarcity of labeled data and evaluation metrics (Smith, 2007b; Wick et al., 2020). These challenges make it difficult to apply existing OCR models, which are mostly trained and tested on high-resource languages and scripts, such as English and Latin.
To address these challenges, Guzman and her co-authors propose OCR4MT, a new benchmark that covers 60 low-resource languages in 10 low-resource scripts: Arabic, Bengali, Devanagari, Ethiopic, Georgian, Greek, Hebrew, Khmer, Mongolian, and Thai. The benchmark consists of both real and synthetic data, with different levels of noise and degradation. The real data is collected from various online sources, such as books, newspapers, magazines, and websites. The synthetic data is generated by applying random transformations and distortions to the real data.
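To give a flavour of the kind of noise injection a synthetic split relies on, here is a minimal salt-and-pepper sketch on a binary pixel grid. This is only an illustration of the idea; the paper's actual transformations operate on real page images and are considerably richer, and all names here are my own:

```python
import random

def degrade(image, flip_prob=0.05, seed=0):
    """Flip each binary pixel with probability flip_prob.

    A rough stand-in for scan artefacts (speckle, faded ink) of the kind
    a synthetic OCR benchmark simulates; seeding makes the noise
    reproducible across runs.
    """
    rng = random.Random(seed)
    return [
        [1 - px if rng.random() < flip_prob else px for px in row]
        for row in image
    ]

# A tiny "glyph" as a binary pixel grid.
clean = [[0, 1, 1, 0],
         [0, 1, 1, 0]]
noisy = degrade(clean, flip_prob=0.3)
```

Generating many degraded copies of the same clean text gives labelled training or evaluation pairs for free, since the ground-truth transcription is known by construction.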
The authors evaluate four state-of-the-art OCR systems on their benchmark: Tesseract (Smith et al., 2009), EasyOCR (Anuwongcharoen et al., 2020), the Google Cloud Vision API, and the Microsoft Azure Computer Vision API. They report the character error rate (CER) and the word error rate (WER) for each system on each language and script. They also analyse the most common types of errors made by each system, such as insertion, deletion, substitution, segmentation, and diacritic errors.
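Both CER and WER are edit distances normalised by the length of the reference transcription, counting exactly the insertion, deletion, and substitution errors mentioned above. A minimal implementation (function names are my own, not from the paper's code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def error_rate(reference, hypothesis, unit="char"):
    """CER (unit='char') or WER (unit='word'): edits / reference length."""
    ref = list(reference) if unit == "char" else reference.split()
    hyp = list(hypothesis) if unit == "char" else hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

cer = error_rate("hello world", "helo world")               # 1 deletion / 11 chars
wer = error_rate("hello world", "helo world", unit="word")  # 1 wrong word / 2 words
```

Because CER operates on characters, it is the more forgiving of the two for scripts with long words or ambiguous word boundaries, which is one reason both metrics are typically reported side by side.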