Machine Learning Improves Arabic Speech Transcription Capabilities


Thanks to advances in speech and natural language processing, there is hope that one day you will be able to ask your virtual assistant what the best salad ingredients are. For now, it is possible to ask your home device to play music or open by voice command, a feature already found on many devices.

If you speak Moroccan, Algerian, Egyptian, Sudanese, or any of the other dialects of the Arabic language, which vary immensely from region to region and are sometimes mutually unintelligible, the story is different. If your native language is Arabic, Finnish, Mongolian, Navajo, or any other language with a high level of morphological complexity, you may feel left out.

These complex constructions inspired Ahmed Ali to find a solution. He is a principal engineer in the Arabic Language Technologies group at the Qatar Computing Research Institute (QCRI), part of the Qatar Foundation’s Hamad Bin Khalifa University, and founder of ArabicSpeech, a “community that exists for the benefit of Arabic speech science and speech technologies.”

Qatar Foundation Headquarters

Ali was captivated by the idea of talking to cars, appliances, and devices many years ago while at IBM. “Can we build a machine capable of understanding different dialects: an Egyptian pediatrician automating a prescription, a Syrian teacher helping children get through the core parts of their lesson, or a Moroccan chef describing the best couscous recipe?” he says. However, the algorithms that power those machines cannot yet sift through the roughly 30 varieties of Arabic, let alone make sense of them. Today, most speech recognition tools work only in English and a handful of other languages.

The coronavirus pandemic has further fueled an already intensifying reliance on voice technologies, as natural language processing has helped people comply with stay-at-home guidelines and physical distancing measures. However, while we have been using voice commands to help with e-commerce purchases and to manage our households, the future holds even more applications.

Millions of people around the world use massive open online courses (MOOCs) for their open access and unlimited participation. Speech recognition is one of the main features of MOOCs, letting students search within the spoken content of courses and enabling translation through subtitles. Speech technology also allows lectures to be digitized so that spoken words are displayed as text in university classrooms.

Ahmed Ali, Hamad Bin Khalifa University

According to a recent article in Speech Technology magazine, the voice and speech recognition market is forecast to reach $26.8 billion by 2025, as millions of consumers and companies around the world come to rely on voice bots not only to interact with their appliances or cars but also to improve customer service, drive health-care innovations, and improve accessibility and inclusion for those with hearing, speech, or motor impairments.

In a 2019 survey, Capgemini predicted that by 2022 more than two in three consumers would opt for voice assistants over visits to stores or bank branches; a share that could justifiably rise, given the homebound, physically distanced life and commerce that the pandemic has imposed on the world for more than a year and a half.

However, these devices fail to reach large swaths of the world. For the millions of people who speak those 30 or so varieties of Arabic, that is a substantially missed opportunity.

Arabic for machines

Voice bots that speak English or French are far from perfect. Even so, teaching machines to understand Arabic is particularly tricky, for several reasons. These are three commonly recognized challenges:

  1. Lack of diacritics. Arabic dialects are vernaculars, primarily spoken rather than written. Most of the available text is undiacritized, meaning it lacks accents such as the acute (´) or grave (`) that indicate the sound values of letters. Therefore, it is difficult to determine where the vowels go (see the sketch after this list).
  2. Lack of resources. There is a dearth of labeled data for the different Arabic dialects. Collectively, they also lack standardized orthographic rules that dictate how to write the language, including norms for spelling, hyphenation, word breaks, and emphasis. These resources are crucial for training computer models, and the fact that so few exist has hampered the development of Arabic speech recognition.
  3. Morphological complexity. Arabic speakers engage in a lot of code-switching. For example, in areas colonized by the French (North Africa: Morocco, Algeria, and Tunisia), the dialects include many borrowed French words. Consequently, there is a high number of so-called out-of-vocabulary words, which speech recognition technologies cannot fathom because those words are not Arabic.
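To make the first challenge concrete, here is a minimal sketch in Python (the helper name and sample word are purely illustrative, not from QCRI) of why undiacritized text is hard. Arabic diacritics are Unicode combining marks, so most real-world text effectively arrives with them stripped away:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Drop Arabic diacritics (harakat), which are Unicode combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Fully vowelled "kataba" (he wrote): kaf + fatha, ta + fatha, ba + fatha
vowelled = "\u0643\u064e\u062a\u064e\u0628\u064e"  # كَتَبَ
print(strip_diacritics(vowelled))                  # كتب

# The stripped form is ambiguous: it could be read as kataba (he wrote),
# kutiba (it was written), or kutub (books). A recognizer must infer the
# vowels from context, which is exactly the difficulty described above.
```

The same ambiguity plays out at scale across entire corpora of undiacritized training text.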

“But the field is moving at lightning speed,” says Ali. It is a collaborative effort among many researchers to make it move even faster. Ali’s Arabic Language Technologies lab is leading the ArabicSpeech project to bring together Arabic translations with the dialects native to each region. For example, Arabic dialects can be divided into four regional groups: North African, Egyptian, Gulf, and Levantine. However, because dialects do not respect boundaries, the breakdown can be as fine-grained as one dialect per city; a native Egyptian speaker, for example, can distinguish their Alexandrian dialect from that of a fellow citizen from Aswan, a distance of 1,000 kilometers on the map.

Building a tech-savvy future for everyone

At this point, machines are about as accurate as human transcriptionists, thanks in large part to advances in deep neural networks, a subfield of machine learning in artificial intelligence that relies on algorithms inspired by how the human brain works, biologically and functionally. Until recently, however, speech recognition has been somewhat hacked together: the technology has a history of relying on separate modules for acoustic modeling, building pronunciation lexicons, and language modeling, all of which need to be trained separately. More recently, researchers have been training models that convert acoustic features directly into text transcriptions, potentially optimizing all the parts for the end task.
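As a rough sketch of what that end-to-end shift looks like (a toy model in Python/PyTorch, not QCRI's actual system; the feature size, character-set size, and network shape here are assumptions for illustration), a single network can map acoustic features straight to character probabilities and be trained with a CTC loss, so every stage is optimized together for the final transcript:

```python
import torch
import torch.nn as nn

class TinyE2EASR(nn.Module):
    """Toy end-to-end model: acoustic features -> character probabilities.
    One network replaces separate acoustic, lexicon, and language modules."""
    def __init__(self, n_features=80, n_chars=40, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_chars + 1)  # +1 for CTC blank

    def forward(self, feats):             # feats: (batch, time, n_features)
        encoded, _ = self.encoder(feats)
        return self.classifier(encoded).log_softmax(-1)

model = TinyE2EASR()
ctc = nn.CTCLoss(blank=40)                # blank index matches the extra class
feats = torch.randn(4, 200, 80)           # 4 utterances, 200 frames each
targets = torch.randint(0, 40, (4, 30))   # character indices of transcripts
log_probs = model(feats).transpose(0, 1)  # CTC expects (time, batch, classes)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 200, dtype=torch.long),
           target_lengths=torch.full((4,), 30, dtype=torch.long))
loss.backward()                            # gradients flow through the whole pipeline
```

Production systems, including the transformer-based one described below, use far larger models and real data, but the principle of optimizing one differentiable pipeline end to end is the same.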

Even with these advancements, Ali still cannot give a voice command to most devices in his native Arabic. “It’s 2021 and I still can’t speak to many machines in my dialect,” he says. “I mean, now I have a device that can understand my English, but the automatic recognition of Arabic speech in its various dialects hasn’t happened yet.”

Making this happen is the focus of Ali’s work, which has culminated in the first transformer for recognizing Arabic speech and its dialects, one that has achieved unmatched performance so far. Dubbed the QCRI Advanced Transcription System, the technology is currently being used by the broadcasters Al Jazeera, DW, and the BBC to transcribe online content.

There are a few reasons Ali and his team have succeeded in building these speech engines right now. Primarily, he says, “there is a need to have resources across all of the dialects. We need to build up the resources so we can then train the model.” Advances in computer processing mean that computationally intensive machine learning now happens on graphics processing units, which can rapidly process and display complex graphics. As Ali says, “We have a great architecture, good modules, and we have data that represents reality.”

Researchers from QCRI and Kanari AI recently built models that achieve human parity in Arabic broadcast news. The system demonstrates the impact of subtitling Al Jazeera’s daily reports. While the English human error rate (HER) is about 5.6%, the research revealed that the HER for Arabic is significantly higher and can reach 10%, owing to the morphological complexity of the language and the lack of standard orthographic rules in dialectal Arabic. Thanks to recent advances in deep learning and end-to-end architectures, the Arabic speech recognition engine manages to outperform native speakers on broadcast news.
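For context on those percentages, the error rate here is a word error rate: the number of substituted, deleted, and inserted words in a transcript divided by the number of words actually spoken, computed with a standard Levenshtein alignment. A minimal sketch (the example sentences are invented):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 error / 6 words = 0.167
```

A 10% rate thus means roughly one word in ten is transcribed wrong, nearly double the English figure.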

While Modern Standard Arabic speech recognition seems to work well, researchers at QCRI and Kanari AI are absorbed in testing the limits of dialectal processing, with excellent results. Since nobody speaks Modern Standard Arabic at home, paying attention to dialect is what it will take for our voice assistants to understand us.

This content was written by the Qatar Computing Research Institute, Hamad Bin Khalifa University, a member of Qatar Foundation. It was not written by the editorial staff of MIT Technology Review.


