Meta AI Releases BELEBELE: A Multilingual Natural Language Understanding Benchmark for Language Models

Meta AI, in collaboration with Reka AI and Abridge AI, has introduced BELEBELE, a dataset designed for evaluating natural language understanding across 122 language variants. The benchmark comprises 900 multiple-choice reading comprehension questions in each language variant, challenging the capabilities of multilingual masked language models (MLMs) and large language models (LLMs).
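To make the multiple-choice format concrete, here is a minimal sketch of how such items are often scored with LLMs: each candidate answer is appended to the passage and question, the model scores each completion, and the highest-scoring choice is selected. The scoring function below is a mock stand-in, not the paper's actual evaluation code; `pick_answer` and `accuracy` are illustrative names.

```python
from typing import Callable

def pick_answer(passage: str, question: str, choices: list[str],
                score_fn: Callable[[str], float]) -> int:
    """Return the index of the highest-scoring answer choice.

    score_fn stands in for a model's score (e.g., log-likelihood)
    of the full prompt; a real evaluation would call an actual LM here.
    """
    prompts = [f"{passage}\nQ: {question}\nA: {c}" for c in choices]
    scores = [score_fn(p) for p in prompts]
    return max(range(len(scores)), key=scores.__getitem__)

def accuracy(predictions: list[int], gold: list[int]) -> float:
    """Fraction of items where the predicted choice matches the gold label."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

if __name__ == "__main__":
    # Mock scorer for illustration only: longer prompts score higher.
    mock_score = lambda prompt: float(len(prompt))
    idx = pick_answer("A short passage.", "Which choice is longest?",
                      ["a", "bb", "ccc", "dd"], mock_score)
    print(idx)                      # index of the longest choice
    print(accuracy([idx], [2]))
```

In practice, evaluations of this kind vary in how they normalize scores (e.g., per-token log-likelihood) and how the prompt is templated per language.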

Unlike many existing benchmarks, BELEBELE was created by human experts without relying on machine translation, ensuring high quality and alignment across translations. It covers languages with non-Latin scripts and exposes a substantial performance gap between humans and models, underscoring the need for improvement.

The study also explores how pretraining data and cross-lingual transfer impact language models’ multilingual capabilities, revealing insights crucial for the development of inclusive AI systems.

[2308.16884] The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants (arxiv.org)