Dryad

A benchmark Arabic dataset for question classification with AAFAQ taxonomy

Data files

Jul 17, 2025 version files 4.49 MB

Abstract

Arabic Natural Language Processing (NLP) is hampered both by the morphological complexity of the language itself and by a shortage of high-quality annotated resources. In this work, we introduce the AAFAQ Dataset, an open-domain resource for developing semantic and cognitive question classification in Modern Standard Arabic (MSA). The dataset consists of 5,009 records annotated with rich attributes such as Question Tool, Intent, Answer Type, Cognitive Level, and Temporal Context, among others. Based on the AAFAQ Taxonomy, which symbolizes the "horizons" of question understanding, the dataset extends the frontier of Arabic question answering systems (QAS) by capturing the semantic and contextual intricacies of Arabic questions. Its utility was tested by fine-tuning AraBERT on the dataset, which yielded very high classification performance; integration with Alpaca + Gemma-9B Unsloth models further improved the metrics by leveraging multi-attribute classification. The AAFAQ Dataset thus provides a comprehensive resource for Arabic question classification and is positioned as a benchmark for research in Arabic NLP, with the aim of advancing education, cognitive research, and multilingual AI systems.
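
As an illustration of the multi-attribute annotation described above, the following is a minimal Python sketch of how one record might be represented and how label counts for a single attribute could be tallied. The field names, the label values, and the two example questions are assumptions made for illustration only; they are not taken from the released files, whose actual column names and label sets should be checked first.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionRecord:
    """One annotated question. Field names are hypothetical, modeled on the
    attributes named in the abstract, not on the real file headers."""
    question: str
    question_tool: str
    intent: str
    answer_type: str
    cognitive_level: str
    temporal_context: str


def label_distribution(records, attribute):
    """Count how often each label of the given attribute occurs."""
    return Counter(getattr(record, attribute) for record in records)


# Hypothetical records; every label value here is invented for illustration.
EXAMPLE_RECORDS = [
    QuestionRecord(
        question="متى تأسست جامعة القاهرة؟",
        question_tool="متى",
        intent="factual",
        answer_type="date",
        cognitive_level="remember",
        temporal_context="past",
    ),
    QuestionRecord(
        question="لماذا يتغير لون السماء عند الغروب؟",
        question_tool="لماذا",
        intent="explanation",
        answer_type="reason",
        cognitive_level="understand",
        temporal_context="timeless",
    ),
]

if __name__ == "__main__":
    for attribute in ("intent", "cognitive_level"):
        print(attribute, dict(label_distribution(EXAMPLE_RECORDS, attribute)))
```

Keeping each attribute as a separate field means one question can serve as a training example for several classifiers at once, which is the sense in which the abstract speaks of multi-attribute classification.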