A benchmark Arabic dataset for question classification with AAFAQ taxonomy

Essam, Mariam; Deif, Mohanad; Ali Algamdi, Shabbab; Elgohary, Rania

Published Jul 17, 2025 on Dryad. https://doi.org/10.5061/dryad.9w0vt4brx

Data files

Jul 17, 2025 version files (4.49 MB total)

AAFAQ_Dataset_utf_formatted_Translated_UTF8_BOM_1.csv (1.65 MB)
AAFAQ_Dataset_utf_formatted.csv (2.84 MB)
README.md (3.58 KB)

Abstract

Arabic Natural Language Processing (NLP) is hampered both by the morphological complexity of the language and by a shortage of high-quality annotated resources. In this work, we introduce the AAFAQ Dataset, an open-domain resource for developing semantic and cognitive question classification in Modern Standard Arabic (MSA). The dataset consists of 5,009 records annotated with attributes such as Question Tool, Intent, Answer Type, Cognitive Level, and Temporal Context, among others. Built on the AAFAQ Taxonomy, whose name evokes the "horizons" of question understanding, the dataset is intended to help Arabic Question Answering Systems (QAS) capture the semantic and contextual intricacies of Arabic questions. Its utility was tested by fine-tuning AraBERT on it, which yielded very high classification performance; integration with Alpaca + Gemma-9B Unsloth models showed improved metrics when leveraging multi-attribute classification. The AAFAQ Dataset thus offers a comprehensive resource and benchmark for Arabic question classification, with potential applications in education, cognitive research, and multilingual AI systems.


Description of the data and file structure

The AAFAQ Dataset is an Arabic dataset designed for semantic and cognitive question classification in Modern Standard Arabic (MSA). It consists of 5,009 records annotated with attributes including Question Tool, Intent, Answer Type, Cognitive Level, and Temporal Context, among others. The dataset was collected and then validated through experiments that fine-tuned Arabic NLP models, such as AraBERT, for multi-label question classification. Additional experiments integrated the dataset with generative answering systems such as Alpaca + Gemma-9B Unsloth to improve metrics for multi-attribute classification. The dataset is intended as a benchmark resource for research in Arabic NLP, with applications in education, cognitive research, and multilingual AI systems.

The name AAFAQ (Arabic: آفَاق) means "horizons," symbolizing the goal of expanding understanding and pushing the boundaries of Arabic Question Answering Systems (QAS) toward deeper comprehension of the language's complexities.

Files Included

AAFAQ_Dataset_utf_formatted.csv: The primary dataset file in CSV format, containing 5,009 rows and 15 columns.

Columns:
1. QuestionID: Unique numerical identifier for each question.
2. QuestionText: The text of the question.
3. QuestionTool: The interrogative tool or word used in the question (e.g., "What," "Why").
4. QuestionToolType: Type of interrogative tool (e.g., imperative, implicit).
5. QuestionType: Categorization as factoid or non-factoid.
6. List: Indicates whether the expected answer is a list.
7. AnswerType: The type of answer expected (e.g., number, description).
8. Intent: Purpose or intent of the question (e.g., informational, explanatory).
9. CognitiveLevel: Cognitive skill level required to answer the question.
10. Category: The domain or field of the question (e.g., health, education).
11. Subjectivity: Whether the question is subjective or objective.
12. TemporalContext: Time frame referenced in the question (e.g., past, present).
13. PurposeContext: Goal of the question (e.g., problem-solving, decision-making).
14. AnswerSourceText: Source text from which the answer is derived.
15. Answer: The actual answer to the question.
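
As a quick check of the column list above, the following sketch loads an AAFAQ CSV with pandas and reports any header that differs from the names given in this README. It assumes the CSV headers are spelled exactly as listed, which should be verified against the file itself. `load_aafaq` and `EXPECTED_COLUMNS` are illustrative names and not part of the dataset, and the in-memory sample only stands in for the real file.

```python
import io

import pandas as pd

# Column names as listed in this README. That the CSV header uses exactly
# these spellings is an assumption to verify against the actual file.
EXPECTED_COLUMNS = [
    "QuestionID", "QuestionText", "QuestionTool", "QuestionToolType",
    "QuestionType", "List", "AnswerType", "Intent", "CognitiveLevel",
    "Category", "Subjectivity", "TemporalContext", "PurposeContext",
    "AnswerSourceText", "Answer",
]


def load_aafaq(source):
    """Read an AAFAQ CSV and report headers that differ from the README list."""
    # "utf-8-sig" decodes UTF-8 and drops a leading byte-order mark if present.
    df = pd.read_csv(source, encoding="utf-8-sig")
    df.columns = [str(c).strip() for c in df.columns]
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    extra = [c for c in df.columns if c not in EXPECTED_COLUMNS]
    return df, missing, extra


# Placeholder one-row sample with a BOM (not real dataset content).
sample = (
    "\ufeff" + ",".join(EXPECTED_COLUMNS) + "\n"
    + ",".join(["1"] + ["x"] * 14) + "\n"
).encode("utf-8")
df, missing, extra = load_aafaq(io.BytesIO(sample))
print(df.shape, missing, extra)
```

For the real data, pass the file path instead of the in-memory sample, for example `load_aafaq("AAFAQ_Dataset_utf_formatted.csv")`, and inspect `missing` and `extra` before relying on any column name.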

AAFAQ_Dataset_utf_formatted_Translated_UTF8_BOM_1.csv: The English-translated version of the dataset. Please note that the following fields were not translated: QuestionText, QuestionParticle, and Answer. These fields contain linguistically sensitive content, and translating them could lose meaning or misrepresent the original context. To preserve the integrity and accuracy of the data, we have kept them in the original Arabic.
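
If both files are used together, one possible approach is to join them so that each Arabic record carries its English labels alongside. The sketch below assumes that both files share the QuestionID column and that each ID appears once per file; neither is confirmed by this README. `attach_english_labels` is a hypothetical helper, and the toy frames contain placeholder values rather than dataset records.

```python
import pandas as pd

# Fields this README says were left in Arabic in the translated file.
UNTRANSLATED = ("QuestionText", "QuestionParticle", "Answer")


def attach_english_labels(arabic_df, english_df, key="QuestionID"):
    """Add English label columns (suffix '_en') to the Arabic frame."""
    # Drop the untranslated fields from the English side to avoid duplicates.
    english = english_df.drop(
        columns=[c for c in UNTRANSLATED if c in english_df.columns]
    )
    # validate="one_to_one" raises an error if an ID is repeated in either file.
    return arabic_df.merge(
        english, on=key, how="left",
        suffixes=("", "_en"), validate="one_to_one",
    )


# Toy frames with placeholder values (assumption: both files share QuestionID).
arabic = pd.DataFrame({
    "QuestionID": [1, 2],
    "QuestionText": ["q1", "q2"],
    "Intent": ["label_ar_1", "label_ar_2"],
})
english = pd.DataFrame({
    "QuestionID": [2, 1],
    "QuestionText": ["q2", "q1"],
    "Intent": ["explanatory", "informational"],
})
merged = attach_english_labels(arabic, english)
print(merged)
```

The left join keeps the row order of the Arabic file, so rows without a matching English record would show up with missing values in the `_en` columns.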

Dataset Features

Size: Approximately 2.82 MB
Format: CSV
Annotations: Rich manual annotations validated with high inter-annotator agreement (85%).
Coverage: Semantic, cognitive, and contextual intricacies of Arabic questions
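
Before training on the annotations, it can help to look at how the labels of each attribute are distributed, since imbalanced classes affect both training and evaluation. The following is a minimal sketch with pandas. `label_distribution` is an illustrative helper, and the toy frame uses made-up labels rather than values taken from the dataset.

```python
import pandas as pd


def label_distribution(df, columns):
    """Return {column: relative label frequencies, most frequent first}."""
    return {
        # dropna=False keeps missing annotations visible as their own entry.
        col: df[col].value_counts(normalize=True, dropna=False)
        for col in columns
    }


# Toy frame with made-up labels (not dataset records).
toy = pd.DataFrame({
    "Intent": ["informational", "informational", "explanatory", "informational"],
    "TemporalContext": ["past", "present", "present", "present"],
})
dist = label_distribution(toy, ["Intent", "TemporalContext"])
for name, freqs in dist.items():
    print(name)
    print(freqs)
```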

Code/software Requirements

Python (Version 3.7 or higher)
Libraries: Pandas, NumPy, Matplotlib (for data exploration and visualization).
NLP Frameworks:
- AraBERT for fine-tuning language models.
- Hugging Face Transformers for model integration.
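
Classification heads in frameworks such as Hugging Face Transformers generally expect integer class ids rather than label strings, so a common preparation step is to build one label-to-id mapping per annotated attribute. The sketch below shows that step with pandas only. It is not the authors' pipeline; `build_label_maps` and `encode_labels` are hypothetical helpers, and the toy labels are placeholders.

```python
import pandas as pd


def build_label_maps(df, label_columns):
    """Map each attribute's label strings to contiguous integer ids."""
    maps = {}
    for col in label_columns:
        # Sorting keeps the ids stable across runs on the same data.
        labels = sorted(df[col].dropna().astype(str).unique())
        maps[col] = {label: i for i, label in enumerate(labels)}
    return maps


def encode_labels(df, label_maps):
    """Return a copy of df with one '<col>_id' column per attribute."""
    out = df.copy()
    for col, mapping in label_maps.items():
        # Values absent from the mapping (including missing ones) become NaN.
        out[col + "_id"] = out[col].astype(str).map(mapping)
    return out


# Toy frame with placeholder labels (not dataset records).
toy = pd.DataFrame({
    "QuestionText": ["q1", "q2", "q3"],
    "Intent": ["informational", "explanatory", "informational"],
    "QuestionType": ["factoid", "non-factoid", "factoid"],
})
label_maps = build_label_maps(toy, ["Intent", "QuestionType"])
encoded = encode_labels(toy, label_maps)
print(label_maps)
print(encoded)
```

The size of each mapping gives the number of classes for that attribute, which is the value a sequence classification model needs when its output layer is configured.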