Dryad

MLRegTest: A benchmark for the machine learning of regular languages

Data files

Nov 17, 2023 version: 69.14 GB
Jul 13, 2024 version: 106.41 GB

Abstract

MLRegTest is a benchmark for machine learning systems on sequence classification. It contains training, development, and test sets from 1,800 regular languages. MLRegTest organizes its languages according to their logical complexity (monadic second-order, first-order, propositional, or monomial expressions) and the kind of logical literals they use (string, tier-string, subsequence, or combinations thereof). Together, the logical complexity and the choice of literal provide a systematic way to understand different kinds of long-distance dependencies in regular languages, and therefore to understand the capacities of different ML systems to learn such long-distance dependencies.
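
To make the three kinds of literals concrete, here is a minimal Python sketch. It assumes the usual reading of the terms: a string literal is a contiguous substring, a subsequence literal allows arbitrary gaps, and a tier-string literal is a substring of the word after symbols outside a chosen tier are erased. The function names, the example word, and the tiers are illustrative assumptions and are not taken from the MLRegTest data files or code.

```python
def has_substring(word: str, factor: str) -> bool:
    """String literal (assumed reading): factor occurs contiguously in word."""
    return factor in word


def has_subsequence(word: str, factor: str) -> bool:
    """Subsequence literal: symbols of factor occur in order, gaps allowed."""
    symbols = iter(word)
    # Each membership test consumes the iterator up to the match,
    # so later symbols must be found strictly after earlier ones.
    return all(symbol in symbols for symbol in factor)


def has_tier_substring(word: str, factor: str, tier: set) -> bool:
    """Tier-string literal: factor is contiguous once non-tier symbols are erased."""
    projected = "".join(symbol for symbol in word if symbol in tier)
    return factor in projected


# Hypothetical example word over the alphabet {a, b, c}.
word = "acbcb"
print(has_substring(word, "ab"))                    # False: 'c' intervenes
print(has_subsequence(word, "ab"))                  # True: a ... b in order
print(has_tier_substring(word, "ab", {"a", "b"}))   # True: projection is "abb"

# Tier-string and subsequence literals differ when a tier symbol intervenes.
print(has_subsequence("bab", "bb"))                 # True
print(has_tier_substring("bab", "bb", {"a", "b"}))  # False: 'a' is on the tier
```

The string literal sees only local context. The other two can relate symbols that are arbitrarily far apart, which is the sense in which the choice of literal determines the kind of long-distance dependency a language encodes.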