Text understanding in GPT-4 vs humans
Data files (Apr 28, 2025 version, 488.32 KB total)
- README.md (1.97 KB)
- stored_data_from_GPT-4.zip (486.34 KB)
Abstract
We examine whether a leading AI system, GPT-4, understands text as well as humans do, first using a well-established standardized test of discourse comprehension. On this test, GPT-4 performs slightly better than humans, though the difference is not statistically significant given the very high level of human performance. Both GPT-4 and humans make correct inferences about information that is not explicitly stated in the text, a critical test of understanding. Next, we use more difficult passages to determine whether larger differences between GPT-4 and humans could emerge. GPT-4 does considerably better on this more difficult text than do the high school and university students for whom these passages are designed as part of reading-comprehension admission tests. Deeper exploration of GPT-4's performance on material from one of these admission tests reveals generally accepted signatures of genuine understanding, namely generalization and inference.
https://doi.org/10.5061/dryad.jq2bvq8jk
Description of the data and file structure
The top-level folder holds two subfolders, Section 2 and Section 3, referring to sections of the results in the manuscript.
Section 2
The Section 2 folder has two subfolders. The first (discourse comprehension data main expt) deals with an experiment wherein GPT-4 reads 11 stories from the Discourse Comprehension Test and responds with textual answers to 8 yes/no questions per story. This subfolder contains a Word document for each of the 11 stories, which includes the prompt and questions given to GPT-4, the question types and correct answers, and the answers given by GPT-4. There is also a Word document (Abbreviations in the score correct file of section 2 data) describing the abbreviations used in scoring the responses, and a spreadsheet containing the scoring itself (score correct.xlsx).
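For readers who want to work with the scoring spreadsheet programmatically, a minimal Python sketch follows. The file path and the column names used here (story, question_type, correct) are assumptions for illustration, not documented fields; adjust them to the actual headers in score correct.xlsx.

```python
# Minimal sketch: load the Section 2 scoring spreadsheet with pandas.
# The path and column names ("story", "question_type", "correct") are
# hypothetical; check the actual headers in score correct.xlsx.
import pandas as pd

scores = pd.read_excel(
    "Section 2/discourse comprehension data main expt/score correct.xlsx"
)

# Proportion correct overall and broken down by question type
# (e.g., stated vs. implied information).
print(scores["correct"].mean())
print(scores.groupby("question_type")["correct"].mean())
```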
The second subfolder (implicit main ideas expt) concerns an experiment in which GPT-4 is encouraged to summarize each story and make inferences from the text. There is a Word file for each of the 11 stories presenting GPT-4's textual responses. Each inference is marked with a yellow background, and the number of inferences per story is counted. Another file (inferences mean sd range.xlsx) presents a statistical analysis of the numbers of inferences.
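Because the inferences are marked by yellow highlighting in the Word files, they can be extracted programmatically. The following is a minimal sketch using python-docx; the file name is hypothetical, and since one highlighted span may be split across several runs, adjacent highlighted runs are merged before counting.

```python
# Minimal sketch: extract yellow-highlighted inference spans from one of
# the Section 2 Word files using python-docx. The file name below is
# hypothetical. Adjacent highlighted runs are merged, since a single
# inference may be split across several runs within a paragraph.
from docx import Document
from docx.enum.text import WD_COLOR_INDEX

doc = Document("implicit main ideas expt/story1.docx")  # hypothetical name

inferences, current = [], ""
for para in doc.paragraphs:
    for run in para.runs:
        if run.font.highlight_color == WD_COLOR_INDEX.YELLOW:
            current += run.text
        elif current:
            inferences.append(current.strip())
            current = ""
    if current:  # a highlight ending at a paragraph boundary
        inferences.append(current.strip())
        current = ""

print(len(inferences), "inferences found")
```

Counts obtained this way can then be compared against the summary statistics in inferences mean sd range.xlsx.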
Section 3
The Section 3 subfolder concerns GPT-4's responses to 14 multiple-choice questions from an LSAT test, used for admission to top US law schools. The topics are pop art, climate change, aboriginal rights, and potential speculative bubbles in tulip markets. GPT-4 answers all 14 questions correctly and is also asked to justify its answers. The resulting GPT-4 responses are scored for going beyond the stated literal text, in the form of generalization and inference; these passages are identified with a yellow background.
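As a quick sanity check of how unlikely a perfect score is under random guessing, here is a small Python sketch. It assumes five answer choices per question, which matches the standard LSAT format but is not stated in the dataset files themselves.

```python
# Exact binomial test: probability of 14/14 correct under random guessing,
# assuming five answer choices per question (chance = 0.2). The chance
# level is an assumption based on the LSAT format, not on the dataset.
from scipy.stats import binomtest

result = binomtest(k=14, n=14, p=0.2, alternative="greater")
print(f"p-value under guessing: {result.pvalue:.2e}")  # ~1.6e-10
```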
The dataset comprises text responses from GPT-4 after reading passages of text, scored and organized into folders as described above.