
A relationship between Autism-Spectrum Quotient and face viewing behavior in 98 participants

Cite this dataset

Wegner-Clemens, Kira; Rennig, Johannes; Beauchamp, Michael S (2020). A relationship between Autism-Spectrum Quotient and face viewing behavior in 98 participants [Dataset]. Dryad. https://doi.org/10.5061/dryad.zpc866t5c

Abstract

Faces are one of the most important stimuli that we encounter, but humans vary dramatically in their behavior when viewing a face: some individuals preferentially fixate the eyes, others fixate the mouth, and still others show an intermediate pattern. The determinants of these large individual differences are unknown. However, individuals with Autism Spectrum Disorder (ASD) spend less time fixating the eyes of a viewed face than controls, suggesting the hypothesis that autistic traits in healthy adults might explain individual differences in face viewing behavior. Autistic traits were measured in 98 healthy adults recruited from an academic setting using the Autism-Spectrum Quotient, a validated 50-statement questionnaire. Fixations were measured using a video-based eye tracker while participants viewed two different types of audiovisual movies: short videos of talkers speaking single syllables and longer videos of talkers speaking sentences in a social context. For both types of movies, there was a positive correlation between Autism-Spectrum Quotient score and the percent of time fixating the lower half of the face, which explained from 4% to 10% of the variance in individual face viewing behavior. This effect suggests that in healthy adults, autistic traits are one of many factors that contribute to individual differences in face viewing behavior.

Methods

Design and Participants

Participants provided written informed consent under an experimental protocol approved by the Committee for the Protection of Human Participants of Baylor College of Medicine, Houston, TX. Experimenters who recorded the stimuli gave informed consent for publication of identifying images in an online open-access publication. All experiments were performed in accordance with relevant guidelines and regulations. Participants were primarily recruited in a university setting, from Baylor College of Medicine and Rice University students and employees. A power analysis (G*Power) based on an expected effect size of r = 0.45 was conducted prior to data collection and suggested a sample size of 71.

Participants were presented with stimuli consisting of auditory and visual recordings of human talkers and then completed the Autism-Spectrum Quotient as a measure of autistic traits. Stimuli were presented through Matlab with Psychtoolbox. In Experiment 1, participants (n = 98; 66 female, mean age 21, age range 18-45) were shown short syllable videos recorded in the lab and given an explicit task. In Experiment 2, participants (n = 70; 49 female, mean age 20, age range 18-45; all native English speakers) were shown full-sentence videos taken from longer speeches and interviews and were not given a task.

Autism-Spectrum Quotient

Participants completed the Autism-Spectrum Quotient (AQ) online before the test session. This self-administered instrument consists of 50 items designed to assess autistic traits in adults of normal intelligence. The items reflect the differences in social skills, communication skills, imagination, attention to detail, and attention-switching noted in ASD. Each item is a statement about the participant's ability or preference, which the participant rates as definitely agree, slightly agree, slightly disagree, or definitely disagree. The test was scored using the method described in the original presentation of the test: each answer was collapsed into one of two categories ("yes" for definitely agree or slightly agree; "no" for slightly disagree or definitely disagree), assigned a value of 1 or 0 depending on the question, and the scores across all questions were summed. A higher score indicates a higher degree of autistic-like traits, with a maximum score of 50 points.
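
For illustration, a minimal MATLAB sketch of this binary scoring scheme is given below. The variable names and the logical vector marking which items are keyed in the "agree" direction are placeholders; the actual item key is given in the original AQ publication.

```matlab
% Minimal sketch of the binary AQ scoring scheme described above.
% `responses` is a 50-element cell array of rating strings and `agreeKeyed`
% is a hypothetical logical vector marking which items score a point for
% agreement (the true item key comes from the original AQ paper).
function total = scoreAQ(responses, agreeKeyed)
    total = 0;
    for i = 1:50
        % Collapse the four rating options into a yes/no agreement
        agreed = any(strcmp(responses{i}, ...
            {'definitely agree', 'slightly agree'}));
        if agreed == agreeKeyed(i)   % score 1 when the answer matches the keyed direction
            total = total + 1;
        end
    end
end
```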

 

Experiment 1: Syllable Movies

Participants (n = 98; 66 female, mean age 21, age range 18-45) completed a syllable identification task on short audiovisual speech movies. Each trial began with a fixation crosshair presented outside the location where the face would appear, in order to simulate natural viewing conditions in which faces rarely appear at the center of gaze (7). The crosshair then disappeared, and participants were free to fixate anywhere. The visual stimulus appeared for 2 seconds, after which participants had 1 second to report the syllable ("ba," "da," or "ga") they perceived via button press. Four different speakers appeared and repeated each syllable 60 times, for a total of 240 trials. The stimuli were randomized by speaker and syllable, and the same order was presented to all participants. The trials were divided into 4 blocks, separated by rest intervals during which the eye tracker was recalibrated. Total eye tracking time was 12 minutes. Only one speaker was shown per trial, and each directly faced the camera. The face subtended approximately 10 cm by 13 cm (6 degrees wide by 8 degrees high).
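
The trial timing described above could be implemented in Psychtoolbox along the following lines. This is a hedged sketch, not the authors' presentation code: the window pointer `win`, movie path `movFile`, off-face crosshair position `fixXY`, and the 0.5 s crosshair duration are assumed placeholders.

```matlab
% Sketch of one syllable trial: offset crosshair, 2 s movie, 1 s response window.
Screen('DrawLines', win, [-10 10 0 0; 0 0 -10 10], 2, [0 0 0], fixXY);  % crosshair away from face location
Screen('Flip', win);
WaitSecs(0.5);                                   % assumed crosshair duration

[moviePtr, dur] = Screen('OpenMovie', win, movFile);
Screen('PlayMovie', moviePtr, 1);
t0 = GetSecs;
while GetSecs - t0 < 2                           % 2 s audiovisual stimulus
    tex = Screen('GetMovieImage', win, moviePtr);
    if tex > 0
        Screen('DrawTexture', win, tex);
        Screen('Flip', win);
        Screen('Close', tex);
    end
end
Screen('PlayMovie', moviePtr, 0);
Screen('CloseMovie', moviePtr);

respStart = GetSecs; response = '';
while GetSecs - respStart < 1                    % 1 s to report "ba," "da," or "ga"
    [keyDown, ~, keyCode] = KbCheck;
    if keyDown
        response = KbName(keyCode);
        break
    end
end
```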

 

Experiment 2: Sentence Movies

A subset of the Experiment 1 participants (n = 70; 49 female, mean age 20, age range 18-45; all native English speakers) completed a short movie watching task. Participants watched 40 short clips taken from interviews or speeches that had been uploaded to YouTube under a Creative Commons license. As in the syllable videos, each trial began with a fixation cross outside of where the face would later appear. The clips lasted an average of 23 seconds, ranging from 13 to 42 seconds. The clips were shown in approximately 4-minute blocks of 10 videos each, separated by rest intervals during which the eye tracker was recalibrated. The total task time was 15.5 minutes. Movies were balanced across blocks by length and speaker gender, and the stimuli were presented in the same order to all participants. Participants were not given a specific task and were not required to make a behavioral response after the videos, in order to encourage viewing as close to natural free viewing as possible. Each movie showed only one speaker and was cropped so that the speaker's face filled most of the frame, minimizing background distractions. Some clips featured direct gaze from the speaker, while in others the speaker looked to the side. A different speaker appeared on each trial, with 18 clips featuring female speakers and 22 featuring male speakers. The stimuli filled the entire screen (70 cm x 40 cm) and the face subtended approximately 14 cm by 18 cm (9 degrees wide by 11 degrees high).

 

Eye Tracking Methods and Analysis

Participants' eye movements were recorded using an infrared eye tracker (EyeLink 1000 Plus, SR Research Ltd., Ottawa, Ontario, Canada) as they viewed visual stimuli presented on a display (Display++ LCD monitor, 32", 1920 × 1080, 120 Hz, Cambridge Research Systems, Rochester, UK) and listened to speech through speakers located on either side of the screen. Eye tracking stability was increased with a chin rest placed 90 cm from the display. Eye tracking was performed at a sampling rate of 500 Hz. The eye tracker was calibrated using a 9-target array before each block, for a total of 4 times in each task. Fixations, saccades, and blinks were identified by SR Research's EyeLink software. Saccades, blinks, and fixations that started before stimulus onset were excluded from the analysis.
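
A minimal sketch of this event filtering, assuming the EyeLink events have already been parsed into a MATLAB struct array (the field names here are hypothetical, not part of the dataset):

```matlab
% `events` is a hypothetical struct array parsed from the EyeLink output,
% with fields `type` ('fixation', 'saccade', or 'blink') and `startTime`
% (ms); `stimOnset` is the stimulus onset time for the current trial.
isFix     = strcmp({events.type}, 'fixation');
startedOk = [events.startTime] >= stimOnset;
kept      = events(isFix & startedOk);   % drop saccades, blinks, and pre-onset fixations
```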

In order to summarize the fixation data, we used a region of interest (ROI) approach. For the syllable videos, each fixation on each trial was marked as falling within or outside the lower-face region of interest. ROIs were hand-drawn for each speaker, with a dividing line drawn at the tip of the speaker's nose. Since the syllable videos were only 2 seconds long and the speaker's head did not move, the same ROI coordinates were used for every frame. Fixation durations falling within the lower-face region were then summed and divided by the total fixation time for the trial to calculate the percent of time spent fixating the lower face on each trial. These trial percentages were then averaged, weighted by trial duration, to calculate the percent of time each participant spent fixating the lower face.
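
A sketch of this per-participant summary is shown below, with hypothetical per-fixation vectors (ROI membership, duration, and trial index) standing in for the parsed eye tracking data:

```matlab
% `inLowerROI` (logical), `fixDur` (seconds), and `trialID` are hypothetical
% vectors with one entry per retained fixation: whether it fell inside the
% hand-drawn lower-face ROI, its duration, and the trial it belongs to.
nTrials  = max(trialID);
pctLower = zeros(nTrials, 1);
trialDur = zeros(nTrials, 1);
for t = 1:nTrials
    idx = (trialID == t);
    trialDur(t) = sum(fixDur(idx));                                   % total fixation time in trial
    pctLower(t) = 100 * sum(fixDur(idx & inLowerROI)) / trialDur(t);  % percent on lower face
end
% Average across trials, weighting each trial by its total fixation time
participantPct = sum(pctLower .* trialDur) / sum(trialDur);
```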

The sentence videos were taken from existing recordings rather than created in the lab, resulting in variation in face position from frame to frame. Together with the length of the videos, this precluded the use of hand-drawn ROIs. Instead, the Cascade Object Detector in Matlab's Computer Vision System Toolbox was used to automatically generate a box surrounding the face in each video frame. A location at 40% of the face box height was selected as the dividing line between the upper and lower face, in order to align most closely with the tip-of-the-nose dividing line used in the syllable task. To check the accuracy of the face detection tool, all ROIs were visualized on top of the original videos. If a face was not identified in a frame, or its location was identified incorrectly, the coordinates from the preceding and succeeding frames were averaged to estimate a location.
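
A sketch of this face-box pipeline using vision.CascadeObjectDetector is shown below. The file name, the handling of frames with zero or multiple detections, and the convention of measuring the 40% dividing line from the top of the box are assumptions for illustration, not details taken from the dataset.

```matlab
% Automatically box the face in each frame, then place the upper/lower
% dividing line at 40% of the face-box height.
detector  = vision.CascadeObjectDetector();      % default frontal-face model
vid       = VideoReader('sentence_clip.mp4');    % placeholder file name
faceBoxes = [];                                  % one [x y w h] row per frame
while hasFrame(vid)
    frame = readFrame(vid);
    bbox  = step(detector, frame);               % may return zero or several boxes
    if size(bbox, 1) == 1
        faceBoxes(end+1, :) = bbox;              %#ok<AGROW>
    else
        faceBoxes(end+1, :) = NaN(1, 4);         % flag missed or ambiguous frames
    end
end
% Interpolate missed frames from their neighbors (equivalent to averaging the
% preceding and succeeding frames for isolated gaps).
faceBoxes = fillmissing(faceBoxes, 'linear');
% Dividing line measured down from the top of the box (assumed convention)
divideY = faceBoxes(:, 2) + 0.4 * faceBoxes(:, 4);
```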

The fixation location at the time of presentation of each video frame was measured. To calculate the percentage of fixation time, the number of frames with fixation in the lower face was divided by the total number of frames in each video. These values were then averaged across all videos, weighted by the number of frames (since all videos had the same frame rate), to calculate a grand mean lower-face fixation percentage for each participant.
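
A sketch of this frame-wise summary, assuming a hypothetical cell array with one logical entry per frame indicating whether the current fixation fell below that frame's dividing line:

```matlab
% `fixInLower{v}` is a hypothetical logical vector, one entry per frame of
% video v, true when the fixation at that frame was on the lower face.
nVids   = numel(fixInLower);
pct     = zeros(nVids, 1);
nFrames = zeros(nVids, 1);
for v = 1:nVids
    nFrames(v) = numel(fixInLower{v});
    pct(v)     = 100 * sum(fixInLower{v}) / nFrames(v);   % percent of frames on lower face
end
% Weight each video by its frame count (all videos share the same frame rate)
grandMean = sum(pct .* nFrames) / sum(nFrames);
```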

Funding

National Institute of Neurological Disorders and Stroke, Award: R01NS065395