Validating marker-less pose estimation with 3D x-ray radiography
Moore, Dalton; Walker, Jeffrey; MacLean, Jason; Hatsopoulos, Nicholas (2022), Validating marker-less pose estimation with 3D x-ray radiography, Dryad, Dataset, https://doi.org/10.5061/dryad.d7wm37q2z
These data were generated to evaluate the accuracy of DeepLabCut (DLC), a deep learning marker-less motion capture approach, by comparing it to a 3D x-ray video radiography system that tracks markers placed under the skin (XROMM). We recorded behavioral data simultaneously with XROMM and RGB video as marmosets foraged and reconstructed three-dimensional kinematics in a common coordinate system. We used XMALab to track 11 XROMM markers, and we used the toolkit Anipose to filter and triangulate DLC trajectories of 11 corresponding markers on the forelimb and torso. We performed a parameter sweep of relevant Anipose and post-processing parameters to characterize their effect on tracking quality. We compared the median error of DLC+Anipose to human labeling performance and placed this error in the context of the animal's range of motion.
These experiments were conducted with two common marmosets (Callithrix jacchus): an 8-year-old, 356 g male and a 7-year-old, 418 g female. All methods were approved by the Institutional Animal Care and Use Committee of the University of Chicago.
The two marmosets were placed together in a 1m x 1m x 1m cage with a modular foraging apparatus attached to the top of the cage, as previously described by Walker et al. (2020). The marmosets were allowed to forage voluntarily throughout recording sessions that lasted 1-2 hours. Recordings of individual trials were triggered manually with a foot pedal by the experimenters when the marmosets appeared ready to initiate a reach. The manual trigger initiated synchronized video collection by the XROMM system (Brainerd et al., 2010) and two visible light cameras, each described in further detail below. We retained all trials that captured right-handed reaches. Marmoset TY produced four useful reaching events containing 5 total reaches and marmoset PT produced 13 reaching events containing 17 reaches.
Bi-planar X-ray sources and image intensifiers (90kV, 25mA at 200 fps) were used to track the 3D position of radiopaque tantalum beads (0.5-1 mm, Bal-tec) placed subcutaneously in the arm, hand, and torso. Details of bead implants can be found in Walker et al. (2020), in which the authors also report estimating XROMM marker tracking precision of 0.06 mm based on the standard deviation of inter-marker distances during a recording of a calibration specimen. Marker locations were chosen to approximate the recommendations given by the International Society of Biomechanics for defining coordinate systems of the upper limb and torso in humans (Wu et al., 2005). These recommendations were adapted to the marmoset and constrained by surgical considerations. Positions of 13 beads were tracked using a semi-automated process in XMALab (Knorlein et al., 2016) following the procedure described there and in the XMALab User Guide (https://bitbucket.org/xromm/xmalab/wiki/Home). Two beads implanted in the anterior torso were ignored for comparison with DLC because corresponding positions on the skin were occluded in nearly every frame captured by visible light cameras.
Two high-speed cameras (FLIR Blackfly S, 200 fps, 1440x1080 resolution) were used to record video for analysis by DLC. The cameras were positioned to optimize visibility of the right upper limb during reaching behavior in the foraging apparatus and to minimize occlusions, while avoiding the path between the X-ray sources and image intensifiers (Fig. 1A). The cameras were triggered to record continuous images between the onset and offset of the manual XROMM trigger, with series of images later converted to video for DLC processing. All videos were brightened using the OpenCV algorithm for contrast limited adaptive histogram equalization (CLAHE) prior to labeling. We labeled 11 body parts in DLC – two labels on the torso and three on each of the upper arm, forearm, and hand (Fig. 1B). Locations of each label were chosen to be as close as possible to the approximate location of XROMM beads, although concessions had to be made to ensure the location was not occluded consistently in the recordings. We used DLC 2.2 with in-house modifications to produce epipolar lines in image frames that were matched between the two cameras (Fig. 1C), which significantly improved human labeling accuracy by correcting gross errors and fine-tuning minor errors. We did not train a network on labels produced without the aid of epipolar lines and therefore cannot evaluate 3D error reduction using epipolar lines. However, we note that labels applied without epipolar lines on the torso were grossly inaccurate – these labels were adjusted by an average of 63 pixels and 57 pixels in camera-1 and camera-2, respectively, after implementation. The other nine labels were adjusted by an average of <1 pixel in camera-1 and 11 pixels in camera-2. This modification has been added as a command line feature in the DLC package (a guide for using epipolar lines can be found at https://deeplabcut.github.io/DeepLabCut/docs/HelperFunctions.html). 
Aside from this and related changes to the standard DLC process, we followed the steps outlined in Nath et al. (2019).
In the first labeling iteration we extracted 100 total frames (50/camera) across the four events for marmoset TY and 254 frames (127/camera) across seven of the 13 events for marmoset PT, which produced a labeled dataset of 354 frames. These were chosen manually to avoid wasting time labeling frames before and after reaching bouts during which much of the marmoset forelimb was entirely occluded in the second camera angle. An additional 202 frames (101/camera) were extracted using the DLC toolbox with outliers identified by the ‘jump’ algorithm and frame selection by k-means clustering. We chose the number of frames to extract for each video based on visual inspection of labeling quality and chose the start and stop parameters to extract useful frames that captured reaching bouts. In all cases, frame numbers of extracted frames were matched between cameras to enable the use of epipolar lines. This refinement step reduced error by 0.046 cm and increased the percent of frames tracked by 14.7% after analysis with the chosen Anipose parameters. The final dataset consisted of 278 human-labeled timepoints from 15 of the 17 events and 10,253 timepoints from all 17 events labeled by the network only.
We used the default resnet-50 architecture for our networks with default image augmentation. We trained 3 shuffles of the first labeling iteration with a 0.95 training set fraction and used the first shuffle for the label refinement discussed above. We trained 15 total networks after one round of label refinement – three shuffles each with training fractions of 0.3, 0.5, 0.7, 0.85, and 0.95. Each network was trained for 300,000 iterations starting from the default initial weights. We evaluated each network every 10,000 iterations and selected the snapshot that produced the minimum test error across all labels for further analysis.
We chose the network to use in subsequent analyses by finding the smallest training set size that reached the threshold of human labeling error (discussed next). We then chose the median-performing network of the three shuffles at this training set size for all further analysis.
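The selection rule above can be sketched as follows. This is a minimal illustration, not the code used for the dataset: the function name `choose_network`, the error values, and the human-error threshold are hypothetical, and the sketch assumes that a training set size "reaches" the threshold when its median-performing shuffle does.

```python
def choose_network(test_errors, human_error):
    """Pick (training fraction, shuffle index) for further analysis.

    test_errors maps each training fraction to a list of per-shuffle
    test errors. Assumption: a fraction qualifies when the error of its
    median-performing shuffle is at or below the human labeling error.
    """
    for fraction in sorted(test_errors):
        errors = test_errors[fraction]
        # Rank shuffles by error and take the middle one (upper middle
        # if the number of shuffles is even).
        ranked = sorted(range(len(errors)), key=lambda i: errors[i])
        median_shuffle = ranked[len(ranked) // 2]
        if errors[median_shuffle] <= human_error:
            return fraction, median_shuffle
    return None  # no training set size reached the threshold


# Hypothetical test errors (pixels) for three shuffles per fraction.
example_errors = {
    0.3: [3.1, 2.9, 3.4],
    0.5: [2.6, 2.8, 2.7],
    0.7: [2.2, 2.0, 2.4],
    0.85: [2.1, 1.9, 2.0],
    0.95: [1.8, 2.0, 1.9],
}
selected = choose_network(example_errors, human_error=2.5)
```

With these placeholder numbers the smallest qualifying fraction is 0.7, and the first shuffle (index 0) is its median performer.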
Human Labeling Error
We selected 134 frames (67/camera) across three events from the same marmoset and session to be relabeled by the original, experienced human labeler and by a second, less experienced labeler. We used the error between the new and original labels to evaluate whether the networks reached asymptotic performance, defined by the experienced human labeling error.
A custom calibration device was built to allow for calibration in both recording domains (Knorlein et al., 2016; the instruction manual for the small Lego cube is located in the XMALab BitBucket). The device was constructed to contain a three-dimensional grid of steel beads within the structure and a two-dimensional grid of white circles on one face of the cube. Calibration of x-ray images was computed in XMALab and calibration of visible light images was computed with custom code using OpenCV. This integrated calibration device, along with the PCA-based alignment procedure described below, ensures that DLC and XROMM tracked trajectories in a common 3D coordinate system. DLC videos were accurately calibrated, with 0.42 pixels and 0.40 pixels of intrinsic calibration error for camera-1 and camera-2, respectively, and 0.63 pixels of stereo reprojection error. XROMM calibration was similarly accurate, with average intrinsic calibration error equal to 0.81 pixels and 1.38 pixels for the two cameras.
Trajectory processing with Anipose
We used Anipose to analyze videos, filter in 2D, triangulate 3D position from 2D trajectories, and apply 3D filters (see Karashchuk et al., 2021 for details). For 2D filtering, we chose to apply a Viterbi filter followed by an autoencoder filter because the authors demonstrate this to be the most accurate combination of 2D filters. For triangulation and 3D filtering, we enabled optimization during triangulation and enabled spatial constraints for each set of three points on the hand, forearm, and upper arm, and for the pair of points on the torso. We identified six Anipose parameters and one post-processing parameter that may affect the final accuracy of DLC+Anipose tracking and ran a parameter sweep to find the optimal combination. In 2D filtering, we varied the number of bad points that could be back-filled into the Viterbi filter (“n-back”) and the offset threshold beyond which a label was considered to have jumped from the filter. We varied four parameters in 3D processing: the weight applied to spatial constraints (“scale_length”), a smoothing factor (“scale_smooth”), the reprojection error threshold used during triangulation optimization, and the score threshold used as a cutoff for 2D points prior to triangulation. We also varied our own post-processing reprojection error threshold that filtered the outputs of DLC+Anipose. We tested 3,456 parameter combinations in total, the details of which will be discussed below. We generally chose parameter values centered around those described in the Anipose documentation and in Karashchuk et al. (2021).
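A parameter sweep of this kind amounts to enumerating the Cartesian product of the candidate values. The sketch below is illustrative only: the dictionary keys are hypothetical labels for the seven parameters named above, and every value is a placeholder except the two post-processing thresholds (10 and 20 pixels), so the grid does not reproduce the 3,456 combinations that were tested.

```python
from itertools import product

# Hypothetical grid. Keys are illustrative labels, not confirmed Anipose
# configuration keys; values are placeholders except the last entry.
grid = {
    "n_back": [1, 3],
    "offset_threshold": [10, 20],
    "scale_length": [2, 4],
    "scale_smooth": [2, 4],
    "reproj_error_threshold": [5, 10],
    "score_threshold": [0.3, 0.6],
    "post_reproj_threshold": [10, 20],
}


def sweep(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


combinations = sweep(grid)
```

Each element of `combinations` can then be passed to whatever routine runs the filtering and triangulation pipeline and records median error and percent of frames tracked.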
Post-processing of DLC+Anipose trajectories
To process the 3D pose outputs from Anipose, we first used the reprojection error between cameras provided by Anipose to filter out obviously bad frames. We tested two thresholds, 10 and 20 pixels, for 15 of 17 events. We tested much higher thresholds, 25 and 35 pixels, for the final two events of 2019-04-14 because the calibration was poor in these events – we suspect one of the cameras was bumped prior to these events. Next, we deleted brief segments of five or fewer frames and stitched together longer segments separated by fewer than 30 frames. Importantly, we did not have to do any further interpolation to stitch segments together, as Anipose produces a continuous 3D trajectory. Together, these steps remove portions of trajectories captured when the marmoset was chewing or otherwise disengaged from the foraging task and outside of the usable region of interest in camera-2 and combined segments during foraging bouts that were separated only by brief occlusions or minor tracking errors. All steps were performed independently for each label and event.
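The thresholding, pruning, and stitching steps can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code used for the dataset: the function names are hypothetical, input is a per-frame list of reprojection errors for one label in one event, and frames at or below the threshold are treated as good.

```python
def find_segments(good):
    """Return (start, stop) pairs, stop exclusive, for runs of True."""
    segments, start = [], None
    for i, flag in enumerate(good):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(good)))
    return segments


def clean_segments(reproj_error, threshold=20, short_len=5, max_gap=30):
    """Threshold frames, drop brief segments, then stitch nearby segments."""
    # NaN errors compare as False and are therefore treated as bad frames.
    good = [e <= threshold for e in reproj_error]
    # Delete segments of short_len (five) or fewer frames.
    kept = [(a, b) for a, b in find_segments(good) if b - a > short_len]
    # Stitch segments separated by fewer than max_gap (30) frames.
    merged = []
    for a, b in kept:
        if merged and a - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], b)
        else:
            merged.append((a, b))
    return merged
```

Because Anipose outputs a continuous 3D trajectory, a stitched segment such as `(0, 30)` can be read directly from the trajectory without interpolating across the gap.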
DLC labels could not be applied to the upper limb and torso in spots corresponding exactly to XROMM bead locations because those locations would often be obstructed from view by the marmoset’s own body in one of the camera angles. We therefore applied labels as close as possible to the correct spots and subtracted the average position from each label and bead during post-processing. This removes a constant offset that should not be included in the DLC error calculations.
Despite our best efforts to place DLC and XROMM in the same 3D coordinate system through the calibration process described above, we found the two systems to be slightly misaligned. To fix this, we computed the three principal components across good frames for all DLC+Anipose labels and separately for all XROMM markers, then projected the mean-subtracted DLC+Anipose and XROMM trajectories onto their respective principal components. We found that this brought the coordinate systems into close alignment, such that we could no longer identify any systematic error that could be attributed to misalignment.
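The mean subtraction and PCA projection can be sketched with NumPy as below. This is a minimal sketch under assumptions: the function names are hypothetical, each input is an (n_frames, 3) array of good frames, and the sign-matching step is an addition of this example (principal axes are only defined up to sign) rather than something described above.

```python
import numpy as np


def pca_project(points):
    """Mean-subtract points and project them onto their own principal axes."""
    centered = points - points.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt.T


def match_signs(reference, other):
    """Flip columns of `other` so each correlates positively with `reference`.

    Assumption of this sketch: both arrays hold matched frames, so the
    column-wise dot product reveals a flipped axis.
    """
    signs = np.sign(np.sum(reference * other, axis=0))
    signs[signs == 0] = 1.0
    return other * signs
```

For example, `match_signs(pca_project(xromm_xyz), pca_project(dlc_xyz))` returns the DLC+Anipose points expressed in axes comparable to the projected XROMM points, where `xromm_xyz` and `dlc_xyz` are placeholder names for matched arrays.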
Finally, we found that there was a brief delay ranging from 0 to 10 frames between pedal-triggered onset of the XROMM event and the corresponding pedal-triggered TTL pulse initiating the start of the event for the FLIR cameras (and for the pulse ending the event). To adjust for the timing difference, we iterated over a range of possible sample shifts separately for each event to find the shift that minimized the mean absolute error between the DLC+Anipose and XROMM trajectory. We visually inspected each trajectory after the adjustment to ensure the shift was qualitatively accurate.
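The shift search can be sketched as a brute-force loop over candidate offsets. This is an illustrative sketch, not the code used for the dataset: the function name is hypothetical, and it searches both negative and positive shifts because the direction of the delay is an assumption of this example.

```python
import numpy as np


def best_shift(dlc, xromm, max_shift=10):
    """Find the frame shift minimizing mean absolute error.

    A positive shift pairs dlc[i] with xromm[i + shift]; a negative
    shift pairs dlc[i - shift] with xromm[i]. Inputs may be 1D or
    (n_frames, 3) arrays; NaN samples are ignored in the mean.
    """
    dlc = np.asarray(dlc, dtype=float)
    xromm = np.asarray(xromm, dtype=float)
    best, best_error = 0, np.inf
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            a, b = dlc, xromm[shift:]
        else:
            a, b = dlc[-shift:], xromm
        n = min(len(a), len(b))
        if n == 0:
            continue
        error = np.nanmean(np.abs(a[:n] - b[:n]))
        if error < best_error:
            best, best_error = shift, error
    return best, best_error
```

The returned shift should still be checked visually against the overlaid trajectories, as described above, since a minimum in mean absolute error does not by itself guarantee a correct alignment.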
Evaluation of DLC Performance
We computed the median and mean absolute error between matched trajectories from DLC+Anipose and XROMM for all body parts across all reaching events. We also computed the percent of motion tracked across all labels and all active segments of reaching events. To define active segments, we manually inspected the videos for the first and last frames in each event for which the marmoset was engaged in the task; as mentioned before, the position of camera-2 prevented accurate human labeling when the marmoset was positioned well behind the partition, and the vast majority of these frames were discarded by Anipose and in post-processing.
Since the error distributions are right-skewed with long tails of large errors, we use the median error to describe the center of each distribution and the Mann-Whitney U-Test to assess statistical significance. The P-values computed with this method are artificially low due to the large sample size (e.g. 27,630 samples for the three upper arm markers and 11,480 samples for the two torso markers), so we report the correlation effect size defined by the rank-biserial correlation to describe statistical differences between distributions. According to convention, we consider r < 0.20 to be a negligible effect (Cohen, 1992).
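The rank-biserial correlation follows directly from the Mann-Whitney U statistic as r = 2U/(n1 n2) - 1, where U counts the pairs in which a sample from the first group exceeds a sample from the second (ties count one half). The sketch below is a plain-Python illustration with a hypothetical function name, not the code used for the dataset.

```python
from bisect import bisect_left, bisect_right


def rank_biserial(x, y):
    """Rank-biserial correlation between two samples.

    Returns a value in [-1, 1]: +1 if every x exceeds every y,
    -1 if every x is below every y, 0 if neither group dominates.
    """
    y_sorted = sorted(y)
    u = 0.0
    for value in x:
        less = bisect_left(y_sorted, value)          # y values below x
        ties = bisect_right(y_sorted, value) - less  # y values equal to x
        u += less + 0.5 * ties
    return 2.0 * u / (len(x) * len(y)) - 1.0
```

Under the convention stated above, an absolute value below 0.20 would be read as a negligible effect regardless of how small the accompanying P-value is.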
To determine which of the Anipose and post-processing parameters from the parameter sweep significantly affected either the median error or percent of frames tracked, we created two linear regression models using the seven parameters and a constant as independent variables and either error or percent tracked as the dependent variable. We tested the effect of individual parameters by calculating the log likelihood ratio Chi-squared test statistic (LR) between the full model and each nested model created by leaving one parameter out at a time (such that each nested model had a constant term and six parameter terms). We computed the p-value of each comparison using a Chi-squared test with two degrees of freedom.
We also created a full interaction model with the seven individual parameter terms and all possible first-order interaction terms. We tested the significance of each term by the same method.
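The leave-one-out likelihood ratio comparison can be sketched as follows. This is a minimal sketch under assumptions, not the code used for the dataset: function names are hypothetical, models are fit by ordinary least squares with a Gaussian likelihood, and the p-value uses the closed form exp(-LR/2), which is the Chi-squared survival function for exactly two degrees of freedom.

```python
import math

import numpy as np


def gaussian_log_likelihood(design, y):
    """Maximized Gaussian log likelihood of an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    n = len(y)
    return -0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)


def leave_one_out_lr(params, y):
    """Return (LR, p) for each column of params dropped from the full model.

    params is an (n_samples, n_parameters) array without a constant
    column; a constant term is added to every model here.
    """
    n = len(y)
    full = np.column_stack([np.ones(n), params])
    ll_full = gaussian_log_likelihood(full, y)
    results = []
    for j in range(params.shape[1]):
        nested = np.delete(full, j + 1, axis=1)  # drop one parameter column
        lr = 2.0 * (ll_full - gaussian_log_likelihood(nested, y))
        lr = max(lr, 0.0)  # guard against tiny negative round-off
        # Chi-squared survival function; this form holds only for 2 dof.
        p_value = math.exp(-lr / 2.0)
        results.append((lr, p_value))
    return results
```

The same function applies to the interaction model if the interaction terms are appended as extra columns of `params`.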
Please see README.txt for usage notes.
National Institutes of Health, Award: R01NS104898
National Institutes of Health, Award: 1F31NS118950-01
National Science Foundation, Award: MRI1338036
National Science Foundation, Award: MRI1626552