## Peromyscus pup vocal evolution Dataset

---

### How to use

This dataset contains raw audio recordings and processed data used to perform analyses and generate figures from Jourjine et al., Current Biology 2023.

The files are stored in compressed directories. To uncompress them, double-click on them or run the following in the command line:

`tar -xvf full/path/to/file.tar.gz`

The contents of these directories are described below. Please see the GitHub repository `https://github.com/nickjourjine/peromyscus-pup-vocal-evolution` for code and instructions about how to use them to reproduce analyses and figures.

### Two-letter codes

We use the following two-letter codes as shorthand to refer to each taxon we analyze:

|code|taxon|
|---|---|
|BW|*P. maniculatus bairdii*|
|BK|*P. maniculatus gambelii*|
|SW|*P. maniculatus rubidus*|
|NB|*P. maniculatus nubiterrae*|
|PO|*P. polionotus subgriseus*|
|LO|*P. polionotus leucocephalus*|
|GO|*P. gossypinus*|
|LL|*P. leucopus*|
|MU|*Mus musculus domesticus* (C57BL/6J)|
|MZ|*Mus musculus domesticus* (wild)|

### Audio datasets

Four sets of recordings constitute the raw data: development, cross foster, F1, and F2. Because of file upload limitations, the development and F2 datasets are split into parts:

- All directories with the prefix "bwpof2" belong to the F2 dataset (six directories, approximately 100 recordings per directory, one .wav file per recorded pup).
- All directories with the prefix "development" belong to the development dataset (ten directories, one directory per taxon, one .wav file per recorded pup).

The easiest way to use the development and F2 datasets is to uncompress each of these directories and collect all of the .wav files for each dataset into its own directory (i.e., one for all of the development wav files and one for all of the F2 wav files).
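The collection step described above can be scripted. The following is a minimal Python sketch, not part of the published code: the function name `extract_and_collect` and the example paths are hypothetical, and it assumes all archives of a dataset sit in one directory and share the prefix given above ("bwpof2" or "development").

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def extract_and_collect(archive_dir, prefix, out_dir):
    """Extract every '<prefix>*.tar.gz' archive found in archive_dir and copy
    the .wav files it contains into one flat directory, out_dir.

    Hypothetical helper for illustration; only use it on archives you trust.
    """
    archive_dir, out_dir = Path(archive_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    collected = []
    for archive in sorted(archive_dir.glob(f"{prefix}*.tar.gz")):
        # Extract into a scratch directory that is deleted afterwards.
        with tempfile.TemporaryDirectory() as scratch:
            with tarfile.open(archive, "r:gz") as tar:
                tar.extractall(scratch)
            for wav in sorted(Path(scratch).rglob("*.wav")):
                if wav.name.startswith("."):
                    continue  # skip hidden metadata files some tools add
                dest = out_dir / wav.name
                shutil.copy2(wav, dest)
                collected.append(dest)
    return collected


# Example usage (paths are placeholders):
# f2_wavs = extract_and_collect("downloads", "bwpof2", "f2_wavs")
# dev_wavs = extract_and_collect("downloads", "development", "development_wavs")
```

Copying rather than moving leaves the downloaded archives untouched, so the step can be repeated safely.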
All of the directories containing unprocessed raw audio are described in the table below:

|file|dataset|file type(s)|number of files|associated main figure(s)|description|
|---|---|---|---|---|---|
|developmentBK.tar.gz|development|.wav|98|1,2|Audio recordings of isolation-induced *P. maniculatus gambelii* pup vocalizations between postnatal days 1 and 13 (day of birth = day 0)|
|developmentBW.tar.gz|development|.wav|80|1,2|Audio recordings of isolation-induced *P. maniculatus bairdii* pup vocalizations between postnatal days 1 and 13 (day of birth = day 0)|
|developmentNB.tar.gz|development|.wav|72|1,2|Audio recordings of isolation-induced *P. maniculatus nubiterrae* pup vocalizations between postnatal days 1 and 13 (day of birth = day 0)|
|developmentSW.tar.gz|development|.wav|73|1,2|Audio recordings of isolation-induced *P. maniculatus rubidus* pup vocalizations between postnatal days 1 and 13 (day of birth = day 0)|
|developmentPO.tar.gz|development|.wav|76|1,2|Audio recordings of isolation-induced *P. polionotus subgriseus* pup vocalizations between postnatal days 1 and 13 (day of birth = day 0)|
|developmentLO.tar.gz|development|.wav|66|1,2|Audio recordings of isolation-induced *P. polionotus leucocephalus* pup vocalizations between postnatal days 1 and 13 (day of birth = day 0)|
|developmentGO.tar.gz|development|.wav|68|1,2|Audio recordings of isolation-induced *P. gossypinus* pup vocalizations between postnatal days 1 and 13 (day of birth = day 0)|
|developmentLL.tar.gz|development|.wav|63|1,2|Audio recordings of isolation-induced *P. leucopus* pup vocalizations between postnatal days 1 and 13 (day of birth = day 0)|
|developmentMU.tar.gz|development|.wav|116|1,2|Audio recordings of isolation-induced *Mus musculus domesticus* (C57BL/6J) pup vocalizations between postnatal days 1 and 13 (day of birth = day 0)|
|developmentMZ.tar.gz|development|.wav|111|1,2|Audio recordings of isolation-induced *Mus musculus domesticus* (wild) pup vocalizations between postnatal days 1 and 13 (day of birth = day 0)|
|bw_po_cf.tar.gz|cross foster|.wav|58|4|Audio recordings of isolation-induced *P. maniculatus bairdii* and *P. polionotus subgriseus* pup vocalizations at postnatal day 9, from pups either raised by their own parents or cross fostered|
|bw_po_f1.tar.gz|F1 recordings|.wav|119|5|Audio recordings of isolation-induced pup vocalizations from first generation hybrids between *P. maniculatus bairdii* and *P. polionotus subgriseus*|
|bwpof2-1.tar.gz|F2 recordings|.wav|100|5|Audio recordings of isolation-induced pup vocalizations from second generation hybrids between *P. maniculatus bairdii* and *P. polionotus subgriseus*|
|bwpof2-2.tar.gz|F2 recordings|.wav|100|5|Audio recordings of isolation-induced pup vocalizations from second generation hybrids between *P. maniculatus bairdii* and *P. polionotus subgriseus*|
|bwpof2-3.tar.gz|F2 recordings|.wav|100|5|Audio recordings of isolation-induced pup vocalizations from second generation hybrids between *P. maniculatus bairdii* and *P. polionotus subgriseus*|
|bwpof2-4.tar.gz|F2 recordings|.wav|100|5|Audio recordings of isolation-induced pup vocalizations from second generation hybrids between *P. maniculatus bairdii* and *P. polionotus subgriseus*|
|bwpof2-5.tar.gz|F2 recordings|.wav|100|5|Audio recordings of isolation-induced pup vocalizations from second generation hybrids between *P. maniculatus bairdii* and *P. polionotus subgriseus*|
|bwpof2-6.tar.gz|F2 recordings|.wav|117|5|Audio recordings of isolation-induced pup vocalizations from second generation hybrids between *P. maniculatus bairdii* and *P. polionotus subgriseus*|

### File name conventions

Each wav file name in the above directories consists of fields describing the pup whose vocalizations the file contains, separated by underscores ('_'). These fields are described in the tables below, where index refers to the position in the list generated by splitting the file name on the '_' character (e.g., using the split() method in Python).

#### Developmental data set file naming conventions

|index|description|
|---|---|
|0|species (using 2-letter code described in table above)|
|1|ID of the litter's dam and sire in the format 'damIDxsireID'|
|2|litter number from this dam and sire (this value is approximate and may not be accurate for every pup)|
|3|pup number from the litter (order in which pups were removed from their home cage for recording, using the convention that the first pup removed is pup1)|
|4|microphone channel used to record the pup|
|5|weight of the pup in milligrams|
|6|sex of the pup determined by anogenital distance (m=male, f=female)|
|7|temperature of the pup in degrees C immediately before recording (multiplied by 10 to avoid introducing '.')|
|8|temperature of the pup in degrees C immediately after recording (multiplied by 10 to avoid introducing '.')|
|9|whether or not the pup had to be removed from the dam by the experimenter (fr0=no, fr1=yes; fr stands for 'forcibly removed' while suckling)|
|10|age of the pup in days in the format 'p#' where # is the age counting day of birth as day 0|
|11|date of the recording in the format yyyy-mm-dd|
|12|time of the recording in the format hh-mm-ss|

#### Cross foster data set file naming conventions

|index|description|
|---|---|
|0|species (using 2-letter code described in table above). CF-BW indicates a BW pup fostered by PO. CF-PO indicates a PO pup fostered by BW.|
|1|if the pup was not cross fostered, this is the ID of the litter's dam and sire in the format 'damIDxsireID'. If cross fostered, it is the ID of the litter's dam and sire in the format 'damIDxsireID' followed by a '-' then the ID of the foster dam and sire in the format 'cfdamID-cfsireID'|
|2|pup number from the litter (order in which pups were removed from their home cage for recording, using the convention that the first pup removed is pup1)|
|3|age of the pup in days in the format 'p##' where ## is the age counting day of birth as day 0 (all are p09, i.e., postnatal day 9)|
|4|weight of the pup in milligrams|
|5|sex of the pup determined by anogenital distance (m=male, f=female)|
|6|number of pups in the litter the pup came from|
|7|date of the recording in the format yyyy-mm-dd|
|8|time of the recording in the format hh-mm-ss|

#### F1 data set file naming conventions

|index|description|
|---|---|
|0|species, where BW-PO-cross-F1 indicates a first generation hybrid, cross-BW indicates *P. maniculatus bairdii*, and cross-PO indicates *P. polionotus subgriseus*|
|1|ID of the litter's dam and sire. If not F1, the format is 'damIDxsireID'. If F1, the format is 'damIDxsireID-family-N', where N is one of A, B, C, or D and indicates which of four independent crosses between *P. maniculatus bairdii* and *P. polionotus subgriseus* the F1 pup came from|
|2|litter number from this dam and sire (this value is approximate and may not be accurate for every pup)|
|3|pup number from the litter (order in which pups were removed from their home cage for recording, using the convention that the first pup removed is pup1)|
|4|microphone channel used to record the pup|
|5|weight of the pup in milligrams|
|6|sex of the pup determined by anogenital distance (m=male, f=female)|
|7|temperature of the pup in degrees C immediately before recording (multiplied by 10 to avoid introducing '.')|
|8|temperature of the pup in degrees C immediately after recording (multiplied by 10 to avoid introducing '.')|
|9|whether or not the pup had to be removed from the dam by the experimenter (fr0=no, fr1=yes; fr stands for 'forcibly removed' while suckling)|
|10|age of the pup in days in the format 'p#' where # is the age counting day of birth as day 0 (all are p9, i.e., postnatal day 9)|
|11|date of the recording in the format yyyy-mm-dd|
|12|time of the recording in the format hh-mm-ss|

#### F2 data set file naming conventions

|index|description|
|---|---|
|0|microphone channel used to record the pup. This is a copy of the information in index 6 added automatically by Avisoft recording software.|
|1|species, where BW-PO-cross-F2 indicates a second generation hybrid (all files in this dataset are the same at this index since they are all F2 pups)|
|2|ID of the litter's dam and sire in the format 'damIDxsireID'|
|3|family the pup came from in the format 'fam-N#' where N is the founder family its parents came from and # is the F1 breeding pair number from that family (e.g., the third F1 pair generated from founder pair A is fam-A3)|
|4|litter number from the dam and sire (this value is approximate and may not be accurate for every pup)|
|5|pup number from the litter (order in which pups were removed from their home cage for recording, using the convention that the first pup removed is pup1)|
|6|microphone channel used to record the pup|
|7|weight of the pup in milligrams|
|8|sex of the pup determined by anogenital distance (m=male, f=female)|
|9|temperature of the pup in degrees C immediately before recording (multiplied by 10 to avoid introducing '.')|
|10|temperature of the pup in degrees C immediately after recording (multiplied by 10 to avoid introducing '.')|
|11|whether or not the pup had to be removed from the dam by the experimenter (fr0=no, fr1=yes; fr stands for 'forcibly removed' while suckling)|
|12|age of the pup in days in the format 'p#' where # is the age counting day of birth as day 0 (all are p9, i.e., postnatal day 9)|
|13|date of the recording in the format yyyy-mm-dd|
|14|time of the recording in the format hh-mm-ss|

### Processed data

`processed_data.tar.gz` contains processed data (data tables and machine learning models) used to make figures. These files were generated using the code in the GitHub repository `https://github.com/nickjourjine/peromyscus-pup-vocal-evolution` and are organized into sub-directories, one for each main and supplemental figure.
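Once the archive is uncompressed, a quick way to see this organization is to list the files under each sub-directory. The sketch below is only an illustration: the function name `files_by_figure` is hypothetical, and the name of the extracted top-level directory (written here as `processed_data`) is an assumption rather than something stated in this README.

```python
from collections import defaultdict
from pathlib import Path


def files_by_figure(processed_root):
    """Map each sub-directory (relative to processed_root) to the sorted
    names of the files it directly contains. Hypothetical helper."""
    root = Path(processed_root)
    listing = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file() and not path.name.startswith("."):
            listing[path.parent.relative_to(root).as_posix()].append(path.name)
    return dict(listing)


# Example usage (the directory name is an assumption):
# for subdir, names in files_by_figure("processed_data").items():
#     print(subdir, len(names), "file(s)")
```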
Each of these directories is described in the table below, along with the notebook and markdown section at `https://github.com/nickjourjine/peromyscus-pup-vocal-evolution` that uses the data. Please refer to the README.md file in that repository for additional details about how to generate and use the data in these files.

|directory|file(s)|description|related notebooks (and markdown section)|
|---|---|---|---|
|figure_1/umap_embeddings|all_species_HDBSCAN_labels.csv|data table of umap embedding cluster labels where each row is a vocalization and columns correspond to vocalization wav file name ('source_file'), HDBSCAN cluster ('label'), and species (using 2-letter codes above) - used to make figure 1 panel C|Analyze Vocalizations.ipynb (sections 2.1 and 2.2)|
|figure_1/umap_embeddings|NN_embedding_coordinates.feather (10 files, one per taxon)|where NN is one of the 10 2-letter species codes above - these are tables where each row is a linearized spectrogram (one per vocalization) and columns are pixel numbers and umap embedding coordinates - used to make figure 1 panel C|Analyze Vocalizations.ipynb (sections 2.3, 2.4, and 2.5)|
|figure_1/acoustic_features|all_species_warbler_features.csv|data table of acoustic features used to generate figure 1 panels D, E, and F where each row corresponds to a vocalization and columns are features|Analyze Vocalizations.ipynb (sections 2.3, 2.4, and 2.5)|
|figure_1/acoustic_features|all_noise_floors.csv|data table of spectrogram pixel values defining the threshold for background noise for each vocalization in the development dataset, generated by the Spectrogramming and UMAP.ipynb notebook|Analyze Vocalizations.ipynb (sections 2.2 and 3.4)|
|figure_1/acoustic_features|data_set_recording_lengths.json|where data_set is one of bw_po_cf (cross foster dataset), bw_po_f1 (F1 dataset), bw_po_f2 (F2 dataset), or development (development dataset) - dictionaries of recording lengths for each recording, used to determine vocalization rates without recalculating recording lengths each time|Analyze Vocalizations.ipynb (section 3)|
|figure_2|annotated_vocalizations.csv|data table of the vocalizations annotated in the Annotate from UMAP.ipynb notebook where each row is a vocalization and columns are vocalization wav file name ('source_file'), umap embedding coordinates ('umap1' and 'umap2'), hdbscan label ('hdbscan_label'), annotated label ('human_label'), and species|Train Models on Features.ipynb (sections 2, 3, and 4); Analyze Vocalizations.ipynb (section 3.6)|
|figure_2|development_vocalizations_clipping_levels.csv|data table of clipping levels for the development dataset calculated in Segmentation and UMAP.ipynb - each row is a vocalization and columns are vocalization wav file name ('source_file'), percent audio that is clipped ('percent_clipped'), and clipping threshold ('clipping_threshold')|Analyze Vocalizations.ipynb (section 3)|
|figure_2|figure_2A_data.csv|data table where each row is a vocalization and columns are acoustic features and annotated labels (cry or USV) - used to train the models evaluated in figure 2A|Train Models on Features.ipynb (sections 3.1, 3.2, and 3.3)|
|figure_2|figure_2B_data.csv|data table of performance metrics for random forest models trained on varying numbers of vocalizations from each taxon - used to train the models evaluated in figure 2B|Train Models on Features.ipynb (sections 3.5 and 3.6)|
|figure_2|random_forest_model_cry.pkl|random forest model evaluated in figure 2A, left|Train Models on Features.ipynb (section 3.4)|
|figure_2|random_forest_model_USV.pkl|random forest model evaluated in figure 2A, right|Train Models on Features.ipynb (section 3.4)|
|figure_2|figure2CD_pups_data.csv|data table where each row is a pup and columns are aggregate acoustic features - used to generate the vocalization rate panels in figure 2 panels C and D|Analyze Vocalizations.ipynb (section 3.7)|
|figure_2|figure2CD_vocs_data.csv|data table where each row is a vocalization and columns are acoustic features - used to generate the duration and mean frequency panels in figure 2 panels C and D|Analyze Vocalizations.ipynb (section 3.7)|
|figure_3|playback_data.csv|data table where each row is a dam and columns are descriptive statistics of dam behavior during audio playback of cries and USVs|Analyze Playback.ipynb (sections 3 and 4)|
|figure_4|all_bw_po_cf_clipping.csv|data table of clipping levels for the cross foster dataset calculated in Segmentation and UMAP.ipynb - each row is a vocalization and columns are vocalization wav file name ('source_file'), percent audio that is clipped ('percent_clipped'), and clipping threshold ('clipping_threshold')|Analyze Vocalizations.ipynb (section 4)|
|figure_4|figure4_pups_data.csv|data table of acoustic features aggregated by pup where each row is a pup and columns are features - used to generate figure 4 panels B and D|Analyze Vocalizations.ipynb (section 4)|
|figure_4|figure4_pups_cry_pca.csv|data table of cry acoustic features and PCA coordinates - used to generate figure 4 panel C|Analyze Vocalizations.ipynb (section 4)|
|figure_4|figure4_pups_USV_pca.csv|data table of USV acoustic features and PCA coordinates - used to generate figure 4 panel E|Analyze Vocalizations.ipynb (section 4)|
|figure_5|all_bw_po_f1_clipping.csv|data table of clipping levels for vocalizations in the F1 dataset calculated in Segmentation and UMAP.ipynb - each row is a vocalization and columns are vocalization wav file name ('source_file'), percent audio that is clipped ('percent_clipped'), and clipping threshold ('clipping_threshold')|Analyze Vocalizations.ipynb (section 5)|
|figure_5|all_bw_po_f2_clipping.csv|data table of clipping levels for vocalizations in the F2 dataset calculated in Segmentation and UMAP.ipynb - each row is a vocalization and columns are vocalization wav file name ('source_file'), percent audio that is clipped ('percent_clipped'), and clipping threshold ('clipping_threshold')|Analyze Vocalizations.ipynb (section 5)|
|figure_5|figure5_pups_data.csv|data table of acoustic features aggregated by pup where each row is a pup and columns are features - used to generate figure 5 panels B, D, F, G, and H|Analyze Vocalizations.ipynb (sections 5.4 and 5.5)|
|figure_5|figure5_pups_cry_pca.csv|data table of cry acoustic features and PCA coordinates - used to generate figure 5 panel C|Analyze Vocalizations.ipynb (section 5.4)|
|figure_5|figure5_pups_USV_pca.csv|data table of USV acoustic features and PCA coordinates - used to generate figure 5 panel E|Analyze Vocalizations.ipynb (section 5.4)|
|supplemental_figure_1/umap_embeddings|NN_embedding_coordinates_labeled.feather (10 files, one per taxon)|where NN is one of the 10 2-letter species codes above - these are copies of the files in figure_1/umap_embeddings but with a column for the label given by hdbscan in the Annotate from UMAP.ipynb notebook - used to generate supplemental figure 1|Analyze Vocalizations.ipynb (section 6)|
|supplemental_figure_1/acoustic_features|NNwarbler_features.csv (10 files, one per taxon)|where NN is one of the 10 2-letter species codes above - these are copies of the data in figure_1/acoustic_features/all_species_warbler_features.csv but split up into one csv per species - used to generate supplemental figure 1|Analyze Vocalizations.ipynb (section 6)|
|supplemental_figure_1/acoustic_features|all_development_SPL.csv|data table where each row is a vocalization and columns are vocalization wav file name ('source_file'), species, and sound pressure level (SPL) calculated with warbleR - used to generate supplemental figure 1|Analyze Vocalizations.ipynb (section 6)|
|supplemental_figure_2|supplement_figure2_cry_vocs_data.csv|data table of cry acoustic features and PCA coordinates - used to generate supplemental figure 2 panels B and D|Analyze Vocalizations.ipynb (section 3.8)|
|supplemental_figure_2|supplement_figure2_USV_vocs_data.csv|data table of USV acoustic features and PCA coordinates - used to generate supplemental figure 2 panels C and E|Analyze Vocalizations.ipynb (section 3.8)|
|supplemental_figure_3|nonvocal_acoustic_features.csv|data table of nonvocal sounds (one per row) and acoustic features (columns) used to generate supplemental_figure_3_data.csv|Train Models on Features.ipynb (section 4); Analyze Vocalizations.ipynb (section 3.4)|
|supplemental_figure_3|supplemental_figure_3_data.csv|data table of vocalizations (one per row) and acoustic features (columns) used to train the random forest model for predicting 'cry' and 'USV' labels in figures 2, 4, and 5|Train Models on Features.ipynb (section 4); Analyze Vocalizations.ipynb (section 3.4)|
|supplemental_figure_3|random_forest_voc_type_model.pkl|random forest model trained on the data in supplemental_figure_3_data.csv|Train Models on Features.ipynb (section 4); Analyze Vocalizations.ipynb (section 3.4)|
|supplemental_figure_4|figure2CD_pups_data.csv|this is a copy of the data table described above (figure_2/figure2CD_pups_data.csv) - used to generate supplemental figure 4|Analyze Vocalizations.ipynb (section 7)|
|supplemental_figure_4|figure2CD_vocs_data.csv|this is a copy of the data table described above (figure_2/figure2CD_vocs_data.csv) - used to generate supplemental figure 4|Analyze Vocalizations.ipynb (section 7)|
|supplemental_figure_5|all_development_vocs_with_predictions.csv|data table of vocalizations in the development dataset where each row is a vocalization and columns are acoustic features and the label predicted by the model supplemental_figure_3/random_forest_voc_type_model.pkl - used to generate supplemental_figure_5_data.csv|Analyze Vocalizations.ipynb (section 8)|
|supplemental_figure_5|all_development_vocs_with_start_stop_times.csv|data table of vocalizations in the development dataset where each row is a vocalization and columns are the wav file the vocalization came from ('source_file'), its start and stop time in that wav file ('start_seconds' and 'stop_seconds'), and species - used to generate supplemental_figure_5_data.csv|Analyze Vocalizations.ipynb (section 8)|
|supplemental_figure_5|supplemental_figure_5_data.csv|data table of interonset intervals for vocalizations in the development dataset - used to generate supplemental figure 5 panels B and C|Analyze Vocalizations.ipynb (section 8)|
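
The file name conventions described earlier in this README can be applied with a small parser. The sketch below handles the 13-field development (and F1) layout only. The field names, the function name `parse_development_name`, and the example file name are all made up for illustration. Because the exact text of the weight and temperature fields is not specified here, those fields are left as raw strings rather than converted.

```python
from datetime import datetime
from pathlib import Path

# Field order follows the development data set table (indices 0-12).
# These key names are illustrative, not taken from the repository.
DEVELOPMENT_FIELDS = [
    "species", "parents", "litter", "pup", "channel", "weight_mg", "sex",
    "temp_before_x10", "temp_after_x10", "forcibly_removed", "age",
    "date", "time",
]


def parse_development_name(wav_name):
    """Split a development-style wav file name on '_' and label each field."""
    parts = Path(wav_name).stem.split("_")
    if len(parts) != len(DEVELOPMENT_FIELDS):
        raise ValueError(
            f"expected {len(DEVELOPMENT_FIELDS)} fields, got {len(parts)}: {wav_name}"
        )
    meta = dict(zip(DEVELOPMENT_FIELDS, parts))
    # Only convert fields whose format is stated explicitly in the tables.
    meta["age_days"] = int(meta["age"].lstrip("p"))           # 'p9' -> 9
    meta["forcibly_removed"] = meta["forcibly_removed"] == "fr1"
    meta["recorded_at"] = datetime.strptime(
        f"{meta['date']} {meta['time']}", "%Y-%m-%d %H-%M-%S"
    )
    return meta


# Hypothetical file name, constructed only to match the 13-field layout:
example = "BW_100x200_ltr1_pup1_ch1_3500_m_345_330_fr0_p9_2020-01-02_10-11-12.wav"
info = parse_development_name(example)
print(info["species"], info["age_days"], info["recorded_at"])
```

The cross foster (9 fields) and F2 (15 fields) layouts would need their own field lists, following the corresponding tables.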