Genomic landscape of introgression from the ghost lineage in a gobiid fish uncovers the generality of forces shaping hybrid genomes

Published Nov 06, 2023; Updated Dec 18, 2023 on Dryad. https://doi.org/10.5061/dryad.7wm37pw09

Data files

Nov 06, 2023 version files 4.12 GB

Dec 18, 2023 version files 4.12 GB

Abstract

Extinct lineages can leave legacies in the genomes of extant lineages through ancient introgressive hybridization. The patterns of genomic survival of these extinct lineages provide insight into the role of extinct lineages in current biodiversity. However, our understanding of the genomic landscape of introgression from extinct lineages remains limited due to challenges associated with locating the traces of unsampled “ghost” extinct lineages without ancient genomes. Herein, we conducted population genomic analyses on the East China Sea (ECS) lineage of Chaenogobius annularis, which was suspected to have originated from ghost introgression, with the aim of elucidating its genomic origins and characterizing its landscape of introgression. By combining phylogeographic analysis and demographic modeling, we demonstrated that the ECS lineage originated from ancient hybridization with an extinct ghost lineage. Forward simulations based on the estimated demography indicated that the statistic γ of the HyDe analysis can be used to distinguish differences in local introgression rates in our data. Consistent with introgression between extant organisms, we found reduced introgression from the extinct lineage in regions with low recombination rates and in regions of functional importance, suggesting that linked selection, by eliminating ancestry from the extinct lineage, has played a role in shaping the hybrid genome. Moreover, we identified an enrichment of repetitive elements in regions associated with ghost introgression, a pattern that was hitherto little known but was also observed in our reanalysis of published data on introgression between extant organisms. Overall, our findings underscore the unexpected similarities in the characteristics of introgression landscapes across different taxa, even in cases of ghost introgression.


Brief description of the data and file structure

scripts.tar.gz

Note: "scripts.tar.gz" is also available from Zenodo (https://doi.org/10.5281/zenodo.10048869), which is linked to this Dryad page (see the "Related works" section in the upper right corner). To obtain only the scripts, please use this Zenodo link.

For your convenience, "scripts.tar.gz" is also available directly under Data files on this page (added on 12/18/2023).

This archive contains the scripts used in this study (Bash, Python, R).
The scripts are grouped into the following 10 categories, each organized hierarchically within its own directory.
1. ddRAD-seq genotyping
2. WGS (whole genome resequencing) genotyping
3. repeats and gene annotation
4. population recombination rate estimation
5. potentially deleterious SNPs
6. population genetic analyses
7. phylogenetic analysis
8. hybrid detection
9. demographic estimation
10. introgression landscape characterization

The detailed hierarchical structure is given below in the section "Detailed description of the file structure".
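
As a minimal sketch of how the directory layout could be inspected without unpacking the archive, the Python snippet below lists the directories stored in a gzip-compressed tar file down to a chosen depth. It relies only on the standard-library "tarfile" module. The function name "directories_in_archive" and the "max_depth" parameter are illustrative assumptions and are not part of the deposited scripts.

```python
import tarfile


def directories_in_archive(archive_path, max_depth=2):
    """Return the sorted directory paths found in a .tar.gz archive.

    Only directories up to `max_depth` levels deep are reported.
    This is an illustrative helper, not one of the deposited scripts.
    """
    dirs = set()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Normalize names such as "./scripts/01.../run.sh".
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            # For anything that is not a directory, drop the file name itself.
            if not member.isdir():
                parts = parts[:-1]
            for depth in range(1, min(len(parts), max_depth) + 1):
                dirs.add("/".join(parts[:depth]))
    return sorted(dirs)
```

Calling directories_in_archive("scripts.tar.gz", max_depth=2), with the archive downloaded to the working directory, should list the top-level entries and the directories directly beneath them.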

genotyping_data.tar.gz

A compressed archive of directories containing the genotyping data generated in this study.
It contains five VCF files from RAD-seq, two VCF files from whole genome resequencing data, and one FASTA file of whole mitogenomic sequences.
The scripts used to generate these genotyping data are stored in "/01ddRADseq_genotyping/" and "/02WGS_genotyping/" in the scripts.tar.gz.
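
For a quick sanity check of a downloaded VCF file, the sample names and the number of variant records can be read with plain Python, as sketched below. The sketch assumes a standard tab-delimited VCF (optionally gzip-compressed) in which sample columns follow the nine fixed columns of the "#CHROM" header line. The function name "summarize_vcf" is hypothetical, and the file names inside the archive are not assumed here.

```python
import gzip


def summarize_vcf(path):
    """Return (sample_names, n_records) for a plain or gzip-compressed VCF.

    Illustrative helper only; it does not validate the file against the
    VCF specification.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    samples = []
    n_records = 0
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # Columns 1-9 are fixed (CHROM ... FORMAT); samples follow.
                samples = line.rstrip("\n").split("\t")[9:]
            elif line.strip():
                n_records += 1  # one data line per variant site
    return samples, n_records
```

The returned sample list can be compared against the sample sheet of the study, and the record count gives the number of sites retained after filtering.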

annotation_data.tar.gz

A compressed archive of directories containing the repeat annotation data and gene annotation data generated in this study.
The scripts used for the repeat and gene annotation are stored in "/03repeats_and_gene_annotation/" in the scripts.tar.gz.

demographic_modeling.tar.gz

A compressed archive of a directory containing the results of the demographic modeling (the distribution of AIC for each model and the maximum-likelihood parameters for the best model) and the input site frequency spectrum.
The scripts used for the demographic modeling are stored in "/09demographic_estimation/02demographic_modeling/" in the scripts.tar.gz.
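
For readers unfamiliar with AIC-based model comparison, the following sketch shows the general calculation, AIC = 2k - 2 ln L, where k is the number of free parameters and ln L is the maximized log-likelihood; lower values indicate better-supported models. The model names and numbers in the usage example are invented for illustration and do not come from the deposited results, and the function names are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def rank_models(fits):
    """Rank models by AIC.

    `fits` maps a model name to (maximized log-likelihood, number of
    free parameters). Returns a list of (name, AIC, delta AIC) tuples
    sorted from best (lowest AIC) to worst.
    """
    scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
    best = min(scores.values())
    ranked = [(name, score, score - best) for name, score in scores.items()]
    return sorted(ranked, key=lambda row: row[1])


# Invented example values, not results from this study.
example_fits = {
    "model_A": (-100.0, 3),
    "model_B": (-98.0, 6),
}
ranking = rank_models(example_fits)
```

In this invented example, model_A ranks first because its slightly lower likelihood is outweighed by the penalty for the three extra parameters in model_B.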

slidingwindow_results.tar.gz

A compressed archive of a directory containing the results of the sliding-window analysis (BED files summarizing the statistic γ from the HyDe analysis and some other features).
Please see "README_description_of_record_XXkb.txt" in the slidingwindow_results.tar.gz for column name descriptions.
The scripts used for this analysis are stored in "/10introgression_landscape_characterization/02sliding_window/" in the scripts.tar.gz.
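
The sliding-window tables can be summarized with a few lines of Python, as in the sketch below. It assumes a tab-delimited BED-like layout (chromosome, 0-based start, end, then further columns) and takes the position of the column of interest as an argument, because the actual column order is documented only in the README file inside the archive. The function names and the "value_column" parameter are hypothetical.

```python
import csv


def read_windows(handle, value_column):
    """Yield (chrom, start, end, value) from a tab-delimited BED-like stream.

    Rows whose value cannot be parsed as a number (for example a header
    row or an "NA" entry) are skipped. `value_column` is a 0-based index
    and must be looked up in the README; it is not assumed here.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith(("#", "track", "browser")):
            continue
        try:
            value = float(row[value_column])
        except ValueError:
            continue
        yield row[0], int(row[1]), int(row[2]), value


def mean_by_chrom(windows):
    """Average the window values separately for each chromosome or scaffold."""
    totals = {}
    for chrom, _start, _end, value in windows:
        running_sum, count = totals.get(chrom, (0.0, 0))
        totals[chrom] = (running_sum + value, count + 1)
    return {chrom: s / n for chrom, (s, n) in totals.items()}
```

A file opened with open(path, newline="") can be passed directly as the handle; the per-scaffold means then give a first impression of how the chosen statistic varies across the genome.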

Detailed description of the file structure

"scripts.tar.gz"

01ddRADseq_genotyping

"genotyping_data.tar.gz"

Filtered genotyping dataset used in this study.

01RADseq

"annotation_data.tar.gz"

01repeat_annotation
1. agohaze_sspace_x1.fa.masked.gz

"demographic_modeling.tar.gz"

01without_recent_size_change
1. sorted_AIC_dist_wo_recent.csv

"slidingwindow_results.tar.gz"

README_description_of_record_XXkb.txt

Sharing/Access information

Sequence data used in this study are available from DDBJ (accession numbers: DRR174909, DRR175781–DRR175796, DRR175830–DRR175860, DRR175876–DRR175955, and DRR489922–DRR490073 for ddRAD-seq; DRR489903–DRR489921 for whole genome resequencing; and DRR490074–DRR490084 for RNA-seq).
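
Because the accessions are given as ranges, a download script usually needs them expanded into individual run identifiers first. The sketch below shows one way this could be done in Python; it assumes that each range shares a single letter prefix and that the numeric part keeps its width. The function name "expand_accessions" is hypothetical and is not part of the deposited scripts.

```python
import re

# An accession is a letter prefix followed by digits, e.g. DRR175781.
ACCESSION = re.compile(r"([A-Z]+)(\d+)")


def expand_accessions(spec):
    """Expand a comma-separated list of accessions and ranges.

    Ranges may use a hyphen or an en dash, e.g. "DRR175781-DRR175796".
    """
    accessions = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        ends = [ACCESSION.fullmatch(part.strip())
                for part in re.split(r"[-\u2013]", token)]
        if len(ends) > 2 or any(match is None for match in ends):
            raise ValueError(f"unrecognized accession or range: {token!r}")
        first, last = ends[0], ends[-1]
        prefix, width = first.group(1), len(first.group(2))
        low, high = int(first.group(2)), int(last.group(2))
        if last.group(1) != prefix or high < low:
            raise ValueError(f"inconsistent range: {token!r}")
        for number in range(low, high + 1):
            accessions.append(f"{prefix}{number:0{width}d}")
    return accessions


# The whole genome resequencing range listed above.
wgs_runs = expand_accessions("DRR489903\u2013DRR489921")
```

The resulting list can be written to a text file, one accession per line, and passed to whichever download tool is preferred.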