Illuminating the mystery of thylacine extinction: A role for relaxed selection and gene loss
Data files
Jul 29, 2025 version files 30.47 GB
-
README.md
12.82 KB
-
Thylacine_Gene_Loss.zip
30.47 GB
Abstract
Gene loss shapes lineage-specific traits but is often overlooked in species survival. In this study, we investigate the role of ancestral gene loss using the extinct icon─thylacine (Thylacinus cynocephalus). While studies of neutral genetic variation indicate a population decline before extinction, the impact of thylacine-specific ancestral gene losses remains unexplored. The availability of a chromosomal-level genome of the extinct thylacine offers a unique opportunity for such comparative studies. Here, we leverage palaeogenomic data to compare gene presence/absence patterns between the Tasmanian devil and thylacine. We discovered ancestral (between 13-1 Ma) loss of SAMD9L, HSD17B13, CUZD1, and VWA7 due to multiple gene-inactivating mutations, corroborated by short-read sequencing. The timing of gene loss mirrors the thylacine’s shift toward hypercarnivory and increased body size. Notably, the loss of SAMD9 correlates with a carnivorous diet. Our genome-wide analysis reveals olfactory receptor loss and relaxed selection, aligning with reduced olfactory lobes in the thylacine, indicating olfaction is not its primary hunting sense. By integrating palaeogenomic data with comparative genomics, our study reveals ancestral gene losses and their impact on species survival and resilience to environmental changes. Our approach can be extended to other extinct and endangered species, helping to identify genetic factors for conservation efforts.
https://doi.org/10.5061/dryad.0k6djhbcm
This repository contains the data for the paper:
"Illuminating the mystery of thylacine extinction: a role for relaxed selection and gene loss"
DOI: 10.1098/rspb.2025.1339
Folder Structure
- Main folder: Thylacine_Gene_Loss.zip
- Chromosome_wise_Chains_and_TOGA_results: Contains the output files of make_lastz_chains and TOGA.
- Purpose: Contains input, intermediate, and final output files from genome alignment and ortholog projection using LASTZ and TOGA.
- Key Tools:
- make_lastz_chains (https://github.com/hillerlab/make_lastz_chains): pairwise genome alignment
- TOGA (https://github.com/hillerlab/TOGA): identifies orthologous genes and infers gene status
- Data:
- Chain files: pairwise genomic alignment between Tasmanian devil and Tasmanian tiger.
- TOGA outputs per chromosome: predicted gene ortholog/gene status, inactivating mutations
- Figure1: Contains the gene loss confirmation genes losses in thylacine and other marsupial species, IGV-reports, and gene loss history. (Figure 1, figure S2, figure S5, figure S7, figure S8 and figure S9)
- Purpose: Stores all data used in Figure 1 and relevant supplementary figures.
- Content:
- IGV screenshots validating gene loss visually
- Gene loss summary tables
- Gene loss history across lineages
- Figure2: Contains the figure 2 figures and the inputs required to get figures like phylogenetic trees from TimeTree and iTOLs. (Figure 2)
- Content:
- Phylogenetic trees generated using the TimeTree website (www.timetree.org).
- Tree annotations and visualisations with iTOL (https://itol.embl.de/).
- Figure3:
- (a) Contains the input and output files of GC-content variation in thylacine. (Figure 3a)
- Analyses genome-wide GC content of genes, especially in lost genes and gene status inferred using TOGA.
- Input: gene coordinates and sequences
- Output: scatter plot, box plots, violin plots (GC% V/s GC stretches per gene), and codes associated with these plots.
- (b) Selection pressure analysis of "clearly lost" genes (found using TOGA), tested using RELAX. (Figure 3b)
- Uses RELAX to test for relaxation or intensification selection
- Input: codon alignments (built using TOGA) and gene trees (obtained from the TimeTree website).
- Output: Test statistics (k, p-values), model fits, output files.
- (c) Expression of lost genes in Devil and Dunnart transcriptome. (Figure 3c)
- Compares expression of lost genes in Devil and Dunnart using RNA-seq data.
- Input: Gene read counts (from kallisto), normalised expression.
- Output: Heatmaps showing tissue-specific expression.
- Gene_Tree_SAMD9-9L: Contains the alignment and tree files used to obtain gene trees. (figure S3)
- Hybpiper_patchwork: Contains the input files for Hybpiper and patchwork. The subset of fastq reads gives hits (BLASTn) to lost genes (as query sequences).
- Hybpiper_SAMD9-9L: In-frame stop codon in SAMD9L reassessed using short-read data of thylacine, using SAMD9 and SAMD9L coding sequences of other 20 marsupial species.
- Purpose: Re-evaluate premature stop codons in SAMD9L using thylacine short-read data and 20 marsupial orthologs
- Content:
- Alignment of CDS from other marsupials.
- Read-based confirmation of frameshifts or nonsense mutations.
- IGV_reports-BWA+BLASTn: Contains the IGV-reports for clearly lost genes in chromosome-wise folders (Figure 1)
- Purpose: Visualisation of "clearly lost" genes using Integrated Genomics Viewer (IGV)
- Input: Short-read BAM files (built using blastn 17 output format or mapped with bwa), genome references of Tasmanian devils.
- Figures: Visual validation of frame-disrupting changes shown in Figure 1.
- Loss_confirmation: Contains the output files of short-read BLASTn, bam-read count, and gene-loss events found by TOGA to generate a summary table and confirm gene loss. (Figure 1, table S1)
- Purpose: Independent confirmation of gene losses detected by TOGA
- Components:
- bam-readcount results: coverage and mutation confirmation
- BLASTn alignments: confirm absence/presence of genes
- Summary table: includes gene ID, status, evidence, and references
- Main_Figures: contains the biorender permission for publication.
- Myrmecobius_fasciatus_RNA_seq: Contains an IGV screenshot for the VWA7 gene, subseted bam files, and BED files. (figure S10)
- PAML_omega_SAMD9-9L: Evolutionary rate dN/dS ratio (omega) across the SAMD9 and SAMD9L genes. Contains the input and R script. (figure S4)
- Purpose: Molecular evolution analysis of SAMD9/9L using PAML
- Output:
- ω (dN/dS) values
- Selection test comparisons
- Scripts: R code used to parse and plot outputs
- Selection_analysis: Selection pressure analysis of lost gene across 21 marsupial species, tested using RELAX, codeML, MEME, FEL, aBSREL, BUSTED, and gBGC. (Figure 3 and table S5)
- Comprehensive selection pressure tests across 21 marsupial species
- Tools Used:
- RELAX, codeML, MEME, FEL, aBSREL, BUSTED, gBGC
- Input: gene trees, codon alignments
- Output: statistical test results, visualisation files
- Thylacine_transcriptome-miRNA: Contains BLASTn search output for lost genes and control genes in the miRNA database of thylacine.
- Purpose: Evaluates whether lost genes are expressed in the transcriptome or not
- Method: BLASTn against thylacine miRNA database
- TOGA_output_sorted: Contains the sorted output files of TOGA, such as inact_mut_data.txt and loss_summ_data.tsv, codon.fasta, nucleotide.fasta.tgz, prot.fasta, query_annotation.bed, and query_gene_spans.bed. Also, a mutation plot was generated using TOGA. (Figure 1)
- Cleaned and sorted outputs from TOGA
- Key files:
- inact_mut_data.txt: details on inactivating mutations
- loss_summ_data.tsv: summary of lost/retained genes (I, M, PI, UL, L, PM)
- codon.fasta, prot.fasta: sequences used for alignment of reference and query species.
- query_annotation.bed: gene annotation intervals
- Includes mutation plots obtained by running TOGA.
- Synteny: Contains the NCBI synteny images for SAMD9L, HSD17B13, CUZD1 and VWA7. (figure S5 and tables S2)
- Screenshots from NCBI Genome Data Viewer (GDV) for gene SAMD9L, CUZD1, HSD17B13, and VWA7.
- Used to identify 1-to-1 gene orthologs.
- SAMD9_gene_loss_in_carnivore: Contains the script for gene loss events identification by TOGA. Assembly verification output files of klumpy and IGV reports of events by short read for dingo and tiger. (figure S6)
- Validates gene loss events for SAMD9 in carnivores
- Contains:
- TOGA gene-loss identification script
- Assembly validation (klumpy)
- IGV-based confirmations
- R1_RSPB: Includes input, output, and code for paralog and GC-content comparison, revalidation of gene loss events, RNA-seq data from Macropus eugenii, and phylogenetic correlation analysis (Figure 2b)
- Supporting analyses for:
- Paralog vs. GC-content trends
- RNA-seq revalidation from Macropus eugenii
- Phylogenetic correlation of gene loss and carnivorous diet.
- Details of tools, version, and purpose
Sr. No. | Tool | Version | Purpose |
---|---|---|---|
1 | LASTZ (called from make_lastz_chains pipeline) | v1.04.15 | Whole-genome alignment |
2 | TOGA | v1.1.7 | Gene loss inference |
3 | SAMtools | v1.15.1 | BAM/SAM manipulation |
4 | BLASTn | v2.13.0 | Sequence similarity search |
5 | IGV-report | v1.12.0 | Visual variant inspection |
6 | BWA-MEM | v0.7.17 | Mapping of whole genome sequencing reads |
7 | minimap2 | v2.26 | |
8 | bowtie2 | v2.5.4 | |
9 | bam-readcount | v1.0.1 | Quantifying read support for events |
10 | efetch/ esearch | v12 | Extract accession-specific details/data from NCBI |
11 | Seqkit | v2.2.0 | Toolkit for FASTA/Q file manipulation |
12 | pilon | v1.24 | Genome polishing using raw read data |
13 | HybPiper | v2.3.1 | Recover gene sequence from WGS and infers in-frame stop codons |
14 | Patchwork | v0.5.2 | Recover gene sequence from WGS |
15 | seqtk | v1.2 | Toolkit for FASTA/Q file manipulation |
16 | NCBI datasets | 13.43.2 | A command-line tool that is used to query and download biological sequence data across all domains of life from NCBI databases |
17 | BEDTools | v2.27.1 | Genomic interval operations |
18 | klumpy | v1.0.11 | A tool to evaluate the integrity of long-read genome assemblies and illusive sequence motifs |
19 | Kallisto | v0.51.0 | Program for quantifying abundances of transcripts from RNA-Seq data |
20 | STAR | v2.7.0d | Program for splice-aware mapping of RNA-seq reads to the genome |
21 | PRANK | v.170427 | Multiple sequence alignment tool |
22 | GUIDANCE2 | v2.01 | Suit for codon-aware alignment with bootstrap support |
23 | HYPHY | 2.5.48(MP) and 2.5.62(MP) | Selection analysis tools |
24 | PAML (codeml) | v4.9f | dN/dS estimation across branches |
25 | MapNH | v1.3.0 | GC biased gene conversion |
26 | phastBias | v1.6 | GC biased gene conversion |
27 | IQ-TREE | v2.3.6 | Phylogenetic analysis |
Prerequisites:
- TOGA (1.1.7)
- make_lastz_chains (https://github.com/hillerlab/make_lastz_chains.git)
- PAML (4.9f)
- BLASTN (2.13.0)
- phastBias(1.6)
- mapnh(1.3.0)
- HYPHY 2.5.48(MP) and 2.5.62(MP)
- BWA(0.7.17-r1188)
- Samtools (1.3)
- bedtools (v2.27.1)
- ea-utils (https://github.com/ExpressionAnalysis/ea-utils.git)
- seqtk (1.2-r94)
- Guidance (v2.01)
- PRANK (v.170427)
- IGV (2.8.13)
- igv-reports (1.12.0)
- STAR (2.7.0d)
- MegaX
- R (4.1.0)
- bam-readcount (1.0.1)
- HybPiper (2.3.1)
- Patchwork (0.5.2)
- IQ-TREE (2.3.6)
- klumpy (1.0.11)
- kallisto (0.51.0)
- Bowtie2 (2.5.4)
R packages:
- ape (5.5)
- phytools (0.7-90)
- ggplot2 (3.3.5)
- ggrepel (0.9.1)
- cowplot (1.1.1)
- dplyr (1.0.7)
- ggplotify (0.1.0)
- grid (4.1.1)
- gridExtra (2.3)
- reshape2 (1.4.4)