Data for: The genomic landscape, causes, and consequences of extensive phylogenomic discordance in Old World mice and rats
Data files
Nov 21, 2023 version files 22.89 GB
- 01-otomys-typus-asm.tar
- 02-uce-species-tree.tar
- 03-pseudo-assemblies.tar
- 04-genomic-windows.tar
- 05-selection-tests.tar
- README.md
- README.txt
Abstract
A species tree is a central concept in evolutionary biology whereby a single-branching phylogeny reflects relationships among species. However, the phylogenies of different genomic regions often differ from the species tree. Although tree discordance is widespread in phylogenomic studies, we still lack a clear understanding of how variation in phylogenetic patterns is shaped by genome biology or the extent to which discordance may compromise comparative studies. We characterized patterns of phylogenomic discordance across the murine rodents (Old World mice and rats) – a large and ecologically diverse group that gave rise to the mouse and rat model systems. Combining recently published linked-read genome assemblies for seven murine species with other available rodent genomes, we first used ultra-conserved elements (UCEs) to infer a robust species tree. We then used whole genomes to examine finer-scale patterns of discordance and found that proximate chromosomal regions tended to have more similar phylogenetic histories. While we found no clear relationship between local tree similarity and recombination rates in house mice, we did observe a correlation between recombination rates and average similarity to the species tree. We also detected a strong influence of linked selection whereby purifying selection at UCEs led to less discordance. Finally, we show that assuming a single species tree can result in high error rates when testing for positive selection under different models. Collectively, our results highlight the complex relationship between phylogenetic inference and genome biology and underscore how failure to account for this complexity can mislead comparative genomic studies.
README: Data for: The genomic landscape, causes, and consequences of extensive phylogenomic discordance in Old World mice and rats
https://doi.org/10.5061/dryad.866t1g1wq
Description of the data and file structure
This archive contains final data files for every step of the paper "The genomic landscape, causes, and consequences of extensive phylogenomic discordance in Old World mice and rats". Some intermediate files are not included.
Below is a description of each file in the project directory.
.
├── 01-otomys-typus-asm A directory containing files related to the O. typus assembly
│ ├── abyss_assembly-contigs.fa The final assembly in FASTA format
│ ├── assembly.conf The configuration file for the assembly containing the local directory with the filtered/trimmed reads (not included in repository; see Methods)
│ ├── assembly_commands The commands run to generate the assembly
│ ├── otomys.conf A configuration file for the O. typus reads
│ ├── otomys_contig A directory containing the O. typus assembly in 2bit format
│ │ ├── otomys_contig.2bit The O. typus assembly in 2bit format
│ │ └── sizes.tab A file containing the contig sizes of the O. typus assembly
│ └── otomys_contig.sqlite A database file for the O. typus assembly
├── 02-uce-species-tree A directory containing files related to the species tree inference with UCEs
│ ├── alignments_trimal9.tar.gz The final alignments of the UCEs for all species (see Methods)
│ ├── dating Files related to divergence time estimates on the species tree
│ │ ├── top100_high.timetree.nex The time tree using the high end of the calibration points (see Table 2)
│ │ ├── top100_low.timetree.nex The time tree using the low end of the calibration points (see Table 2)
│ │ └── top100_mid.timetree.nex The time tree using the midpoint of the calibration points (see Table 2)
│ ├── iq_genes.treefile The "gene trees" for all UCEs.
│ ├── iq_genes_bs10.treefile The "gene trees" with low-support branches collapsed (ASTRAL input)
│ ├── iq_genes_bs10_phosun_rerooted.treefile The "gene trees" with low-support branches collapsed and rooted at P. sungorus (phyparts input)
│ ├── species_tree_astral A directory containing files related to species tree inference with ASTRAL
│ │ ├── astral_species.log The log file of the ASTRAL run
│ │ ├── astral_species.treefile The species tree in Newick format
│ │ └── astral_species_rerooted.treefile The species tree in Newick format rooted at P. sungorus
│ └── species_tree_iq A directory containing files related to species tree inference with concatenation with IQ-TREE
│ ├── iq_species.best_model.nex IQ-TREE output
│ ├── iq_species.best_scheme IQ-TREE output
│ ├── iq_species.best_scheme.nex IQ-TREE output
│ ├── iq_species.bionj IQ-TREE output
│ ├── iq_species.ckp.gz IQ-TREE output
│ ├── iq_species.contree IQ-TREE output
│ ├── iq_species.iqtree IQ-TREE output
│ ├── iq_species.log Log file of the IQ-TREE run
│ ├── iq_species.mldist IQ-TREE output
│ ├── iq_species.model.gz IQ-TREE output
│ ├── iq_species.splits.nex IQ-TREE output
│ ├── iq_species.treefile Inferred species tree in Newick format
│ ├── iq_species_reroot.treefile Inferred species tree in Newick format rooted at P. sungorus
│ └── speciestree_job.sh The commands used to run IQ-TREE
├── 03-pseudo-assemblies A directory containing pseudo-assemblies of the six new genomes, generated with three iterations of pseudo-it
│ ├── gdol A directory containing the pseudo-assembly of G. dolichurus (gdol)
│ │ ├── gdol-iter-03-softmask-final.chain The chain file from mm10 to the gdol pseudo-assembly
│ │ ├── gdol-iter-03-softmask-final.fa.fai The FASTA index for the gdol pseudo-assembly (samtools faidx)
│ │ └── gdol-iter-03-softmask-final.fa.gz The gdol pseudo-assembly in FASTA format
│ ├── hall A directory containing the pseudo-assembly of H. alleni (hall)
│ │ ├── hall-iter-03-softmask-final.chain The chain file from mm10 to the hall pseudo-assembly
│ │ ├── hall-iter-03-softmask-final.fa.fai The FASTA index for the hall pseudo-assembly (samtools faidx)
│ │ └── hall-iter-03-softmask-final.fa.gz The hall pseudo-assembly in FASTA format
│ ├── mnat A directory containing the pseudo-assembly of M. natalensis (mnat)
│ │ ├── mnat-iter-03-softmask-final.chain The chain file from mm10 to the mnat pseudo-assembly
│ │ ├── mnat-iter-03-softmask-final.fa.fai The FASTA index for the mnat pseudo-assembly (samtools faidx)
│ │ └── mnat-iter-03-softmask-final.fa.gz The mnat pseudo-assembly in FASTA format
│ ├── pdel A directory containing the pseudo-assembly of P. delectorum (pdel)
│ │ ├── pdel-iter-03-softmask-final.chain The chain file from mm10 to the pdel pseudo-assembly
│ │ ├── pdel-iter-03-softmask-final.fa.fai The FASTA index for the pdel pseudo-assembly (samtools faidx)
│ │ └── pdel-iter-03-softmask-final.fa.gz The pdel pseudo-assembly in FASTA format
│ ├── rdil A directory containing the pseudo-assembly of R. dilectus (rdil)
│ │ ├── rdil-iter-03-softmask-final.chain The chain file from mm10 to the rdil pseudo-assembly
│ │ ├── rdil-iter-03-softmask-final.fa.fai The FASTA index for the rdil pseudo-assembly (samtools faidx)
│ │ └── rdil-iter-03-softmask-final.fa.gz The rdil pseudo-assembly in FASTA format
│ └── rsor A directory containing the pseudo-assembly of R. soricoides (rsor)
│ ├── rsor-iter-03-softmask-final.chain The chain file from mm10 to the rsor pseudo-assembly
│ ├── rsor-iter-03-softmask-final.fa.fai The FASTA index for the rsor pseudo-assembly (samtools faidx)
│ └── rsor-iter-03-softmask-final.fa.gz The rsor pseudo-assembly in FASTA format
├── 04-genomic-windows Files relating to the 10kb genomic window analyses
│ ├── aln.tar.gz Alignments of 10kb genomic windows
│ │ ├── 01-mafft A directory containing the raw 10kb window alignments from MAFFT
│ │ │ └── chr*/10kb-0.5-0.5/chr*-*-mafft.fa Raw FASTA alignments from MAFFT for each 10kb window
│ │ └── 02-trimal A directory containing the trimmed 10kb window alignments from trimal
│ │ └── chr*/10kb-0.5-0.5/chr*-*-mafft-trimal.fa Trimmed FASTA alignments from trimal for each 10kb window
│ ├── bed.tar.gz Bed files containing coordinates of the 10kb windows in each genome
│ │ ├── chr* Bed files for each 10kb window in mm10 coordinates for each individual chromosome
│ │ │ ├── gdol-chr*-10kb.bed 10kb window coordinates for G. dolichurus (gdol) for each chromosome
│ │ │ ├── gdol-chr*-10kb.bed.unmap 10kb windows that were unable to be lifted over from the mm10 assembly to the G. dolichurus (gdol) assembly
│ │ │ ├── hall-chr*-10kb.bed 10kb window coordinates for H. alleni (hall) for each chromosome
│ │ │ ├── hall-chr*-10kb.bed.unmap 10kb windows that were unable to be lifted over from the mm10 assembly to the H. alleni (hall) assembly
│ │ │ ├── mm10-chr*-10kb.bed 10kb window coordinates for M. musculus (mm10) for each chromosome
│ │ │ ├── mm10-chr*-10kb-repeats.bed 10kb window coordinates for M. musculus (mm10) for each chromosome with repeat content per window
│ │ │ ├── mnat-chr*-10kb.bed 10kb window coordinates for M. natalensis (mnat) for each chromosome
│ │ │ ├── mnat-chr*-10kb.bed.unmap 10kb windows that were unable to be lifted over from the mm10 assembly to the M. natalensis (mnat) assembly
│ │ │ ├── pdel-chr*-10kb.bed 10kb window coordinates for P. delectorum (pdel) for each chromosome
│ │ │ ├── pdel-chr*-10kb.bed.unmap 10kb windows that were unable to be lifted over from the mm10 assembly to the P. delectorum (pdel) assembly
│ │ │ ├── rdil-chr*-10kb.bed 10kb window coordinates for R. dilectus (rdil) for each chromosome
│ │ │ ├── rdil-chr*-10kb.bed.unmap 10kb windows that were unable to be lifted over from the mm10 assembly to the R. dilectus (rdil) assembly
│ │ │ ├── rsor-chr*-10kb.bed 10kb window coordinates for R. soricoides (rsor) for each chromosome
│ │ │ └── rsor-chr*-10kb.bed.unmap 10kb windows that were unable to be lifted over from the mm10 assembly to the R. soricoides (rsor) assembly
│ │ ├── gdol-10kb.bed 10kb window coordinates for G. dolichurus (gdol) for all chromosomes
│ │ ├── gdol-10kb.bed.unmap 10kb windows that were unable to be lifted over from the mm10 assembly to the G. dolichurus (gdol) assembly
│ │ ├── gdol-10kb-seqs.bed 10kb window coordinates and sequences for G. dolichurus (gdol) for all chromosomes
│ │ ├── hall-10kb.bed 10kb window coordinates for H. alleni (hall) for all chromosomes
│ │ ├── hall-10kb.bed.unmap 10kb windows that were unable to be lifted over from the mm10 assembly to the H. alleni (hall) assembly
│ │ ├── hall-10kb-seqs.bed 10kb window coordinates and sequences for H. alleni (hall) for all chromosomes
│ │ ├── mm10-10kb.bed 10kb window coordinates for M. musculus (mm10) for all chromosomes
│ │ ├── mm10-10kb-repeat-coverage.bed 10kb window coordinates for M. musculus (mm10) for all chromosomes with repeat content per window
│ │ ├── mm10-10kb-seqs.bed 10kb window coordinates and sequences for M. musculus (mm10) for all chromosomes
│ │ ├── mnat-10kb.bed 10kb window coordinates for M. natalensis (mnat) for all chromosomes
│ │ ├── mnat-10kb.bed.unmap 10kb windows that were unable to be lifted over from the mm10 assembly to the M. natalensis (mnat) assembly
│ │ ├── mnat-10kb-seqs.bed 10kb window coordinates and sequences for M. natalensis (mnat) for all chromosomes
│ │ ├── pdel-10kb.bed 10kb window coordinates for P. delectorum (pdel) for all chromosomes
│ │ ├── pdel-10kb.bed.unmap 10kb windows that were unable to be lifted over from the mm10 assembly to the P. delectorum (pdel) assembly
│ │ ├── pdel-10kb-seqs.bed 10kb window coordinates and sequences for P. delectorum (pdel) for all chromosomes
│ │ ├── rdil-10kb.bed 10kb window coordinates for R. dilectus (rdil) for all chromosomes
│ │ ├── rdil-10kb.bed.unmap 10kb windows that were unable to be lifted over from the mm10 assembly to the R. dilectus (rdil) assembly
│ │ ├── rdil-10kb-seqs.bed 10kb window coordinates and sequences for R. dilectus (rdil) for all chromosomes
│ │ ├── rsor-10kb.bed 10kb window coordinates for R. soricoides (rsor) for all chromosomes
│ │ ├── rsor-10kb.bed.unmap 10kb windows that were unable to be lifted over from the mm10 assembly to the R. soricoides (rsor) assembly
│ │ └── rsor-10kb-seqs.bed 10kb window coordinates and sequences for R. soricoides (rsor) for all chromosomes
│ └── tree.tar.gz Trees inferred for each unfiltered 10kb window
│ └── chr*/10kb-0.5-0.5 Directories for each chromosome
│ ├── astral A directory containing files output from ASTRAL relating to the species tree it inferred, including the treefile astral.treefile
│ ├── concat A directory containing files output from IQ-TREE relating to the species tree inferred from genes using concatenation, including the treefile concat.treefile
│ ├── loci Individual directories for IQ-TREE output for each gene tree, including the tree file as */*.treefile, where * is the gene id
│ └── loci.treefile A single file containing all 10kb window trees
└── 05-selection-tests Files related to selection tests for genes
├── aln.tar.gz Gene alignments
│ ├── 01-pre-trim Files from MACSE's pre-trim step
│ │ ├── aa Trimmed amino acid alignments for each gene in FASTA format
│ │ ├── logs Log files for MACSE's pre-trim program
│ │ ├── nt Trimmed nucleotide alignments for each gene in FASTA format
│ │ └── trim-stats CSV files for each gene showing the trimming stats
│ ├── 02-macse Files from MACSE alignment
│ │ ├── aa MACSE aligned amino acid sequences for each gene in FASTA format
│ │ ├── logs Log files for MACSE
│ │ └── nt MACSE aligned nucleotide sequences for each gene in FASTA format
│ ├── 03-trim Files from MACSE's post-alignment trimming step
│ │ ├── info CSV files for each gene showing the trimming stats
│ │ ├── logs Log files for MACSE's post-alignment trimming
│ │ └── nt Trimmed nucleotide alignments for each gene in FASTA format
│ └── 04-filter-spec4-seq20-site50 Files from subsequent custom alignment filtering
│ │ ├── aln-filter-passed-spec4-seq20-site50.txt A list of gene ids that passed alignment filtering
│ │ ├── aln-stats-spec4-seq20-site50.log A log file for the alignment filtering
│ │ ├── cds A directory containing trimmed nucleotide alignments for each gene in FASTA format
│ │ ├── cds-header-trimmed A directory containing trimmed nucleotide alignments for each gene in FASTA format with their headers trimmed for species tree inference
│ │ ├── gappy-seqs-filtered-spec4-seq20-site50.tab A list of sequences that were filtered for being too gappy
│ │ ├── pep A directory containing trimmed amino acid alignments for each gene in FASTA format
│ │ ├── spec-stats-spec4-seq20-site50.tab Trimming and filtering stats for each individual alignment
│ │ ├── stop-codon-filtered-spec4-seq20-site50.tab A list of sequences that were removed for having a premature stop codon
│ │ ├── too-few-species-filtered-spec4-seq20-site50.tab A list of alignments that were removed for having too few sequences after filtering
│ │ └── too-short-filtered-spec4-seq20-site50.tab A list of alignments that were removed for being too short after filtering
├── bed.tar.gz Coordinates and coding sequences for the genes (longest transcripts)
│ ├── gdol-cds-seqs.tab Gene sequences and coordinates for G. dolichurus (gdol)
│ ├── gdol-cds.bed Gene coordinates for G. dolichurus (gdol)
│ ├── gdol-cds.bed.unmap Gene sequences and coordinates for G. dolichurus (gdol) that could not be lifted over from the mm10 annotation
│ ├── hall-cds-seqs.tab Gene sequences and coordinates for H. alleni (hall)
│ ├── hall-cds.bed Gene coordinates for H. alleni (hall)
│ ├── hall-cds.bed.unmap Gene sequences and coordinates for H. alleni (hall) that could not be lifted over from the mm10 annotation
│ ├── mm10-cds-seqs.tab Gene sequences and coordinates for mm10
│ ├── mm10.ensGene.chromes.longest.bed Coordinates of the longest transcript from the mm10 annotation
│ ├── mm10.ensGene.chromes.longest.cds.bed Coordinates of the longest transcript and CDS exons from the mm10 annotation
│ ├── mnat-cds-seqs.tab Gene sequences and coordinates for M. natalensis (mnat)
│ ├── mnat-cds.bed Gene coordinates for M. natalensis (mnat)
│ ├── mnat-cds.bed.unmap Gene sequences and coordinates for M. natalensis (mnat) that could not be lifted over from the mm10 annotation
│ ├── pdel-cds-seqs.tab Gene sequences and coordinates for P. delectorum (pdel)
│ ├── pdel-cds.bed Gene coordinates for P. delectorum (pdel)
│ ├── pdel-cds.bed.unmap Gene sequences and coordinates for P. delectorum (pdel) that could not be lifted over from the mm10 annotation
│ ├── rdil-cds-seqs.tab Gene sequences and coordinates for R. dilectus (rdil)
│ ├── rdil-cds.bed Gene coordinates for R. dilectus (rdil)
│ ├── rdil-cds.bed.unmap Gene sequences and coordinates for R. dilectus (rdil) that could not be lifted over from the mm10 annotation
│ ├── rsor-cds-seqs.tab Gene sequences and coordinates for R. soricoides (rsor)
│ ├── rsor-cds.bed Gene coordinates for R. soricoides (rsor)
│ └── rsor-cds.bed.unmap Gene sequences and coordinates for R. soricoides (rsor) that could not be lifted over from the mm10 annotation
└── tree.tar.gz Gene trees and species tree inferred from genes
├── astral A directory containing files output from ASTRAL relating to the species tree it inferred, including the treefile astral.treefile
├── concat A directory containing files output from IQ-TREE relating to the species tree inferred from genes using concatenation, including the treefile concat.treefile
├── loci Individual directories for IQ-TREE output for each gene tree, including the tree file as */*.treefile, where * is the gene id
└── loci.treefile A file containing all gene trees
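The archives are large, so it can help to inspect them before unpacking everything. The following is a minimal Python sketch, not part of the original workflow, that uses only the standard library to list archive members, read sequence lengths from a samtools faidx index, and count the trees in a combined treefile. The function names are hypothetical, and the sketch assumes that the combined treefiles (e.g. loci.treefile) hold one Newick string per line, which is not confirmed by this README.

```python
import gzip
import tarfile


def list_archive_members(archive_path, limit=20):
    """Return up to `limit` member names from a tar archive without extracting it."""
    names = []
    # Mode "r:*" lets tarfile detect plain or compressed (gz, bz2, xz) archives.
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names


def read_fai(fai_path):
    """Map sequence name -> length from a samtools faidx index (.fai).

    A .fai file is tab-separated; the first two columns are the sequence
    name and its length in bases.
    """
    lengths = {}
    with open(fai_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                lengths[fields[0]] = int(fields[1])
    return lengths


def count_newick_trees(treefile_path):
    """Count trees in a file, ASSUMING one Newick string (ending in ';') per line."""
    opener = gzip.open if str(treefile_path).endswith(".gz") else open
    with opener(treefile_path, "rt") as handle:
        return sum(1 for line in handle if line.strip().endswith(";"))
```

For example, `list_archive_members("03-pseudo-assemblies.tar")` would show the first few paths in that archive, and `read_fai` could be pointed at one of the `*-iter-03-softmask-final.fa.fai` files once it is extracted. The exact paths inside each archive are assumed to match the tree above.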
Sharing/Access information
Code that was used to process the data is available at: https://github.com/gwct/murine-discordance