Data from: Estimating genome-wide phylogenies using probabilistic topic modeling
Data files
Mar 04, 2025 version files 478.39 MB
- README.md (1.80 KB)
- two_birds_Pacbio.tar.gz (478.39 MB)
Abstract
Inferring the evolutionary history of species or populations with genome-wide data is gaining ground, but computational constraints still limit our abilities in this area. We developed an alignment-free method to infer the genome-wide species tree and implemented it in the Python package TopicContml. The method uses probabilistic topic modeling (specifically, Latent Dirichlet Allocation or LDA) to extract ‘topic’ frequencies from k-mers, which are derived from multilocus DNA sequences. These extracted frequencies then serve as input for the program Contml in the PHYLIP package, which is used to generate a species tree. We evaluated the performance of our method with biological and simulated data sets: a data set with 14 DNA sequence loci from 78-92 haplotypes from two Australian bird species distributed in 9 populations; a second data set of 5162 loci from 80 mammal species; a third data set of 67317 autosomal loci and 4157 X-chromosome loci of 6 species in the Anopheles gambiae complex; and several simulated data sets. Our empirical results and simulated data suggest that our method is efficient and statistically accurate. We also assessed the uncertainty of the estimated relationships among clades using a bootstrap procedure for aligned sequence data and for k-mer data.
https://doi.org/10.5061/dryad.73n5tb36r
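The k-mer step mentioned in the abstract can be illustrated with a minimal sketch. This is not TopicContml's implementation: the function name `kmer_counts`, the choice of `k`, and the decision to skip k-mers containing characters other than A, C, G, and T are all assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping k-mers in one DNA sequence (hypothetical helper).

    Assumption: k-mers containing anything other than A, C, G, or T
    (for example N) are skipped. The associated package may treat
    ambiguous bases differently.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    counts = Counter()
    # Slide a window of width k along the sequence, one base at a time.
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts
```

Counts of this kind, collected per sequence or per locus, are the sort of "word" frequencies a topic model such as LDA takes as input.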
Description of the data and file structure
The data consist of two fasta.gz files, one per species, each containing DNA sequence reads. These files were converted from fastq.gz files, the form in which the original data were produced after post-processing from the PacBio HiFi sequencer. A fasta.gz file is a gzip-compressed FASTA file, a standard text format for DNA sequence data.
Files and variables
File: two_birds_Pacbio.tar.gz
Description: The archive contains two gzip-compressed FASTA files bundled with tar, as follows:
- Ctat_100k_seqtk_Pacbio_reads.fasta.gz - 100,000 PacBio HiFi reads from Crypturellus tataupa
- Tgut_100k_seqtk_Pacbio_reads.fasta.gz - 100,000 PacBio HiFi reads from Tinamus guttatus
The file was compiled with this command line:
tar cvf two_birds_Pacbio.tar.gz Ctat_100k_seqtk_Pacbio_reads.fasta.gz Tgut_100k_seqtk_Pacbio_reads.fasta.gz
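For readers who prefer to unpack the archive from Python rather than the shell, the following is a minimal sketch using only the standard library. The function name `extract_members` and the destination directory are hypothetical. The mode `"r:*"` asks `tarfile` to detect compression on its own, which matters here because the command above uses `tar cvf` without the `z` option, so the outer archive may be a plain tar file despite its `.tar.gz` name.

```python
import shutil
import tarfile
from pathlib import Path


def extract_members(archive_path, dest_dir="."):
    """Copy every regular file in the archive into dest_dir.

    Returns the list of paths written. Hypothetical helper for
    illustration only.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    # "r:*" opens plain, gzip, bz2, or xz tar archives transparently.
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            # Keep only the base name so members cannot escape dest_dir.
            target = dest / Path(member.name).name
            with archive.extractfile(member) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            written.append(target)
    return written
```

Calling `extract_members("two_birds_Pacbio.tar.gz", "data")` would be expected to write the two fasta.gz files listed above into a `data` directory. The inner files stay gzip-compressed and can be read directly with the `gzip` module.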
As in a typical FASTA file, each sequence begins with a header line that starts with ">". The name of each sequence is arbitrary and was assigned by the DNA sequencing machine.
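The FASTA layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the deposited data or the associated software; the function names `read_fasta` and `read_fasta_gz` are hypothetical, and the whole header line after ">" is treated as the sequence name.

```python
import gzip


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA text lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_fasta_gz(path):
    """Yield (name, sequence) pairs from a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        yield from read_fasta(handle)
```

Because both functions are generators, a file with 100,000 reads can be processed one record at a time without holding every sequence in memory.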
Code/software
The file can be unpacked with standard Unix command-line tools such as tar and gunzip. Reads were counted and measured using functions in seqkit (Shen W, Le S, Li Y, Hu F (2016) SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLOS ONE 11(10): e0163962. https://doi.org/10.1371/journal.pone.0163962). The procedures for further analyzing the data are described in the associated publication.
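As a rough cross-check of read counts and lengths without SeqKit, the sketch below computes a few summary values in plain Python. It is not SeqKit and does not reproduce its output. The source does not say which SeqKit functions were used, and the function names and dictionary keys here are this sketch's own choices.

```python
import gzip
from statistics import mean


def read_lengths(path):
    """Return the length of every sequence in a gzip-compressed FASTA file."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths):
    """Summarize a list of read lengths (hypothetical helper)."""
    if not lengths:
        return {"num_seqs": 0, "sum_len": 0}
    return {
        "num_seqs": len(lengths),
        "sum_len": sum(lengths),
        "min_len": min(lengths),
        "avg_len": mean(lengths),
        "max_len": max(lengths),
    }
```

For each of the two fasta.gz files described above, `summarize(read_lengths(path))["num_seqs"]` would be expected to equal the stated 100,000 reads.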
Access information
Other publicly accessible locations of the data:
- N/A
Data was derived from the following sources:
- N/A
PacBio HiFi reads were generated from DNA isolated from four birds: a Woodhouse's Scrub-Jay (Aphelocoma woodhouseii, specimen MCZ:Orn:365326); a Yucatan Jay (Cyanocorax yucatanicus, specimen MCZ:Orn:365269); a Tataupa Tinamou (Crypturellus tataupa); and a White-throated Tinamou (Tinamus guttatus). DNA was isolated using Qiagen Magattract, and the concentration was estimated using a TapeStation. The DNA of the two tinamous was sequenced on a PacBio HiFi Revio DNA sequencer at the Faculty of Arts & Sciences Bauer Core, Harvard University, and the DNA of the two jays was sequenced on a PacBio HiFi machine at the DNA Sequencing & Genotyping Center at the University of Delaware.
