Data from: Evolutionary history inferred from the de novo assembly of a non-model organism, the blue-eyed black lemur

Meyer, Wynn K.1; Venkat, Aarti1; Kermany, Amir R.1; van de Geijn, Bryce1; Zhang, Sidi1; Przeworski, Molly1

Published Jul 22, 2015 on Dryad. https://doi.org/10.5061/dryad.rn745

Data files

Jul 22, 2015 version files 3.38 GB

Abstract

Lemurs, the living primates most distantly related to humans, demonstrate incredible diversity in behaviour, life history patterns and adaptive traits. Although many lemur species are endangered within their native Madagascar, there is no high-quality genome assembly from this taxon, limiting population and conservation genetic studies. One critically endangered lemur is the blue-eyed black lemur Eulemur flavifrons. This species is fixed for blue irises, a convergent trait that evolved at least four times in primates and was subject to positive selection in humans, where 5′ regulatory variation of OCA2 explains most of the brown/blue eye colour differences. We built a de novo genome assembly for E. flavifrons, providing the most complete lemur genome to date, and a high confidence consensus sequence for close sister species E. macaco, the (brown-eyed) black lemur. From diversity and divergence patterns across the genomes, we estimated a recent split time of the two species (160 Kya) and temporal fluctuations in effective population sizes that accord with known environmental changes. By looking for regions of unusually low diversity, we identified potential signals of directional selection in E. flavifrons at MITF, a melanocyte development gene that regulates OCA2 and has previously been associated with variation in human iris colour, as well as at several other genes involved in melanin biosynthesis in mammals. Our study thus illustrates how whole-genome sequencing of a few individuals can illuminate the demographic and selection history of nonmodel species.

EfEmSangerSequencingForDryad

Sanger sequencing results for up to 16 individuals (9 blue-eyed black lemurs and 7 black lemurs from the Duke Lemur Center) sequenced at 17 amplicons throughout scaffold2503 (the scaffold containing the orthologs of HERC2/OCAs) and aligned using EBioX

Variable sites VCF files

These files contain sites with a high posterior probability (> 0.8) of being polymorphic in the 8-sample dataset (4 samples from each species), along with their genotype likelihoods in these samples as estimated by ANGSD (http://popgen.dk/wiki/index.php/ANGSD). Genotype likelihoods were estimated using the GATK model (-GL 2). The probability of a site being variable was estimated using ngsStat (https://github.com/mfumagalli/ngsTools), based on posterior probabilities calculated using the genotype likelihoods and the site frequency spectrum estimated from high quality sites (minimum mapping quality 1, minimum quality 20, minimum 3 samples with data, minimum depth 9) within each species.

VariableSitesVCFFiles1.tar.gz

Variable sites VCF files

VariableSitesVCFFiles2.tar.gz

Eulemur flavifrons fastq file for PSMC

This fastq file was generated from the all-sites vcf file for Harlow (the E. flavifrons individual used for the de novo assembly), and was used to infer historic Ne using PSMC (Li and Durbin 2011). The vcf was generated by running GATK (McKenna et al. 2010; DePristo et al. 2011) on the filtered bam file (mapping quality 20 or above, PE reads aligned to the same scaffold with correctly oriented read pairs mapping within three standard deviations of the mean, duplicates removed), with the EMIT_ALL_SITES option and a minimum base quality of 20. The fastq was generated from this all-sites vcf using the vcf2fq function within the vcfutils.pl script, using a minimum depth of 26 and a maximum depth of 104 (0.5 and 2x the mean depth, respectively), and a minimum quality (QUAL*GQ) of 20.

Lemur.allLibs.mergedtoInsert_Sorted_Quake_PutativeSE_K33.gapClosed.scafSeqUG1AllSites.dpfilterL26H104Q20.fq.gz

Eulemur macaco fastq files for PSMC

These two files should be combined into one fastq representing the whole genome. The combined fastq file was generated from the all-sites vcf file for Harmonia (the E. macaco individual used for the de novo assembly), and was used to infer historic Ne using PSMC (Li and Durbin 2011). The vcf was generated by running GATK (McKenna et al. 2010; DePristo et al. 2011) on the filtered bam file (mapping quality 10 or above, duplicates removed), with the EMIT_ALL_SITES option and a minimum base quality of 20. The fastq was generated from this all-sites vcf using a modification of the vcf2fq function within the vcfutils.pl script (vcf2fqnonref, in https://github.com/sorrywm/genome_analysis/vcfutils_mod.pl), using a minimum depth of 10 and a maximum depth of 41 (0.5 and 2x the mean depth, respectively), and a minimum quality (QUAL*GQ) of 20.

MergedBamsFromBWADefaultOneRound_PEandSE_sortedtoInsert_Sorted_Quake_PutativeSE_K33.gapClosed.scafSeqUG1AllSites.dpfilterL10H41Q20.1.fastq.gz

Eulemur macaco fastq files for PSMC

MergedBamsFromBWADefaultOneRound_PEandSE_sortedtoInsert_Sorted_Quake_PutativeSE_K33.gapClosed.scafSeqUG1AllSites.dpfilterL10H41Q20.2.fastq.gz

Dataset S1: Annotated transcripts in the 1% FST tail

This list contains all transcripts mapped to regions in the 1% tail of FST from the full dataset in either species. Columns represent Ensembl transcript ID, region of high FST where the transcript was annotated, start position of the transcript within the region, end position of the transcript within the region, proportion of the transcript mapped to the region, mean percent identity for parts of the transcript that mapped, Ensembl gene ID, and gene symbol. We annotated genes in the 1% tail of all 20 kb non-overlapping windows by aligning human protein sequences to the blue-eyed black lemur genome. We obtained protein sequences for human genome build hg18 and used TBLASTN version 2.2.22+ (Altschul et al. 1990, 1997), with an e-value threshold of 5 x 10-5 to identify orthologs within the regions of the blue-eyed black lemur reference genome corresponding to the 1% FST tail. We then took the list of all human proteins with hits within candidate regions and performed TBLASTN for these proteins against the entire lemur genome. We retained proteins whose best genome-wide match (containing the lowest e-value or maximum mean percent identity) for any subset of the protein sequence overlapped the candidate region. In cases in which multiple proteins mapped to the same location (>50% protein length overlapping, presumably representing multiple transcripts of the same gene or multiple genes in the same family), we retained the protein with the largest total length spanned by initial TBLASTN hits or the largest mean percent identity.

MeyerVenkatEtAlDatasetS1New.txt

Data from: Evolutionary history inferred from the de novo assembly of a non-model organism, the blue-eyed black lemur

Data files

Abstract

Usage notes

EfEmSangerSequencingForDryad

Variable sites VCF files

Variable sites VCF files

Eulemur flavifrons fastq file for PSMC

Eulemur macaco fastq files for PSMC

Eulemur macaco fastq files for PSMC

Dataset S1: Annotated transcripts in the 1% FST tail

Works referencing this dataset