Shared polymorphisms, loci with identical alleles across species, are of unique interest in evolutionary biology as they may represent cases of selection maintaining ancient genetic variation post-speciation, or contemporary selection promoting convergent evolution. In this study, we investigate the abundance of shared polymorphism between two members of the Daphnia pulex species complex. We test whether the presence of shared mutations is consistent with the action of balancing selection or alternative hypotheses such as hybridization, incomplete lineage sorting, or convergent evolution. We analyzed over 2,000 genomes from six taxa in the D. pulex species group and examined the prevalence and distribution of shared alleles between the focal species pair, North American and European D. pulex. We show that North American and European D. pulex diverged over ten million years ago, yet retained tens of thousands of shared polymorphisms. We suggest that the number of shared polymorphisms between North American and European D. pulex cannot be fully explained by hybridization or incomplete lineage sorting alone. We show that most shared polymorphisms could be the product of convergent evolution, that a limited number appear to be old trans-specific polymorphisms, and that balancing selection is affecting convergent and ancient mutations alike. Finally, we provide evidence that a blue wavelength opsin gene with trans-specific polymorphisms has functional effects on behavior and fitness in the wild.
Daphnia Trans-Species Polymorphism Dataset Repository:
- Sample Data:
- Samples metadata used in initial analyses of multi-species dataset: samples.fin.3.9.22.csv
- Samples subset to one individual per multi-locus genotype (MLG); this subset was used in most analyses, as indicated in the manuscript and on GitHub: samples.fin.9.8.22.csv
- Any "NA" or null values in the metadata reflect fields that were not reported consistently across SRA uploaders.
- These two sample sheets have identical column names (see below), but the samples differ based on clonal subsampling:
| Column Name | Description |
| --- | --- |
| Sample | Sample identifier for each sample in the dataset, including the multi-locus genotype it belongs to. |
| mlg.country | Sample identifier for each multi-locus genotype. |
| id | Sample identifier for each sample in the dataset, including the multi-locus genotype it belongs to. |
| pondID | Identifier for the pond from which the sample was collected. |
| year | Year the sample was collected. |
| long | Longitude coordinate for the sample’s origin. |
| lat | Latitude coordinate for the sample’s origin. |
| SC | Superclone that each sample belongs to, valid for European Daphnia. |
| Nonindependent | Indicates whether the sample is a clonal duplicate (0 or 1). |
| WildSequenced | Indicates whether the sample was isolated from the wild (0 or 1). |
| Sex | Identifier related to the sex of the sample. |
| Species | Species for each sample. |
| AxCF1Hybrid | Indicates whether the sample is a hybrid from the AxC crossing experiment. |
| LabGenerated | Indicates whether the sample was generated from another lab crossing experiment. |
| BioProject | BioProject ID from the Sequence Read Archive (SRA). |
| Continent | Continent where the sample originated according to the SRA submission portal. |
| Origin | Additional descriptor of the sample’s origin. |
| Library.Name | SRA library identification. |
| Sample.Name | Sample identifier for each sample from the SRA. |
| Isolate | SRA isolate identification. |
| geo_loc_name_country | Country as per the SRA submission portal, describing where the sample originated. |
| geo_loc_name_country_continent | Continent information from the SRA submission portal corresponding to the country. |
| cloneID | SRA clone ID. |
| dna.seq | Whether DNA was sequenced from other lab experiments. |
| Country | Country descriptor from the SRA submission portal. |
| isolate.name | SRA isolate ID. |
| Reproduction | Indicates whether the sample reproduces asexually or sexually (cyclically parthenogenetic). |
| Monoisolate | SRA monoisolate ID. |
| PoolSeq | SRA pool-sequencing ID. |
| IndividualSeq | SRA individual sequencing ID. |
| Sexual.phenotype | Indicates whether the sample displays ephippia (resting egg cases), which are associated with sexual reproduction. |
| cont | Abbreviated or alternative continent descriptor for the sample’s origin. |
| country | Alternative country descriptor for the sample’s origin. |
| mean.depth | Average genome-wide sequencing depth (x) for each sample. |
| mean.miss | Average missingness (proportion of missing data) for each sample. |
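Reading either sample sheet can be sketched as follows. This is a minimal illustration rather than the pipeline used in the manuscript: the rows are invented toy values, `load_samples` and `independent_samples` are hypothetical helpers, and treating `Nonindependent == 1` as "clonal duplicate" is an assumption based on the column description above.

```python
import csv
import io

# Toy stand-in for a sample sheet; every value here is invented for illustration.
TOY_SHEET = """Sample,mlg.country,Nonindependent,mean.depth
s1,mlgA,0,12.5
s2,mlgA,1,NA
s3,mlgB,0,8.0
"""


def load_samples(handle):
    """Read a sample sheet, turning "NA" and empty cells into None."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({key: (None if value in ("NA", "") else value)
                     for key, value in row.items()})
    return rows


def independent_samples(rows):
    """Keep rows not flagged as clonal duplicates (assumes 1 marks a duplicate)."""
    return [row for row in rows if row["Nonindependent"] == "0"]


samples = load_samples(io.StringIO(TOY_SHEET))
kept = independent_samples(samples)
print([row["Sample"] for row in kept])
```

For the real file, the same helper could be called as `load_samples(open("samples.fin.3.9.22.csv", newline=""))`.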
- Reference allele bias calculations (100 bootstraps) within species with column descriptors below: refallelebias_subsamp_busco1000snps.csv
| Column Name | Description |
| --- | --- |
| variant.id | Unique identifier for each variant. |
| length | Length of the variant (e.g., number of base pairs or size of the variant region). |
| dos.alt | Dosage value for the alternate allele. |
| dos.ref | Dosage value for the reference allele. |
| Species | Species from which the sample was obtained. |
| Continent | Continent where the sample was collected. |
| Country | Country where the sample was collected. |
| Sample | Identifier for the sample. |
| iteration | Replicate number in the analysis workflow. |
| af | Minor allele frequency of the variant. |
| num | A numeric count or identifier related to the variant (e.g., total occurrence count). |
| af.dos | Dosage-adjusted minor allele frequency, combining allele dosage information with frequency estimates. |
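One way to summarize the reference allele bias table is to average a reference-allele fraction within each species and continent. The sketch below is only an assumed reading of the columns: it treats `dos.ref / (dos.ref + dos.alt)` as the quantity of interest, which the README does not state, and the rows are invented toy values.

```python
from collections import defaultdict
from statistics import mean

# Toy rows shaped like refallelebias_subsamp_busco1000snps.csv (values invented).
rows = [
    {"Species": "Daphnia pulex", "Continent": "Europe", "iteration": 1, "dos.ref": 60, "dos.alt": 40},
    {"Species": "Daphnia pulex", "Continent": "Europe", "iteration": 2, "dos.ref": 50, "dos.alt": 50},
    {"Species": "Daphnia pulex", "Continent": "NorthAmerica", "iteration": 1, "dos.ref": 30, "dos.alt": 70},
    {"Species": "Daphnia pulex", "Continent": "NorthAmerica", "iteration": 2, "dos.ref": 0, "dos.alt": 0},
]


def ref_fraction(row):
    """Assumed bias measure: share of total dosage carried by the reference allele."""
    total = row["dos.ref"] + row["dos.alt"]
    return row["dos.ref"] / total if total else None


def summarize(rows):
    """Mean reference fraction per (Species, Continent), skipping rows with no dosage."""
    per_group = defaultdict(list)
    for row in rows:
        fraction = ref_fraction(row)
        if fraction is not None:
            per_group[(row["Species"], row["Continent"])].append(fraction)
    return {group: mean(values) for group, values in per_group.items()}


summary = summarize(rows)
for group, value in sorted(summary.items()):
    print(group, round(value, 3))
```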
Single Nucleotide Polymorphisms Datasets and Associated Metadata:
- Filtered and annotated Genomic Data Structure (GDS): combined.filtsnps10bpindels_snps_filthighmiss.qual.pass.ann.gds
- This is the filtered SNP dataset built by converting the entire multi-species dataset into a GDS object for use in R-based analyses and visualization (i.e., PCA, ADMIXTURE, and phylogenetics). The filtering and processing of the VCF are described in detail in the manuscript, and the scripts for the relevant analyses are available on GitHub.
library(SeqArray)
out=c("combined.filtsnps10bpindels_snps_filthighmiss.qual.pass.ann.gds")
genofile=seqOpen(out)
- SNP metadata, subset from total_snps, with the 2D site-frequency spectrum information used throughout the project: tot_snpsclassified_0.01thresh.RDS
| Column Name | Description |
| --- | --- |
| variant.id | Unique SNP identifier. |
| classified | Classification of how the SNP is shared or polymorphic between and within species. |
| col | Classification of the SNP using the D84A genome. |
| simpleAnnot | Simplified classification of the SNP using the D84A genome. |
| ch | Unique identifier formed by pasting together the chromosome and position of the SNP. |
| gene | Gene on which the SNP is located. |
| aa.change | Amino acid change caused by the SNP. |
| protein | The protein that the SNP affects. |
| chrom | Chromosome on which the SNP is located. |
| position | Position of the SNP. |
| Daphnia.pulex.NorthAmerica | Minor allele frequency (MAF) of the SNP within North American Daphnia pulex. |
| Daphnia.pulex.Europe | Minor allele frequency (MAF) of the SNP within European Daphnia pulex. |
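The relationship between the two MAF columns and a per-SNP class can be sketched as below. This is an assumption-laden illustration, not the rule from the manuscript: the 0.01 cutoff is inferred only from "0.01thresh" in the file name, the strict `>` comparison is a guess, and the labels are borrowed from the class names listed in the lift-over section of this README (fixed, poly_euro, poly_nam, shared_poly).

```python
# Assumed cutoff, taken from the "0.01thresh" part of the file name.
THRESHOLD = 0.01


def classify(maf_nam, maf_euro, threshold=THRESHOLD):
    """Label a SNP by whether it is polymorphic in each species (hypothetical rule)."""
    poly_nam = maf_nam > threshold
    poly_euro = maf_euro > threshold
    if poly_nam and poly_euro:
        return "shared_poly"
    if poly_nam:
        return "poly_nam"
    if poly_euro:
        return "poly_euro"
    return "fixed"


# Toy (North America MAF, Europe MAF) pairs; the values are invented.
toy_snps = [(0.20, 0.15), (0.30, 0.0), (0.0, 0.05), (0.0, 0.0)]
labels = [classify(nam, euro) for nam, euro in toy_snps]
print(labels)
```

The `classified` column in the RDS file remains the authoritative label; a rule like this is only useful for checking how a different threshold would shift the counts.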
- Data used in moments' migration simulation: daphnia.filt.mlg.genome.11.18.22_data_dict.pickle
- This file is the filtered SNP dataset, converted from the entire multi-species dataset into a Python pickle object that serves as the basis for our genetic simulations using moments. Its use is detailed on GitHub under the moments analyses, and it can be loaded into a Python session using:
#!/usr/bin/python
import pickle
# Adjust the path if the pickle file is not in the working directory.
with open("daphnia.filt.mlg.genome.11.18.22_data_dict.pickle", "rb") as f:
    data_dict = pickle.load(f)
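Before passing a pickled object to downstream code, it can help to check what it contains. The helper below is a hypothetical convenience, not part of the repository, and the toy dictionary it is demonstrated on is invented: its keys and values are not a description of the real file's layout. Because unpickling can execute arbitrary code, only files from a trusted source should be loaded this way.

```python
import pickle
import tempfile
from pathlib import Path


def summarize_pickle(path, preview=3):
    """Load a pickle and report its type, size, and first few keys if it is a dict."""
    with open(path, "rb") as handle:
        obj = pickle.load(handle)
    summary = {"type": type(obj).__name__}
    if isinstance(obj, dict):
        summary["n_entries"] = len(obj)
        summary["first_keys"] = list(obj)[:preview]
    return summary


# Demonstration on a toy dictionary written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_data_dict.pickle"
    toy = {"chr1_100": {"calls": (3, 1)}, "chr1_250": {"calls": (2, 2)}}
    with open(toy_path, "wb") as handle:
        pickle.dump(toy, handle)
    toy_summary = summarize_pickle(toy_path)

print(toy_summary)
```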
- Metadata objects for every SNP in the multi-species dataset with lightweight consequences annotated: total_snps
| Column Name | Description |
| --- | --- |
| variant.id | Unique identifier for each SNP variant in the dataset. |
| n | Number of unique annotations for each SNP. |
| col | Annotation of the SNP. |
| gene | Name of the gene in which the SNP is located. |
| chrom | Chromosome on which the SNP is located. |
| position | Genomic position of the SNP on the specified chromosome. |
- Metadata object for every SNP, but more detailed with annotations that detail specific protein consequences: snps_new
| Column Name | Description |
| --- | --- |
| variant.id | Unique identifier for each SNP variant in the dataset. |
| n | Number of unique annotations for each SNP. |
| col | Annotation of the SNP. |
| gene | Name of the gene in which the SNP is located. |
| aa.change | Describes the amino acid change caused by the SNP (if applicable). |
| protein | Identifier or name of the protein affected by the variant (if applicable). |
| chrom | Chromosome on which the SNP is located. |
| position | Genomic position of the SNP on the specified chromosome. |
| ch | Chromosome and position information. |
- Benchmarking Universal Single-Copy Orthologs (BUSCO) gene list shared between European and North American Daphnia pulex: single_copy_orthologs
- BUSCO genes were identified in North American and European Daphnia pulex using the BUSCO software, and only the genes scored as 'complete' were extracted.
- Exon list of every exon within the European Daphnia pulex genome (D84A): exon.list
- Daphnia_annotation_PANTHER.zip contains:
- 1) the European D. pulex (D84A) gene metadata and gene ontology (GO) terms (Daphnia_annotation_PANTHER.xls).
| Column Name | Description |
| --- | --- |
| qseqid | Query sequence identifier from the PANTHER input. |
| sseqid | Subject sequence identifier from the database. |
| pident | Percentage of identical matches between the query and subject sequences. |
| length | Length of the alignment between the query and the subject. |
| mismatch | Number of mismatches in the alignment. |
| gapopen | Number of gap openings in the alignment. |
| qstart | Starting position of the alignment in the query sequence. |
| qend | Ending position of the alignment in the query sequence. |
| sstart | Starting position of the alignment in the subject sequence. |
| send | Ending position of the alignment in the subject sequence. |
| evalue | E-value indicating the statistical significance of the alignment. |
| bitscore | Bit score representing the quality of the alignment. |
| qcovhsp | Query coverage per high-scoring pair (HSP). |
| organism | Organism from which the gene or protein is derived. |
| gene_name | Gene symbol or common name associated with the sequence. |
| ncbi_taxid | NCBI Taxonomy ID for the organism. |
| name | General gene name or alias. |
| id | Unique identifier for the gene isoform. |
| protein_name | Name of the protein product encoded by the gene. |
| description | Functional description of the gene or protein, including known roles or features. |
| PDB | Protein Data Bank identifier(s) for available protein structures. |
| KEGG | KEGG pathway identifier(s) associated with the gene or protein. |
| InterPro | InterPro domain and family information for the gene product. |
| RefSeq | RefSeq accession number for the gene or protein sequence. |
| eggNOG | EggNOG orthologous group information. |
| OrthoDB | OrthoDB orthology group identifier. |
| Pfam | Pfam domain information describing protein families and domains. |
| PROSITE | PROSITE pattern identifiers associated with the gene or protein. |
| BioCyc | BioCyc pathway and gene product information. |
| UniPathway | UniPathway pathway classification for the gene product. |
| PANTHER | PANTHER classification information detailing gene families or subfamilies. |
| TIGRFAMs | TIGRFAMs functional classification identifiers for protein families. |
| GeneID | Gene identifier from the NCBI Gene database. |
| biological process | Associated biological process terms (e.g., GO annotations) that describe the functional role of the gene. |
- 2) the European D. pulex GTF (Gene Transfer Format) file, which contains gene structures, including exon positions and transcript boundaries (Daphnia.aed.0.6.gtf). It includes the standard output from the MAKER genome annotation software (https://doi.org/10.1002/0471250953.bi0411s48).
Read Based Phasing Datasets
- WhatsHap phased VCF and index: daphnia.whatshap.ann.vcf.gz, daphnia.whatshap.ann.vcf.gz.tbi
- We generated these files by filtering the entire multi-species dataset and phasing it with WhatsHap. These VCFs were the input to our gene-specific phylogenetic tree-building pipeline (see cophenetic distances in the methods).
- WhatsHap phased BUSCO gene SNP metadata, subset from total_snps for successfully phased BUSCO SNPs: busco_snps_whatshap_phased
| Column Name | Description |
| --- | --- |
| variant.id | Unique identifier for each SNP variant in the dataset. |
| n | Number of unique annotations for each SNP. |
| col | Annotation of the SNP. |
| gene | Name of the gene in which the SNP is located. |
| chrom | Chromosome on which the SNP is located. |
| position | Genomic position of the SNP on the specified chromosome. |
- WhatsHap phased BUSCO gene SNP metadata, subset from total_snps for successfully phased BUSCO SNPs: busco_classified_snps_filt.rds
| Column Name | Description |
| --- | --- |
| variant.id | Unique SNP identifier. |
| classified | Classification of how the SNP is shared or polymorphic between and within species. |
| col | Classification of the SNP using the D84A genome. |
| simpleAnnot | Simplified classification of the SNP using the D84A genome. |
| ch | Unique identifier formed by pasting together the chromosome and position of the SNP. |
| gene | Gene on which the SNP is located. |
| aa.change | Amino acid change caused by the SNP. |
| protein | The protein that the SNP affects. |
| chrom | Chromosome on which the SNP is located. |
| position | Position of the SNP. |
| Daphnia.pulex.NorthAmerica | Minor allele frequency (MAF) of the SNP within North American Daphnia pulex. |
| Daphnia.pulex.Europe | Minor allele frequency (MAF) of the SNP within European Daphnia pulex. |
- WhatsHap phased SNPs metadata: snps_whatshap_phased
| Column Name | Description |
| --- | --- |
| variant.id | Unique identifier for each SNP variant in the dataset. |
| n | Number of unique annotations for each SNP. |
| col | Annotation of the SNP. |
| gene | Name of the gene in which the SNP is located. |
| chrom | Chromosome on which the SNP is located. |
| position | Genomic position of the SNP on the specified chromosome. |
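To gauge how much of a phased VCF such as daphnia.whatshap.ann.vcf.gz was actually phased, one can count genotypes written with the `|` separator (phased) versus `/` (unphased), as defined by the VCF format. The sketch below runs on an invented two-record VCF; `count_phased` is a hypothetical helper and is not part of the authors' pipeline.

```python
HEADER = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
          "FORMAT", "sampleA", "sampleB"]

# Toy VCF lines; the records are invented for illustration.
TOY_VCF = [
    "##fileformat=VCFv4.2",
    "\t".join(HEADER),
    "\t".join(["chr1", "100", ".", "A", "G", "50", "PASS", ".", "GT:PS", "0|1:100", "0/1:."]),
    "\t".join(["chr1", "150", ".", "C", "T", "50", "PASS", ".", "GT:PS", "1|0:100", "./.:."]),
]


def count_phased(lines):
    """Return (phased, total) counts of non-missing genotypes across all samples."""
    phased = total = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        gt_index = fields[8].split(":").index("GT")
        for sample in fields[9:]:
            genotype = sample.split(":")[gt_index]
            if "." in genotype:
                continue  # missing call
            total += 1
            if "|" in genotype:
                phased += 1
    return phased, total


phased, total = count_phased(TOY_VCF)
print(f"{phased} of {total} called genotypes are phased")
```

For the real file, the lines could be streamed with `gzip.open("daphnia.whatshap.ann.vcf.gz", "rt")`, since bgzip output is readable as gzip.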
Lift Over Datasets
- KAP4 translated to D84A chain file: american_to_european.liftOver.gz
- D84A translated to KAP4 chain file: european_to_american.liftOver.gz
- We can use the liftover chains in the VCF generation pipeline:
module load picard
java "-Xmx${JAVAMEM}" -jar $EBROOTPICARD/picard.jar LiftoverVcf \
I=chrom1.vcf \
O=chrom1.filt.am2eurolift.new.vcf.gz \
CHAIN=american_to_european_chredit.liftOver \
REJECT=chrom1.rejected_variants.vcf \
R=D84A.fa \
WARN_ON_MISSING_CONTIG=true \
RECOVER_SWAPPED_REF_ALT=true
- Metadata of SNPs following the lift over, listing the classification (i.e., fixed, poly_euro, poly_nam, shared_poly) for each species: unchanged_SNPs_across_assembly.RDS
| Column Name | Description |
| --- | --- |
| chrom | Chromosome on which the SNP is located. |
| position | Genomic position of the SNP on the specified chromosome. |
| KAP4_class | Classification of the SNP using the KAP4 genome. |
| ch | Unique identifier formed by pasting together the chromosome and position of the SNP. |
| gene | Gene on which the SNP is located. |
| simpleAnnot | A simplified annotation of the SNP. |
| D84A_class | Classification of the SNP using the D84A genome. |
| SNP_class_ID | Indicates whether the classifications between D84A and KAP4 are unchanged. |
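The comparison summarized by `SNP_class_ID` can be illustrated by tabulating `D84A_class` against `KAP4_class`. This is a sketch under assumptions: the records are invented, the `ch` key format shown is hypothetical, and "unchanged" is taken to mean that the two class labels are identical.

```python
from collections import Counter

# Toy records using the column names from the table above (values invented).
records = [
    {"ch": "chr1_100", "D84A_class": "shared_poly", "KAP4_class": "shared_poly"},
    {"ch": "chr1_250", "D84A_class": "poly_euro", "KAP4_class": "fixed"},
    {"ch": "chr2_75", "D84A_class": "fixed", "KAP4_class": "fixed"},
]


def concordance(records):
    """Count each (D84A_class, KAP4_class) pair across SNPs."""
    return Counter((r["D84A_class"], r["KAP4_class"]) for r in records)


def unchanged(records):
    """Return the `ch` keys of SNPs whose class is the same under both assemblies."""
    return [r["ch"] for r in records if r["D84A_class"] == r["KAP4_class"]]


table = concordance(records)
stable = unchanged(records)
print(table)
print(stable)
```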
- BED-format metadata file of SNPs whose classification (i.e., fixed, poly_euro, poly_nam, shared_poly) was unchanged following the lift over: unchanged_SNPs_across_assembly.bed
| Column Name | Description |
| --- | --- |
| chromosome | Chromosome name. |
| start | The start position of the SNP. |
| stop | The end position of the SNP. |
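When converting 1-based SNP positions (as in VCF) to BED intervals, the standard BED convention is 0-based and half-open, so a SNP at position p becomes `start = p - 1`, `stop = p`. The sketch below shows that conversion; whether unchanged_SNPs_across_assembly.bed follows this convention or stores `start == stop` is not stated here, so the coordinates should be checked before intersecting with other files. `snp_to_bed` is a hypothetical helper.

```python
def snp_to_bed(chrom, position):
    """Convert a 1-based SNP position to a 0-based, half-open BED interval."""
    if position < 1:
        raise ValueError("VCF-style positions are 1-based and must be >= 1")
    return (chrom, position - 1, position)


def bed_lines(snps):
    """Format (chrom, position) pairs as tab-separated BED lines."""
    return ["\t".join(str(field) for field in snp_to_bed(chrom, pos))
            for chrom, pos in snps]


# Toy SNP coordinates; invented for illustration.
lines = bed_lines([("chr1", 100), ("chr2", 1)])
print("\n".join(lines))
```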
- Metadata of SNPs that were unchanged following the lift over, including protein consequences and subset from snps_new: snps_new_strict, tot_snps_phased
| Column Name | Description |
| --- | --- |
| variant.id | Unique identifier for each SNP variant in the dataset. |
| n | Number of unique annotations for each SNP. |
| col | Annotation of the SNP. |
| gene | Name of the gene in which the SNP is located. |
| aa.change | Describes the amino acid change caused by the SNP (if applicable). |
| protein | Identifier or name of the protein affected by the variant (if applicable). |
| chrom | Chromosome on which the SNP is located. |
| position | Genomic position of the SNP on the specified chromosome. |
| ch | Chromosome and position information. |
- Metadata of the BUSCO SNPs that were unchanged following the lift over, including protein consequences and subset from snps_new: busco_snps_new_strict, busco_snps_phased
| Column Name | Description |
| --- | --- |
| variant.id | Unique identifier for each SNP variant in the dataset. |
| n | Number of unique annotations for each SNP. |
| col | Annotation of the SNP. |
| gene | Name of the gene in which the SNP is located. |
| aa.change | Describes the amino acid change caused by the SNP (if applicable). |
| protein | Identifier or name of the protein affected by the variant (if applicable). |
| chrom | Chromosome on which the SNP is located. |
| position | Genomic position of the SNP on the specified chromosome. |
| ch | Chromosome and position information. |
- Metadata of SNPs following lift over: snp.dt1.an_remapping_ann.RDS
| Column Name | Description |
| --- | --- |
| len | The number of unique annotations for that SNP. |
| ann | Complete annotation of the SNP from SnpEff. |
| id | The position of the SNP. |
| class | Annotation of the SNP. |
| gene | Name of the gene in which the SNP is located. |
| protein | Identifier or name of the protein affected by the variant (if applicable). |
| aa.change | Describes the amino acid change caused by the SNP (if applicable). |
snp_dt <- readRDS("snp.dt1.an_remapping_ann.RDS")
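The `ann` column holds SnpEff annotation strings, which in the standard ANN format are comma-separated entries of pipe-delimited fields (allele, annotation, impact, gene name, and so on, with the protein-level HGVS change in the eleventh field). The sketch below parses an invented ANN string; mapping the fields onto the `class`, `gene`, and `aa.change` columns is an assumption about how this table was derived, and `parse_ann` is a hypothetical helper.

```python
# Invented ANN string with two annotations, laid out as in SnpEff's ANN format.
TOY_ANN = (
    "G|missense_variant|MODERATE|GENE1|GENE1|transcript|T1|protein_coding|2/5|"
    "c.100A>G|p.Thr34Ala|100/900|100/900|34/299||,"
    "G|upstream_gene_variant|MODIFIER|GENE2|GENE2|transcript|T2|protein_coding||"
    "c.-50A>G|||||50|"
)


def parse_ann(ann):
    """Split an ANN string into one dict per annotation entry."""
    parsed = []
    for entry in ann.split(","):
        fields = entry.split("|")
        if len(fields) < 11:
            continue  # skip malformed or truncated entries
        parsed.append({
            "class": fields[1],               # annotation (effect) term
            "gene": fields[3],                # gene name
            "aa.change": fields[10] or None,  # HGVS.p, empty for non-coding effects
        })
    return parsed


annotations = parse_ann(TOY_ANN)
n_unique = len({a["class"] for a in annotations})  # assumed meaning of `len`
print(annotations)
print(n_unique)
```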