Replicated ecological gradients are prime systems to study processes of molecular evolution underlying ecological divergence. Here, we investigated the repeated adaptation of the neotropical fish Poecilia mexicana to habitats containing toxic hydrogen sulphide (H2S) and compared two population pairs of sulphide-adapted and ancestral fish by sequencing population pools of >200 individuals (Pool-Seq). We inferred the evolutionary processes shaping divergence and tested the hypothesis of increase of parallelism from SNPs to molecular pathways. Coalescence analyses showed that the divergence occurred in the face of substantial bidirectional gene flow. Population divergence involved many short, widely dispersed regions across the genome. Analyses of allele frequency spectra suggest that differentiation at most loci was driven by divergent selection, followed by a selection-mediated reduction of gene flow. Reconstructing allelic state changes suggested that selection acted mainly upon de novo mutations in the sulphide-adapted populations. Using a corrected Jaccard index to quantify parallel evolution, we found a negligible proportion of statistically significant parallel evolution of Jcorr = 0.0032 at the level of SNPs, divergent genome regions (Jcorr = 0.0061) and genes therein (Jcorr = 0.0091). At the level of metabolic pathways, the overlap was Jcorr = 0.2545, indicating increasing parallelism with increasing level of biological integration. The majority of pathways contained positively selected genes in both sulphide populations. Hence, adaptation to sulphidic habitats necessitated adjustments throughout the genome. The largely unique evolutionary trajectories may be explained by a high proportion of de novo mutations driving the divergence. Our findings favour Gould's view that evolution is often the unrepeatable result of stochastic events with highly contingent effects.
SNP FST file
This is the pairwise Fst output for individual SNPs. Non-informative sites have been suppressed, so every row is a SNP that meets the coverage criteria.
The columns are:
col1: scaffold number
col2: SNP position on scaffold
col3: number of SNPs in the window (the window size was set to 1 to get individual SNP estimates, so all values should be 1)
col4: fraction of the window covered (more relevant for sliding window analyses)
col5: mean coverage at SNP over all four populations
col6: pairwise Fst for Tac-C:Tac-S
col7: pairwise Fst for Tac-C:Puy-C
col8: pairwise Fst for Tac-C:Puy-S
col9: pairwise Fst for Tac-S:Puy-C
col10: pairwise Fst for Tac-S:Puy-S
col11: pairwise Fst for Puy-C:Puy-S
population codes: 1=Tac-C, 2=Tac-S, 3=Puy-C, 4=Puy-S
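The column layout above can be read with a few lines of Python. This is a minimal sketch, not the authors' own tooling: the function name `parse_fst_line` is hypothetical, the fields are assumed to be whitespace-separated, and each Fst field is assumed to be either a bare number or a number prefixed with a population-pair label such as `1:2=`. Check a few rows of the actual file before relying on it.

```python
# Population codes as given in this README.
POPULATIONS = {1: "Tac-C", 2: "Tac-S", 3: "Puy-C", 4: "Puy-S"}

# Order of the six pairwise comparisons in col6..col11.
PAIRS = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]


def parse_fst_line(line):
    """Parse one row of the .fst file into a dictionary.

    Assumption: whitespace-separated fields; each Fst field is a bare
    number or carries a 'pop:pop=' prefix, which is dropped.
    """
    fields = line.split()
    record = {
        "scaffold": fields[0],
        "position": int(fields[1]),
        "n_snps": int(fields[2]),
        "fraction_covered": float(fields[3]),
        "mean_coverage": float(fields[4]),
        "fst": {},
    }
    for (a, b), raw in zip(PAIRS, fields[5:11]):
        value = raw.split("=")[-1]  # drop an optional '1:2=' style label
        label = f"{POPULATIONS[a]}:{POPULATIONS[b]}"
        record["fst"][label] = float(value)
    return record
```

To stream the whole file, open `Tac-C_Tac-S_Puy-C_Puy-S.fst` and call `parse_fst_line` on each non-empty line; the result maps comparison labels such as `"Tac-C:Tac-S"` to their Fst values.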
Note that an Fst value of 0.00000000 means the SNP is not polymorphic in that pairwise comparison. For example, SNP 'NW_006799939.1 19911'
has an Fst of 0.00000000 in the Tac-C:Tac-S comparison because it is not polymorphic between Tac-C and Tac-S. From the sync file:
NW_006799939.1 19911 A 21:0:0:0:0:0 21:0:0:0:0:0 39:0:0:2:0:0 37:0:0:5:0:0
you can see that Tac-C and Tac-S are fixed for the A allele (21 counts in each). The only reason it is included as a SNP in the .fst file is that
it is polymorphic in the Puy-C and Puy-S populations (and polymorphic in the comparisons across Tac and Puy).
Tac-C_Tac-S_Puy-C_Puy-S.fst
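The sync-file line quoted in the note above can be checked programmatically. The sketch below assumes the usual PoPoolation2 sync layout, in which each population field holds colon-separated counts in the order A:T:C:G:N:del; this README does not state that order itself. The helper names are hypothetical, and the polymorphism check is a simplification that ignores any minimum-count or coverage thresholds applied in the original analysis.

```python
SYNC_ALLELES = ("A", "T", "C", "G", "N", "del")


def parse_sync_line(line):
    """Split a sync line into scaffold, position, reference base and
    one count dictionary per population (assumed order A:T:C:G:N:del)."""
    fields = line.split()
    counts = [
        dict(zip(SYNC_ALLELES, map(int, field.split(":"))))
        for field in fields[3:]
    ]
    return fields[0], int(fields[1]), fields[2], counts


def is_polymorphic(pop_a, pop_b):
    """True if more than one nucleotide is observed across the two pools."""
    seen = {base for base in "ATCG" if pop_a[base] + pop_b[base] > 0}
    return len(seen) > 1


line = "NW_006799939.1 19911 A 21:0:0:0:0:0 21:0:0:0:0:0 39:0:0:2:0:0 37:0:0:5:0:0"
_, _, _, (tac_c, tac_s, puy_c, puy_s) = parse_sync_line(line)
print(is_polymorphic(tac_c, tac_s))  # False: both pools carry only A
print(is_polymorphic(puy_c, puy_s))  # True: A and G are both present
```

This reproduces the reasoning in the note: the site is monomorphic within the Tac pair, yet still qualifies as a SNP because of the Puy populations.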
Fisher's exact test data
This is the Fisher's exact test output for each SNP in each pairwise comparison. The same SNP definition was used as for the .fst output, so these are the
FET results for the same SNPs included in the Fst output.
The structure of the file is the same as for Tac-C_Tac-S_Puy-C_Puy-S.fst. Instead of Fst values, the numbers are -log10 P-values.
Tac-C_Tac-S_Puy-C_Puy-S.fet
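Because the .fet values are stored as -log10 P-values, they need to be back-transformed before they can be read as probabilities. The following sketch shows the conversion and a threshold check done directly on the -log10 scale. The function names are hypothetical, and the Bonferroni-style correction is only an illustration, not necessarily the correction used in the original study.

```python
import math


def neg_log10_to_p(value):
    """Convert a -log10(P) value back to a P-value."""
    return 10.0 ** (-value)


def passes_threshold(neg_log10_p, alpha=0.05, n_tests=1):
    """Check significance on the -log10 scale.

    Illustrative Bonferroni-style cut-off: P < alpha / n_tests is
    equivalent to -log10(P) > -log10(alpha / n_tests).
    """
    return neg_log10_p > -math.log10(alpha / n_tests)
```

Working on the -log10 scale avoids floating-point underflow for very small P-values, which is one reason such files are often stored this way.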
1kb window FST
This file contains the output of the 1000bp sliding window Fst analysis. Structure of the output file is similar to Tac-C_Tac-S_Puy-C_Puy-S.fst:
col1: reference contig (chromosome)
col2: mean position of the sliding window
col3: number of SNPs found in the window (not considering sites with a deletion)
col4: fraction of the window with sufficient coverage (min. coverage <= cov <= max. coverage) in every population
col5: average minimum coverage in all populations
col6: 1:2 the pairwise Fst for population 1 and 2
col7: 1:3 the pairwise Fst for population 1 and 3
....
Tac-C_Tac-S_Puy-C_Puy-S.1000.fst
Puy-C Tajima's D
Tajima's D output for 1000 bp windows for Puy-C. The script has no option to suppress non-informative windows, so many windows
fail the coverage criteria and are reported as "na" (no value is calculated).
Puy-C.D
Puy-S Tajima's D
Tajima's D output for 1000 bp windows for Puy-S. The script has no option to suppress non-informative windows, so many windows
fail the coverage criteria and are reported as "na" (no value is calculated).
Puy-S.D
Tac-C Tajima's D
Tajima's D output for 1000 bp windows for Tac-C. The script has no option to suppress non-informative windows, so many windows
fail the coverage criteria and are reported as "na" (no value is calculated).
Tac-C.D
Tac-S Tajima's D
Tajima's D output for 1000 bp windows for Tac-S. The script has no option to suppress non-informative windows, so many windows
fail the coverage criteria and are reported as "na" (no value is calculated).
Tac-S.D
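Since the four Tajima's D files contain many "na" windows, it is usually convenient to drop those rows while reading. The sketch below is an assumption-laden example: this README does not list the columns of the .D files, so the code assumes whitespace-separated rows with the scaffold first, the window position second and the D statistic in the last field. The function name `read_tajima_d` is hypothetical.

```python
def read_tajima_d(lines):
    """Yield (scaffold, position, D) for windows with a computed value.

    Assumptions: scaffold in the first field, window position in the
    second, Tajima's D in the last; windows that failed the coverage
    criteria carry 'na' in the last field and are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[-1].lower() == "na":
            continue
        yield fields[0], int(fields[1]), float(fields[-1])
```

A file handle can be passed directly, for example `with open("Puy-C.D") as handle: windows = list(read_tajima_d(handle))`.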
infilePuy68 for Migrate-N
Data input file for Migrate-N analysis
infilePuy68.txt
infileTac67 for Migrate-N
Data input file for Migrate-N
infileTac67.txt
parmfile Migrate-N
Parameter file used for both population pairs
parmfile.txt
Allele frequency estimates from PoolSeq analysis
PoPoolation output file containing the read count data for polymorphic sites. The columns are:
col1: scaffold number (chr)
col2: position on the scaffold (pos)
col3: base in the reference genome (rc)
col4: number of alleles (allele_count)
col5: allelic states (allele_states)
col6: number of deletions (deletion_sum)
col7: whether the SNP is variable among populations or against the reference (snp_type)
col8: major alleles in the populations, in the order Tac-C, Tac-S, Puy-C, Puy-S (major_alleles(maa))
col9: minor alleles in the same population order (minor_alleles(mia))
col10-col13: frequency estimates of the major allele, expressed as a ratio of reads, for each population (maa_1, maa_2, maa_3, maa_4)
col14-col17: the same for the minor allele (mia_1, mia_2, mia_3, mia_4)
Tac-C_Tac-S_Puy-C_Puy-S_rc
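The allele frequency fields are described as ratios of reads, so they have to be converted before numerical use. The sketch below assumes each field is written as `reads/total` (for example `37/42`); that exact notation is an assumption and should be checked against the file. The function name `ratio_to_frequency` is hypothetical.

```python
def ratio_to_frequency(text):
    """Convert a 'reads/total' string into a decimal allele frequency.

    Returns None when the total is zero, i.e. the site has no usable
    coverage in that population.
    """
    count, total = (int(part) for part in text.split("/"))
    return count / total if total else None


# Illustrative values only, derived from the Puy-S counts in the sync
# line quoted earlier in this README (37 A reads, 5 G reads).
major = ratio_to_frequency("37/42")
minor = ratio_to_frequency("5/42")
print(round(major, 3), round(minor, 3))
```

Keeping the read counts alongside the decimal frequency is useful, because the total gives the pool coverage at the site and therefore the precision of the estimate.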