Data from: Same trait, different genes: pelvic spine loss in three brook stickleback populations in Alberta
Data files
Nov 21, 2024 version files 28.55 GB
* ast_chr19_outlier_region.vcf : 312.92 KB
* ast_ID_filename_pelvicpheno.csv : 3.78 KB
* AST_ninespine_geno20_maf01.mlma : 77.79 MB
* AST_snps_geno20_maf01_chr19.bed : 18.24 KB
* AST_snps_geno20_maf01_chr19.bim : 17.02 KB
* AST_snps_geno20_maf01_chr19.fam : 3.69 KB
* AST_snps_geno20_maf01_pruned.bed : 14.36 MB
* AST_snps_geno20_maf01_pruned.bim : 13.38 MB
* AST_snps_geno20_maf01_pruned.fam : 3.69 KB
* AST_snps_geno20_maf01_pruned.grm.bin : 28.56 KB
* AST_snps_geno20_maf01_pruned.grm.id : 2.62 KB
* AST_snps_geno20_maf01_pruned.grm.N.bin : 28.56 KB
* AST_snps_geno20_maf01.bed : 37.52 MB
* AST_snps_geno20_maf01.bim : 35.05 MB
* AST_snps_geno20_maf01.fam : 3.69 KB
* AST_snps.phen : 2.97 KB
* AST_snps.vcf : 10.60 GB
* chr19.range.file.txt : 28 B
* chr3.range.file.txt : 29 B
* mui_chr19_outlier_region.vcf : 269.10 KB
* mui_ID_filename_pelvicpheno.csv : 6.61 KB
* MUI_ninespine_geno20_maf01.mlma : 81.96 MB
* MUI_snps_geno20_maf01_chr19.bed : 15.65 KB
* MUI_snps_geno20_maf01_chr19.bim : 17.53 KB
* MUI_snps_geno20_maf01_chr19.fam : 2.59 KB
* MUI_snps_geno20_maf01_pruned.bed : 10.61 MB
* MUI_snps_geno20_maf01_pruned.bim : 11.87 MB
* MUI_snps_geno20_maf01_pruned.fam : 2.59 KB
* MUI_snps_geno20_maf01_pruned.grm.bin : 19.40 KB
* MUI_snps_geno20_maf01_pruned.grm.id : 1.71 KB
* MUI_snps_geno20_maf01_pruned.grm.N.bin : 19.40 KB
* MUI_snps_geno20_maf01.bed : 33.20 MB
* MUI_snps_geno20_maf01.bim : 37.22 MB
* MUI_snps_geno20_maf01.fam : 2.59 KB
* MUI_snps.phen : 1.91 KB
* MUI_snps.vcf : 8.79 GB
* README.md : 8.71 KB
* shu_chr3_outlier_region.vcf : 125.43 KB
* shu_ID_filename_pelvicpheno.csv : 7.41 KB
* SHU_ninespine_geno20_maf01.mlma : 72.74 MB
* SHU_snps_geno20_maf01_chr3.bed : 7.33 KB
* SHU_snps_geno20_maf01_chr3.bim : 8.20 KB
* SHU_snps_geno20_maf01_chr3.fam : 2.64 KB
* SHU_snps_geno20_maf01_pruned.bed : 9.55 MB
* SHU_snps_geno20_maf01_pruned.bim : 10.68 MB
* SHU_snps_geno20_maf01_pruned.fam : 2.64 KB
* SHU_snps_geno20_maf01_pruned.grm.bin : 19.01 KB
* SHU_snps_geno20_maf01_pruned.grm.id : 1.76 KB
* SHU_snps_geno20_maf01_pruned.grm.N.bin : 19.01 KB
* SHU_snps_geno20_maf01.bed : 29.16 MB
* SHU_snps_geno20_maf01.bim : 32.71 MB
* SHU_snps_geno20_maf01.fam : 2.64 KB
* SHU_snps.phen : 2.05 KB
* SHU_snps.vcf : 8.65 GB
* Spine_Asym.csv : 29.04 KB
Abstract
The genetic basis of phenotypic or adaptive parallelism can reveal much about constraints on evolution. This study investigated the genetic basis of a canonically parallel trait: pelvic spine reduction in sticklebacks. Pelvic reduction has a highly parallel genetic basis in threespine stickleback in populations around the world, always involving a deletion of the pel1 enhancer of Pitx1. In three populations of brook stickleback in Alberta, Canada, pelvic reduction did not involve Pitx1. Instead, pelvic reduction in one population involved a mutation in an exon of Tbx4, and it involved a mutation in an intron of Lmbr1 in the other two populations. Hence, the parallel phenotypic evolution of pelvic spine reduction across stickleback genera, and among brook stickleback populations, has a non-parallel genetic basis. This suggests that there is redundancy in the genetic basis of this adaptive polymorphism, but it is not clear whether this indicates a lack of constraint on the evolution of this adaptive trait. Whether the different pleiotropic effects of different mutations have different fitness consequences, or whether certain pelvic reduction mutations confer specific benefits in certain environments, remains to be determined.
README: Same trait, different genes: pelvic spine loss in three brook stickleback populations in Alberta
Description of the data and file structure
The raw DNA sequence data are not included here, but are accessible in the Sequence Read Archive (SRA) database: accession numbers PRJNA895038 (Astotin Lake), PRJNA838068 (Muir Lake), and PRJNA838194 (Shunda Lake).
The raw sequence data was used to call SNPs in each lake using the SNP-calling pipeline available at https://github.com/jon-mee/culaea_wgs_SNPs
The SNP data for all three lakes are provided here in vcf format and Plink binary format. There are three versions of the SNP data: 1) not filtered for minor allele frequency (MAF), 2) filtered for MAF, and 3) LD-pruned. There are also files containing a genetic relatedness matrix (GRM) for each lake.
For each lake, there is also a vcf (variant call format) file and Plink binary files with just the SNPs in the outlier region (these are much smaller files).
File Details
Details for Astotin Lake data
* AST_snps.vcf : not filtered for MAF
* AST_snps_geno20_maf01_pruned.bed, AST_snps_geno20_maf01_pruned.bim, AST_snps_geno20_maf01_pruned.fam: Plink binary format, LD-pruned
* AST_snps_geno20_maf01_pruned.grm.bin, AST_snps_geno20_maf01_pruned.grm.id, AST_snps_geno20_maf01_pruned.grm.N.bin : GRM files
* AST_snps_geno20_maf01.bed, AST_snps_geno20_maf01.bim, AST_snps_geno20_maf01.fam : Plink binary format, filtered for MAF
* AST_snps.phen : first two columns contain sample names, third column contains pelvic phenotype coded as 1 = spined, 2 = intermediate, 3 = unspined
* ast_ID_filename_pelvicpheno.csv : sample names and pelvic phenotypes in CSV format
* AST_ninespine_geno20_maf01.mlma : GWAS results
  * Variables
    * 'Chr' : chromosome
    * 'SNP' : SNP
    * 'bp' : physical position
    * 'A1' : reference allele (the coded effect allele)
    * 'A2' : the other allele
    * 'Freq' : frequency of the reference allele
    * 'b' : the additive effect (fixed effect) of the candidate SNP
    * 'se' : standard error
    * 'p' : p-value
* ast_chr19_outlier_region.vcf : subset of filtered SNPs in outlier region
* AST_snps_geno20_maf01_chr19.bed, AST_snps_geno20_maf01_chr19.bim, AST_snps_geno20_maf01_chr19.fam : subset of filtered SNPs in outlier region, Plink binary format
* chr19.range.file.txt : small file listing the chromosome number and physical coordinates of the edges of the outlier region
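As a quick sanity check on the phenotype coding described above, the following sketch tallies individuals per pelvic phenotype in a `.phen` file. It assumes only what the README states (whitespace-separated columns, no header, phenotype code in the third column); the function name `count_phenotypes` is hypothetical, and rows with any other code (for example missing values) are ignored.

```shell
# Tally pelvic phenotypes in a .phen file (columns: sample name, sample name, code).
# Assumption: no header row; codes are 1 = spined, 2 = intermediate, 3 = unspined.
count_phenotypes() {
  awk '
    BEGIN { label[1] = "spined"; label[2] = "intermediate"; label[3] = "unspined" }
    ($3 in label) { n[$3]++ }   # skip rows whose third column is not 1, 2, or 3
    END { for (k = 1; k <= 3; k++) printf "%s\t%d\n", label[k], n[k] + 0 }
  ' "$1"
}

# Example call; only runs if the file is in the current directory.
if [ -f AST_snps.phen ]; then
  count_phenotypes AST_snps.phen
fi
```

The same function should work unchanged on `MUI_snps.phen` and `SHU_snps.phen`, since all three files are described with the same layout.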
Details for Muir Lake data
* MUI_snps.vcf : not filtered for MAF
* MUI_snps_geno20_maf01_pruned.bed, MUI_snps_geno20_maf01_pruned.bim, MUI_snps_geno20_maf01_pruned.fam: Plink binary format, LD-pruned
* MUI_snps_geno20_maf01_pruned.grm.bin, MUI_snps_geno20_maf01_pruned.grm.id, MUI_snps_geno20_maf01_pruned.grm.N.bin : GRM files
* MUI_snps_geno20_maf01.bed, MUI_snps_geno20_maf01.bim, MUI_snps_geno20_maf01.fam : Plink binary format, filtered for MAF
* MUI_snps.phen : first two columns contain sample names, third column contains pelvic phenotype coded as 1 = spined, 2 = intermediate, 3 = unspined
* mui_ID_filename_pelvicpheno.csv : sample names and pelvic phenotypes in CSV format
* MUI_ninespine_geno20_maf01.mlma : GWAS results
  * Variables
    * 'Chr' : chromosome
    * 'SNP' : SNP
    * 'bp' : physical position
    * 'A1' : reference allele (the coded effect allele)
    * 'A2' : the other allele
    * 'Freq' : frequency of the reference allele
    * 'b' : the additive effect (fixed effect) of the candidate SNP
    * 'se' : standard error
    * 'p' : p-value
* mui_chr19_outlier_region.vcf : subset of filtered SNPs in outlier region
* MUI_snps_geno20_maf01_chr19.bed, MUI_snps_geno20_maf01_chr19.bim, MUI_snps_geno20_maf01_chr19.fam : subset of filtered SNPs in outlier region, Plink binary format
* chr19.range.file.txt : small file listing the chromosome number and physical coordinates of the edges of the outlier region
Details for Shunda Lake data
* SHU_snps.vcf : not filtered for MAF
* SHU_snps_geno20_maf01_pruned.bed, SHU_snps_geno20_maf01_pruned.bim, SHU_snps_geno20_maf01_pruned.fam: Plink binary format, LD-pruned
* SHU_snps_geno20_maf01_pruned.grm.bin, SHU_snps_geno20_maf01_pruned.grm.id, SHU_snps_geno20_maf01_pruned.grm.N.bin : GRM files
* SHU_snps_geno20_maf01.bed, SHU_snps_geno20_maf01.bim, SHU_snps_geno20_maf01.fam : Plink binary format, filtered for MAF
* SHU_snps.phen : first two columns contain sample names, third column contains pelvic phenotype coded as 1 = spined, 2 = intermediate, 3 = unspined
* shu_ID_filename_pelvicpheno.csv : sample names and pelvic phenotypes in CSV format
* SHU_ninespine_geno20_maf01.mlma : GWAS results
  * Variables
    * 'Chr' : chromosome
    * 'SNP' : SNP
    * 'bp' : physical position
    * 'A1' : reference allele (the coded effect allele)
    * 'A2' : the other allele
    * 'Freq' : frequency of the reference allele
    * 'b' : the additive effect (fixed effect) of the candidate SNP
    * 'se' : standard error
    * 'p' : p-value
* shu_chr3_outlier_region.vcf : subset of filtered SNPs in outlier region
* SHU_snps_geno20_maf01_chr3.bed, SHU_snps_geno20_maf01_chr3.bim, SHU_snps_geno20_maf01_chr3.fam : subset of filtered SNPs in outlier region, Plink binary format
* chr3.range.file.txt : small file listing the chromosome number and physical coordinates of the edges of the outlier region
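To locate the outlier regions in the `.mlma` GWAS results, one option is to list the SNPs with the smallest p-values. The sketch below assumes the nine columns listed above appear in that order with a single header row and whitespace (tab) separation, and that untestable SNPs are marked `nan` or `NA`; the function name `top_mlma_hits` is hypothetical. It relies on `sort -g`, which handles p-values in scientific notation and is available in GNU and macOS `sort`.

```shell
# Print the N most significant SNPs from a GCTA .mlma file.
# $1 = .mlma file, $2 = number of rows to show (default 10).
# Assumption: column 9 is 'p', and row 1 is the header.
top_mlma_hits() {
  awk 'NR > 1 && $9 != "nan" && $9 != "NA" && $9 != ""' "$1" |
    LC_ALL=C sort -g -k9,9 |
    head -n "${2:-10}"
}

# Example call; only runs if the file is in the current directory.
if [ -f SHU_ninespine_geno20_maf01.mlma ]; then
  top_mlma_hits SHU_ninespine_geno20_maf01.mlma 5
fi
```

The chromosome and position of the top rows can then be compared with the coordinates in the corresponding range file.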
Details for asymmetry data
* Spine_Asym.csv
  * Variables
    * Fish ID : Sample name
    * GTsamplename : Sample name
    * Lake : Population name
    * Year : Year sampled
    * Sex_Field : Sex identified in the field (M = male, F = female, U = unknown)
    * Pelvic.pheno : pelvic phenotype
    * Gene : gene associated with SNP outlier
    * SNP_genotype : genotype at the outlier SNP
    * geno.num : genotype converted to a number (1 = homozygote, 2 = heterozygote, 3 = homozygote)
    * Plink_code_phenotype : phenotype converted to a number (1 = spined, 2 = intermediate, 3 = unspined)
    * Girdle_L_length : left pelvic girdle length
    * Girdle_L_width : left pelvic girdle width
    * Girdle_L_area : left pelvic girdle area
    * Girdle_R_length : right pelvic girdle length
    * Girdle_R_width : right pelvic girdle width
    * Girdle_R_area : right pelvic girdle area
    * Girdle_process : length of the pelvic girdle process (extending posterior from the girdle)
    * Spine_L_length : left spine length
    * Spine_R_length : right spine length
    * Girdle_asym : asymmetry measure for pelvic girdle area
    * Spine_asym : asymmetry measure for spines
    * Measured_by : the undergraduate student RA who performed the measurements
    * Units : the units for the area and length measurements
* Missing data and missing information are indicated by cells containing "NA". Some spine and pelvic girdle lengths could not be measured because the structure did not exist (e.g. in an intermediate individual). For samples measured by C. Ly, girdle length and girdle process length were not measured. If individuals were not genotyped, there is an NA in the GTsamplename, SNP_genotype, and geno.num fields.
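Because missing values are written as "NA", any summary of the measurement columns needs to skip those cells. The sketch below computes the mean of one named column while ignoring "NA" and empty cells. It assumes a plain comma-separated file with a header row and no commas embedded inside quoted fields, which has not been verified against `Spine_Asym.csv`; the function name `column_mean` is hypothetical.

```shell
# Mean of a named CSV column, skipping "NA" and empty cells.
# $1 = CSV file, $2 = column name exactly as it appears in the header.
# Assumption: simple CSV (no embedded commas in quoted fields).
column_mean() {
  awk -F',' -v col="$2" '
    NR == 1 {
      # Find the requested column by its header name.
      for (i = 1; i <= NF; i++) {
        h = $i
        gsub(/^"|"$|\r/, "", h)
        if (h == col) c = i
      }
      if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
      next
    }
    {
      v = $c
      gsub(/^"|"$|\r/, "", v)
      if (v != "NA" && v != "") { sum += v; n++ }
    }
    END { if (n) printf "%.3f\n", sum / n }
  ' "$1"
}

# Example call; only runs if the file is in the current directory.
if [ -f Spine_Asym.csv ]; then
  column_mean Spine_Asym.csv Spine_L_length
fi
```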
Code/Software
I was not able to use the Genome-wide Complex Trait Analysis (GCTA) software package to run an MLMA analysis on the HPC cluster I was using. To run the GWAS with GCTA on a Mac computer:
- Download GCTA from https://yanglab.westlake.edu.cn/software/gcta/#Download
- Make sure your Plink-formatted filtered SNPs, the GRM from the LD-pruned set of SNPs, and a file containing phenotype information (e.g. a .phen file) are all in the same directory.
- Open Terminal.
- Navigate to the directory containing your SNPs.
- Run the following line of code (e.g. to analyze the Astotin Lake SNPs):
~/gcta_1.92.2beta_mac/bin/gcta64 --mlma --bfile AST_snps_geno20_maf01 --grm AST_snps_geno20_maf01_pruned --pheno AST_snps.phen --out AST_ninespine_geno20_maf01
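The same command applies to the other two lakes once the `AST` prefix is swapped for `MUI` or `SHU`. The sketch below only prints the three command lines so they can be inspected before running. The function name `mlma_cmd` and the `GCTA` variable are hypothetical, and the default path simply mirrors the install location used above; adjust it to wherever `gcta64` lives on your machine.

```shell
# Path to the gcta64 binary; override by setting GCTA before running.
GCTA="${GCTA:-$HOME/gcta_1.92.2beta_mac/bin/gcta64}"

# Print the MLMA command for one lake prefix (AST, MUI, or SHU).
mlma_cmd() {
  lake="$1"
  printf '%s --mlma --bfile %s_snps_geno20_maf01 --grm %s_snps_geno20_maf01_pruned --pheno %s_snps.phen --out %s_ninespine_geno20_maf01\n' \
    "$GCTA" "$lake" "$lake" "$lake" "$lake"
}

# Dry run: show the command for each lake without executing it.
for lake in AST MUI SHU; do
  mlma_cmd "$lake"
done
```

Each printed line can be pasted into Terminal from the directory that holds the input files. GCTA appends `.mlma` to the `--out` prefix, which matches the results file names listed in this README.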
R scripts to create each of the panels in Figure 2:
* Manhattan_Plot_AST_ninespine.R
* Manhattan_Plot_MUI_ninespine.R
* Manhattan_Plot_SHU_ninespine.R
Bash scripts (written to run on a SLURM-based HPC) to make Plink and vcf formatted subsets of the SNP data for calculation of LD in the outlier regions:
* AST_LD_analysis.sh
* MUI_LD_analysis.sh
* SHU_LD_analysis.sh
R script to plot and analyse the genotype-phenotype association for the outlier SNPs (for Figure 3):
* geno-pheno_association.R
R script to make asymmetry plot (Figure 4):
* Spine_asym.R
Bash script (written to run on a SLURM-based HPC) to analyse Tajima's D in windows (note that this is based on the non-MAF-filtered SNP data):
* 02_Tajima.sh
R script to make Figure 5:
* Tajima.R