Skip to main content

Data from: Limited evidence of a genetic basis for sex determination in the common creek chub, Semotilus atromaculatus

Cite this dataset

Meuser, Amanda V.; Pyne, Cassandre B.; Mandeville, Elizabeth G. (2022). Data from: Limited evidence of a genetic basis for sex determination in the common creek chub, Semotilus atromaculatus [Dataset]. Dryad.


Sexual reproduction is almost universal in vertebrates of the animal kingdom; therefore, each species which use it must have a mechanism for designating sex as male or female. Fish especially have a wide range of sex determining systems. In the present study, we aimed to identify a genetic basis for sex determination in the common creek chub (Semotilus atromaculatus) using genotyping-by-sequencing data. No sex-associated markers were found by RADSex or a GWAS using GEMMA, however, our Weir and Cockerham locus-specific FST analysis and discriminant analysis of principal components revealed some genetic differentiation between the sexes at several loci. While no explicit sex determination mechanism has been yet discovered in creek chub, these loci are potential candidates for future studies. Incompatible systems are thought to increase reproductive isolation but interspecific hybridization is common among groups such as cyprinid minnows; thus, studies such as ours can provide insight into hybridization and evolutionary diversification of this clade. We also highlight technical challenges involved in studying sex determination in species with extremely variable mechanisms.


1. Experimental model and subject details

Creek chub fish were collected from 14 agricultural streams across Southern Ontario in the summer of 2019. Fish were euthanized using an overdose of MS-222, then fin tissue was sampled and preserved in 95% ethanol. 79 individuals were used in this study: 38 females and 41 males. Sex was determined through dissection and no malformed or unusual gonads were noted during sampling. All fish were categorized as either male, female, or juvenile, so any fish that were not confidently sexed were likely recorded as juvenile, rather than being mis-sexed. Misidentification is therefore still possible, although perhaps unlikely.

2. DNA isolation and genomic sequencing

We extracted DNA from each of the samples using the QIAGEN DNeasy Blood & Tissue Kit (Qiagen, Inc.), according to the manufacturer’s instructions. Following DNA extraction, we prepared genomic libraries using a reduced-representation genotyping by sequencing method (Parchman et al. 2012) to prepare samples for high-throughput sequencing. DNA was digested with EcoRI and MseI restriction enzymes, then adaptors containing the adaptor sequence, barcode, cut site, and a protector base were ligated onto the fragments, before being amplified with Illumina PCR primers added onto either end (Parchman et al. 2012). After amplification, the DNA samples were sent to the Genomic Sequencing and Analysis Facility at the University of Texas at Austin. There, the samples were size selected for fragments 300-400 base pairs in length using the Sage Science Blue Pippin, then single-end sequencing of 100 base pair fragments was performed with an Illumina NovaSeq 6000. All computing was accomplished via an allocation on Compute Canada’s high-performance computing cluster, Graham, with the exception of data visualization in RStudio on a local computer.

3. Analysis with RADSex

We used several complementary methods to investigate the genetic basis of sex determination in creek chub. We first demultiplexed the raw FASTQ file using sabre ( joshi/sabre), resulting in one FASTQ file for each individual fish. We performed no additional quality control on the reads. While there may be some unintended additional sequences remaining in the data (e.g. primers, adaptors), these reads are not expected to bias our results because they would not align to a fish genome and would not be sex-specific. Initial analysis of sex-associated markers in the creek chub genome was performed using RADSex, a computational workflow intended as an “all-in-one” GSD detection program and visualizer, with the assistance of the RStudio package sgtr (Feron et al. 2021). RADSex treats all markers as unassociated genotypes, rather than alleles of the marker, and searches for markers that are found in one sex exclusively, at a statistically significant level (Feron et al. 2021). Additionally, it does not require a genome for alignment, making it ideal for use with non-model organisms, and does not require imputation before use as it allows for missing data. RADSex takes an input of a FASTQ file of demultiplexed reads for each individual and a file with a list of each individual and their sex. First, the software creates a table containing each marker and its read depth, then the table is filtered such that only markers with a certain read depth are retained (Feron et al. 2021). To fit the low-coverage GBS data used for the present study, we set the depth threshold to one read. Next, RADSex identifies makers that are significantly associated with sex, using Pearson’s chi-squared test of independence (Feron et al. 2021). We visualized the output in RStudio using the radsex distrib function from the accompanying R package, sgtr, which creates a tile plot.

4. Identification of variant sites and filtering of VCF files

As there is not yet a reference genome for creek chub, we constructed an artificial reference genome with CD-HIT (as described in LaCava et al. 2019). We aligned FASTQ files to this reference and sorted into contigs using the Burrows-Wheeler Alignment tool (BWA) (Li & Durbin 2009, version 0.7.17). We identified variable genetic loci using bcftools mpileup (Li & Durbin 2009, version 1.11) and converted the resulting BCF to a VCF file using SAMtools (Li et al. 2009, version 1.12). We then filtered the resulting VCF using VCFtools (Danecek et al. 2011). We produced two different filtered VCFs to fit our analyses. First, we produced a higher-coverage VCF. This VCF originally contained 616,477 markers, but was filtered to require that retained loci have data for 40% or more of individuals and a minor allele frequency of 0.01. Markers were thinned such that only one marker on each contig was kept, so that markers being analyzed were completely unlinked. This filtered VCF file contained 3052 markers, had a mean read depth of 33 reads per marker, and a median read depth of 20 reads per marker. However, the filter initially applied for missing data may have excluded markers that were associated with sex. A sex-associated marker may only be present in about 40% or less of the sample population, as the marker may not be present in one sex or every individual of that sex due to stochasticity in library preparation and sequencing. After re-filtering to adjust the missing data proportion to 0.2 (markers present in only 20% or fewer individuals were excluded) the new VCF file contained 16020 markers, had a mean depth of 11 reads per marker, and a median depth of 8 reads per marker. We also aligned raw data to a draft de novo genome assembly for the northern redbelly dace (Chrosomus eos; Schultz & Mandeville, unpublished data). Although this reference genome is more distantly related to creek chub and we therefore receive a lower alignment rate than aligning to the artificial creek chub reference genome (95% alignment to the artificial reference genome and 73% to the dace genome), aligning to a reference with larger scaffolds might be preferable for this analysis, as it could allow better identification of physically linked loci associated with sex determination (Gopalakrishnan et al. 2017, Feron et al. 2021). We performed the alignment and created a VCF file with the same methodology as above, with a missing data allowance of 40%. This VCF file originally contained 1,233,215 markers, which dropped to 1132 after filtering, and had a mean read depth of 25 and a median read depth of 15.

5. GWAS with GEMMA

As there is some evidence that RADSex not work as well for datasets with lower average sequencing depth (as detailed in the methods paper describing RADSex; Feron et al. 2021), we also applied a number of other analyses to identify potential sex-specific or sex-associated loci. We next performed a GWAS using GEMMA, a single-locus GWAS software (Zhou & Stephens 2012). Essentially, GEMMA treats each marker as a polymorphic allele and searches for two genotypes present exclusively in either sex. GEMMA requires genotype data at all markers, therefore, imputation of missing genotypes must be done beforehand. This increases statistical power compared to eliminating markers without genotypes for all individuals, which would drastically reduce the total number of markers used in the analysis (Guan & Stephens 2008). We used the software entropy for imputation of missing genotypes, as it estimates posterior genotype probabilities at each locus, for each individual (Gompert et al. 2014, Shastry et al. 2021). The phenotype file contains a numeric character for the trait of each individual (Zhou & Stephens 2012). As the trait being examined, sex, is binary, 0 represented male and 1 represented female. The annotation file contained the information for each marker: the marker ID, base position, and contig number, sorted in the same order as the genotype file (Zhou & Stephens 2012). A relatedness matrix is first created that shows a measure of how related each of the individuals are to one another, then this file is used to run the univariate linear mixed model association test (Zhou & Stephens 2012). GEMMA results were visualized in RStudio, using a Manhattan plot function from a tutorial by the Abecasis lab. The threshold for this plot is set at -log10(5e-6). Loci with p-values above this threshold on the Manhattan plot are considered associated with sex at a statistically significant level and could indicate regions of the genome that are part of either a homomorphic sex chromosome or a polygenic sex determination system.

6. Genetic differentiation between sexes

Regions of the genome associated with sex determination are expected to be highly differentiated between males and females of a species relative to other regions of the genome. To complement GWAS analyses, we calculated Weir and Cockerham locus-specific FST using VCFtools (Danecek et al. 2011, Weir & Cockerham 1984). We used Rosner’s test to identify significant outliers from the Weir and Cockerham FST values (Dan & Ijeoma 2013). While the test is innately two-tailed and the outliers being examined only exist in the upper tail, the test is still suitable as any negative values returned will be excluded because they are not biologically relevant. As well, the data set fits the test’s assumption of a normal distribution. The test was performed for all three data sets in RStudio, using the rosnerTest command from the package, EnvStats. The value of alpha (p-value) was set at the default 0.05. 

7. Discriminant analysis of principal components

We completed an additional analysis, a discriminant analysis of principal components (DAPC) (Jombart et al. 2010), through the R package adegenet (Jombart 2008) to identify loci that contribute strongly to genetic differences between male and female individuals (similar to Junker et al. 2020). We first identified sex-associated loci using DAPC, then examined patterns of heterozygosity at the loci most strongly associated with differentiation between sexes.

8. Linkage disequilibrium of FST and DAPC outlier loci

Linkage disequilibrium was quantified by calculating the squared coefficient of correlation (r2) between pairwise comparisons of all markers using VCFtools (Danecek et al. 2011, Pritchard & Przeworski 2001). Using RStudio, we then took subsets of the data that only contained markers deemed outliers from DAPC and the Rosner’s Test on FST values. A Whitney-Mann U test was performed on the data, using the function wilcox.test, to assess whether the means of the subsets were significantly different from those of the entire relevant dataset, therefore undergoing less recombination than the rest of the genome (Mann & Whitney 1947, Wilcoxon 1945). This was done for all 3 VCF files, resulting in 6 calculations.

Usage notes

Below is the amount of markers in each VCF file used for each analysis this study:

16020 - CCCS_cccsref0.9test_miss0.2_mac3_Q20_knownsexinds_thin90_maf001.recode.vcf

3052 - CCCS_cccsref0.9test_miss0.4_mac3_Q30_DP3_ind95_maf001_1snpcontig_sexinds.recode.vcf

1132 - CC_alignedjune2021_miss0.4_mac3_Q20_DP3_thin90_maf001.recode.vcf

fastq_metadata.xlsx is an Excel doc containing metadata about each individual used for this study. 

Manhattan_function.R contains the function that was adapted from the Abecasis lab to create the Manhattan plots, while the manhattanplot_...R files contain the actual script for generating the plot itself. The 1132.R file requires the use of a .csv file with the chromosome lengths from the assembly. Suggested code for creating this .csv file from the assembly.fasta file can be found at the top of the 1132.R file.

venndiagram.R requires FST_DAPC_outliers.txt, which is a tab-delimited file that contains columns of significant markers. Columns must be separated by data set and analysis, ie. fst_16020, dapc_dapc, etc.

The marker_depth.R file uses .idepth files, created by using the --site-mean-depth command in VCFtools, for each VCF file.

dapc.R takes VCF files to calculate Discriminant analysis of principal components (DAPC).

The FST_GEMMA_plots.R and FST_plots_outliers.R files take the .assoc.txt output files from GEMMA and the .weir.fst files from FST calculation as input.


Canada First Research Excellence Fund Food From Thought program