Data from: The neurodevelopmental genes alan shepard and Neuroglian contribute to female mate preference in African Drosophila melanogaster
Data files
May 08, 2024 version files (59.02 MB total)
cline_snp.zip (5.54 MB)
clines.zip (2.66 MB)
GO_enrichment.zip (464.10 KB)
New_popgen_analysis.zip (50.34 MB)
Nrg_analysis.R (4.26 KB)
README.md (8.05 KB)
shep_analysis.R (1.94 KB)
May 30, 2024 version files (58.68 MB total)
cline_snp.zip (5.54 MB)
clines.zip (2.48 MB)
GO_enrichment.zip (307.57 KB)
New_popgen_analysis.zip (50.34 MB)
Nrg_analysis.R (4.26 KB)
README.md (8.05 KB)
shep_analysis.R (1.94 KB)
Abstract
Mate choice is a key trait that determines fitness for most sexually reproducing organisms, with females often being the choosy sex. Female preference often results in strong selection on male traits that can drive rapid divergence of traits and preferences between lineages, leading to reproductive isolation. Despite this fundamental property of female mate choice, very few loci have been identified that contribute to mate choice and reproductive isolation. We used a combination of population genetics, quantitative complementation tests, and behavioral assays to demonstrate that alan shepard and Neuroglian contribute to female mate choice, and could contribute to partial reproductive isolation between populations of Drosophila melanogaster. Our study is among the first to identify genes that contribute to female mate preference in this historically important system, where female preference is an active premating barrier to reproduction. The identification of loci that are primarily known for their roles in neurodevelopment provides intriguing questions of how female mate preference evolves in populations via changes in sensory system and higher learning brain centers.
https://doi.org/10.5061/dryad.cvdncjtbg
Description of the data and file structure
In this experiment, we wanted to find genes important for female mating behavior that contribute to female preference between populations of Drosophila melanogaster.
Cline Analysis
We started by conducting a population genetic analysis of clines using publicly available data. The files and scripts used in this analysis are contained in two folders, whose contents are described here.
Folder: cline_snp
This folder contains the initial analysis of the clines, including the following files:
snp_freq.R This file splits the raw genotype data into subfiles. This was a necessary step so that we could analyze the clines in parallel. We focused on separating SNPs that had fixed differences from those with segregating differences.
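The split into fixed and segregating SNPs can be illustrated with a minimal Python sketch. It is not the authors' snp_freq.R. It assumes that "fixed" means the allele frequency is 0 at one end of the cline and 1 at the other, which is one reading of the description here; the function name, the tolerance parameter, and the example rows are all hypothetical.

```python
def classify_snp(freqs, tol=0.0):
    """Label a SNP from its allele frequencies ordered along the cline.

    Assumption: a SNP counts as "fixed" when the two end populations
    carry opposite alleles at frequency 0 and 1 (within tol).
    """
    first, last = freqs[0], freqs[-1]
    ends_fixed = all(f <= tol or f >= 1.0 - tol for f in (first, last))
    if ends_fixed and abs(first - last) >= 1.0 - 2 * tol:
        return "fixed"
    return "segregating"


# Hypothetical rows: (chromosome, position, frequencies along the cline)
rows = [
    ("2L", 1000, [0.0, 0.2, 0.6, 1.0]),
    ("3L", 2000, [0.1, 0.3, 0.5, 0.8]),
    ("X", 3000, [1.0, 0.7, 0.4, 0.0]),
]

groups = {"fixed": [], "segregating": []}
for chrom, pos, freqs in rows:
    groups[classify_snp(freqs)].append((chrom, pos))

print(groups)
```

Each group could then be written to its own file so that the two classes are analyzed separately.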
The three summary files listed below contain the SNP cline results and share the same variables. cline_fix_sum v2.csv differs from cline_fix_sum.csv only by an added column used to sort by the total number of populations that contained a SNP.
The first column is an identifier for each row. Chromosome refers to the chromosome in the D. melanogaster genome. SNP position is the location on the chromosome in the Dmel v5 reference genome; in later analyses we converted these coordinates to identify gene locations. The Freq1 through Freq12 columns are the allele frequencies across the cline. Slope is the regression coefficient from the linear model, and pvalue is the p-value for this slope.
cline_fix_sum v2.csv
cline_fix_sum.csv
cline_seg_sum.csv
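To make the Slope column concrete, here is a minimal Python sketch of an ordinary least-squares slope of allele frequency on latitude. It is an illustration, not the authors' R code: the README lists pop latitudes.txt as the latitudes used in calculating the clines, but the function name and the numbers below are hypothetical.

```python
def cline_slope(latitudes, freqs):
    """Return (slope, intercept) of the least-squares line freq ~ latitude."""
    n = len(latitudes)
    mean_x = sum(latitudes) / n
    mean_y = sum(freqs) / n
    # Sum of squares of the predictor and the cross-product with the response
    sxx = sum((x - mean_x) ** 2 for x in latitudes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(latitudes, freqs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical latitudes and allele frequencies for one SNP
latitudes = [-20.0, -10.0, 0.0, 10.0]
freqs = [0.10, 0.35, 0.45, 0.70]

slope, intercept = cline_slope(latitudes, freqs)
print(round(slope, 4), round(intercept, 4))
```

The p-value for the slope needs a t distribution, which the Python standard library does not provide; a library routine such as scipy.stats.linregress reports it alongside the slope.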
The next set of files is organized in groups of three: the input csv, the R code used to calculate the clines, and the output csv. The variables are the same as those described above.
V1 is the chromosome in the D. melanogaster genome. V2 is the SNP position on that chromosome in the Dmel v5 reference genome; in later analyses we converted these coordinates to identify gene locations. The remaining columns are the allele frequencies across the cline.
The data were split into separate files to ease parallel processing. Each R file is the same, except that it reads a different input and writes to a unique output.
fixed1.csv
snp_cline_fix1.R
snp_clines_fix1.csv
fixed2.csv
snp_cline_fix2.R
snp_clines_fix2.csv
fixed3.csv
snp_cline_fix3.R
snp_clines_fix3.csv
segregating1.csv
snp_cline_seg1.R
snp_clines_seg1.csv
segregating2.csv
snp_cline_seg2.R
snp_clines_seg2.csv
segregating3.csv
snp_cline_seg3.R
snp_clines_seg3.csv
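The idea of splitting one table into several inputs for parallel runs can be sketched as follows. This is only an illustration in Python: the README does not say how rows were assigned to the three files, so the even, in-order split below is an assumption, and the function name is hypothetical.

```python
def split_rows(rows, n_chunks):
    """Split rows into n_chunks consecutive pieces of nearly equal size."""
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional row each
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


# Hypothetical SNP rows, represented here by row numbers
rows = list(range(10))
chunks = split_rows(rows, 3)

# Pair each chunk with a file name following the fixed1.csv pattern
named = {f"fixed{i}.csv": chunk for i, chunk in enumerate(chunks, start=1)}
for name, chunk in named.items():
    print(name, len(chunk))
```

Each named chunk can then be handed to its own copy of the analysis script, which is the arrangement the file list above reflects.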
Folder: clines
This folder contains summaries, downstream analyses, and comparisons that we conducted to identify candidate genes.
FlyBase_Converted_Coordinates.txt This file is the output from flybase.org. We gave the coordinate converter tool our SNP coordinates from the r5 genome and asked that they be converted to the r6 genome. This was necessary to find SNPs located in genes.
FlyBase_Fields_download.txt This is the output from the query tool on flybase.org. We submitted our outlier SNPs and retrieved the biological function of the identified genes.
cline_fix_sum.csv This file summarizes the clines for SNPs that were fixed in the northern and southern extremes. Variables are the same as files described above
cline_seg_sum.csv This is a cline summary for segregating SNPs.
clinelist.csv This is a csv of the outlier genes that we used in comparisons with the Bailey et al. 2011 data.
filtered_snps.csv This file contains the new coordinates from FlyBase concatenated with the rest of the SNP cline data.
genes_with_slopes.csv This file contains a summary of the SNPs that were within genes. The first three columns are the output text from FlyBase. The first column is the chromosome location. The second column is the nucleotide position of the beginning of the gene. The third column, "...", is the separator that FlyBase uses. The fourth column is the nucleotide position of the end of the gene. The fifth column concatenates the previous four columns into a single cell. The sixth column is the Dmel gene name. The seventh column is the gene function for genes that had matched queries on FlyBase. The eighth and ninth columns are the position of the specific SNP within the gene. The tenth and eleventh columns are the slope and p-value from the cline analysis.
Any gene names in red are genes with potential roles in behavior through their biological function classification on FlyBase
intersectlists.R This code compares the list of outliers from our study with those from Bailey et al. 2011.
intersect output.txt This is the output of shared genes.
plos_one_list.csv This is the list of genes from Bailey et al. 2011 that were identified as differentially expressed and that we compared our outliers with.
ploslist.csv This is the list in a csv format to use in the intersectlists.R code.
pop latitudes.txt This file contains the latitudes of each population that were used in calculating the clines.
pop.txt This file lists the names of the samples from the Kao et al. 2015 dataset that we used in our analysis.
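The comparison of two gene lists described for intersectlists.R amounts to a set intersection. The Python sketch below illustrates that idea only; it is not the authors' R code, and the gene names are placeholders rather than genes from either list. Matching is kept case-sensitive here as an assumption, since gene symbols can differ only by case.

```python
def shared_genes(outliers, reference):
    """Return the sorted gene names present in both lists."""
    # Strip stray whitespace and drop empty entries before comparing
    ours = {g.strip() for g in outliers if g.strip()}
    theirs = {g.strip() for g in reference if g.strip()}
    return sorted(ours & theirs)


# Placeholder gene names, not real results
outlier_genes = ["geneA", "geneB ", "geneC", ""]
reference_genes = ["geneB", "geneD", "geneA"]

overlap = shared_genes(outlier_genes, reference_genes)
print(overlap)
```

Reading each list from its csv and writing the overlap to a text file would reproduce the shape of the inputs and output listed above.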
GO outlier analysis
After conducting our cline analysis, we wanted to determine whether any gene ontology (GO) categories were overrepresented in our outliers. The results of this analysis are included in the folder: GO enrichment
genes_with_slopes.csv This file is the same as above. It contains the gene names that we fed into FlyEnrichr to complete our analysis.
The resulting output from FlyEnrichr was exported in these *.txt files. For one of them, a spreadsheet version was made from the txt file so that the output was easier to read.
GO_Biological_Process_2018_table.txt
GO BIO Process 2018 table.csv
GO_Biological_Process_GeneRIF_table.txt
InterPro_Domains_2019_table.txt
Phenotype_GeneRIF_table.txt
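As background on what "overrepresented" means, the Python sketch below computes a one-sided hypergeometric p-value for a single GO term. It illustrates the general over-representation test only; the enrichment tool's own statistics and corrections may differ, and the function name and counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_background, n_in_term, n_outliers, n_overlap):
    """P(overlap >= n_overlap) when n_outliers genes are drawn at random
    from n_background genes, of which n_in_term carry the GO term."""
    total = comb(n_background, n_outliers)
    upper = min(n_in_term, n_outliers)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_outliers - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 background genes, 5 in the term,
# 5 outliers, all 5 of which carry the term
p = enrichment_pvalue(10, 5, 5, 5)
print(p)
```

A small p-value indicates that the outlier list contains more genes from the term than random sampling would usually produce.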
Population Genetic Analysis for Nrg and Shep
Based on our GO enrichment and outlier analyses, we decided to focus on Neuroglian (Nrg) and alan shepard (shep) for our genetic crosses. To gain more insight into these genes, we examined their population genetics using available data collected as part of the Drosophila nexus. Data were originally downloaded from PopFly for the regions of interest. Results from this analysis and the code are included in the folder: New popgen analysis.
The files are as follows:
nrg_meta.csv These are all of the samples included for the Nrg analysis
Nrg_noN.fasta This fasta file contains the chromosomes that remain after removing those that are all N
Nrg_sites.fasta These are the polymorphic sites used in the PCA analysis
shep_meta.csv These are all of the samples included in the analysis for shep
output_Chr3L_5124380_5293734.fasta This is the original data downloaded from PopFly for shep, which is on chromosome 3L
shep_noN.fasta (named as the shep counterpart of Nrg_noN.fasta)
output_ChrX_8403859_8456776.fasta This is the original data downloaded from PopFly for Nrg, which is on the X chromosome
shep_sites.fasta (named as the shep counterpart of Nrg_sites.fasta)
pca analysis.R This is the code we used for our PCA analysis
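The two preprocessing steps implied by the file names (dropping sequences that are all N, then keeping polymorphic sites) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: N is treated as missing data, positions are 0-based, and the function names and toy sequences are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of name -> sequence."""
    seqs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def drop_all_n(seqs):
    """Remove sequences that contain no base other than N."""
    return {n: s for n, s in seqs.items() if set(s.upper()) - {"N"}}


def polymorphic_sites(seqs):
    """Return 0-based positions with more than one non-N allele."""
    sites = []
    for i, column in enumerate(zip(*seqs.values())):
        alleles = {base.upper() for base in column} - {"N"}
        if len(alleles) > 1:
            sites.append(i)
    return sites


# Toy alignment, not real data
fasta_text = ">a\nACGT\n>b\nACGA\n>c\nNNNN\n>d\nATGN\n"
kept = drop_all_n(parse_fasta(fasta_text))
sites = polymorphic_sites(kept)
print(sorted(kept), sites)
```

The columns at the returned positions form the reduced alignment that a PCA would take as input.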
The following files are pi and Fst exports for the genes at different window sizes. RAL is the Raleigh population and ZI is the Zimbabwe population
RAL_Pi_1kb-3L-5124427..5293781.gff3
RAL_Pi_1kb-X-8403859..8456275.gff3
ZI_Pi_1kb-3L-5124427..5293781.gff3
ZI_Pi_1kb-X-8403859..8456275.gff3
ZI_RAL_fst_1kb-3L-5124427..5293781.gff3
ZI_RAL_fst_1kb-X-8403859..8456275.gff3
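For readers unfamiliar with pi, the Python sketch below computes nucleotide diversity as the mean proportion of pairwise differences, overall and in fixed-size windows. It is an illustration only: the exported values came from PopFly, whose exact estimator and handling of missing data are not described here, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations


def pairwise_diff(a, b):
    """Count differences and compared sites between two sequences,
    skipping sites where either sequence has N."""
    compared = diffs = 0
    for x, y in zip(a, b):
        if x == "N" or y == "N":
            continue
        compared += 1
        diffs += x != y
    return diffs, compared


def nucleotide_diversity(seqs):
    """Mean per-site difference over all pairs of sequences."""
    per_pair = []
    for a, b in combinations(seqs, 2):
        diffs, compared = pairwise_diff(a, b)
        if compared:
            per_pair.append(diffs / compared)
    return sum(per_pair) / len(per_pair) if per_pair else float("nan")


def windowed_pi(seqs, window):
    """Return (window start, pi) for consecutive non-overlapping windows."""
    length = min(len(s) for s in seqs)
    return [
        (start, nucleotide_diversity([s[start:start + window] for s in seqs]))
        for start in range(0, length, window)
    ]


# Toy aligned sequences, not real data
seqs = ["AAAA", "AAAT", "AATT"]
pi = nucleotide_diversity(seqs)
windows = windowed_pi(seqs, 2)
print(pi, windows)
```

With real data the window would be 1000 bases to match the 1kb files listed above.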
Quantitative complementation tests for Nrg and shep
Lastly, we completed a test similar to a quantitative complementation test and recorded the mate choice of females of different genotypes. The analysis of these data is in the following R files.
Nrg_analysis.R This is an annotated R file that describes the complete analyses of the data for Nrg.
shep_analysis.R This file is structured similarly to the Nrg file and contains our analysis of the shep data.
Code/Software
All code used in the analyses is described above and is annotated within the scripts.
