Data from: Large-scale mutation in the evolution of a gene complex for cryptic coloration

Cite this dataset

Gompert, Zachariah et al. (2020). Data from: Large-scale mutation in the evolution of a gene complex for cryptic coloration [Dataset]. Dryad. https://doi.org/10.5061/dryad.pk0p2ngkf

Abstract

The types of mutations affecting adaptation in the wild are only beginning to be understood. In particular, whether structural changes shape adaptation by suppressing recombination or by creating new mutations is unresolved. Here we show that multiple, linked but recombining loci underlie cryptic color morphs of Timema chumash stick insects. In a related species, these loci are found in a region of suppressed recombination, forming a supergene. However, in seven species of Timema we find that a mega-base size ‘supermutation’ has deleted color loci in green morphs. Moreover, we find that balancing selection likely contributes more to maintaining this mutation than does introgression. Our results show how suppressed recombination and large-scale mutation can help package gene complexes into discrete units of diversity, such as morphs, ecotypes, or species.

Methods

This repository contains a compilation of data, functions and scripts used in: Romain Villoutreix, Clarissa F. de Carvalho, Víctor Soria-Carrasco, Dorothea Lindtke, Marisol De-la-Mora, Moritz Muschick, Jeffrey L. Feder, Thomas L. Parchman, Zach Gompert, and Patrik Nosil. (2020) Large-scale mutation in the evolution of a gene complex for cryptic coloration. Science. More detailed information can be found in the folders for each particular analysis in the Online Supplementary Materials.

Usage notes

color_data.tar.gz: This compressed directory contains the raw phenotypic data extracted from photographs (latRG and latGB), along with information on the sequenced individuals, with one file per species. For technical reasons (missing photographs or failed sequencing), the lists of phenotyped and sequenced individuals differ, which explains the slight discrepancy in numbers between these files and the tables in the Online Supplementary Materials.

timema_cristinae_1.3c2_braker_interproscan_predgene_and_funcann.wseq.gff3.bz2: The annotation file, in gff3 format, obtained using braker and InterPro matches. Intermediate data for the annotation:

  • repeats library - RepeatLibMergeCentroidsRM.lib
  • masked genome - tcristinae_draft_1.3b2.fasta.masked
  • transcriptome data - Iowa.ALL454Reads.fq.bz2
  • RNA alignments - outAligned.sortedByCoord.out.bam
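
As a quick way to inspect the annotation file without decompressing it first, the following is a minimal Python sketch that counts feature types in a bzip2-compressed gff3 file. The function name `count_feature_types` is hypothetical and not part of the repository. The sketch assumes the standard nine-column, tab-separated gff3 layout, and it stops at a `##FASTA` directive in case sequences are embedded in the file (the `.wseq` suffix suggests this, but that is an assumption).

```python
import bz2
from collections import Counter


def count_feature_types(path):
    """Count GFF3 feature types (column 3) in a bzip2-compressed file.

    Hypothetical helper for illustration; not part of the repository.
    """
    counts = Counter()
    with bz2.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # embedded sequences follow; no more feature lines
            if line.startswith("#") or not line.strip():
                continue  # skip directives, comments, and blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 9:
                continue  # skip lines that are not standard feature rows
            counts[fields[2]] += 1
    return counts
```

Calling `count_feature_types("timema_cristinae_1.3c2_braker_interproscan_predgene_and_funcann.wseq.gff3.bz2")` would return a `Counter` keyed by feature type (for example `gene` or `mRNA`, depending on what the file contains).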

genome_scripts.tar.gz:  This compressed directory contains functions and scripts used to conduct analysis. It contains the following subfolders:

  • annotation - This directory contains the scripts used to annotate the genome with braker and interproscan
    • repeatmodeler.sh - Runs RepeatModeler to create a library of de novo TEs
    • merge_libraries.sh - Uses vsearch to merge the RepeatModeler library with the curated TE database from Soria-Carrasco et al. (2014)
    • repeatmasker.sh - Masks repeats in the genome by running RepeatMasker
    • star.sh - Aligns 454 RNA reads to the genome using STAR
    • braker.sh - Uses braker to annotate structural genes using RNA alignments
    • interproscan.sh - Functional annotation with interproscan. This script is quite specific to the SGE cluster of the University of Sheffield and its InterProScan installation
    • add_interpro_annotations.sh - Adds functional annotations to the braker gff3 structural annotations
    • reorientate_1.3b2_to_1.3c2.pl - Reorientates scaffolds 702.1 and 2963 a posteriori
  • epistasis - This directory contains scripts to run epistasis analysis and plot graphics.
    • boxplot2.R - Plots the genotype at each SNP against the phenotypic score (RG or GB values) for all samples
    • get_list_snps_latRG.sh - Computes a “melanic allele dosage” score for all individuals from the provided SNP list (10 SNPs in the study). Here 0 = the individual is homozygous for the green allele at all 10 SNPs, and 20 = the individual is homozygous for the melanic allele at all 10 SNPs.
    • meldosage_latRG.R - Plots RG or GB scores against the “melanic allele dosage”, with each insect's dot drawn in its real color (recorded from photograph values).
    • Ordernadoselecting_topSNPs_scaf128.sh - Selects the 10 SNPs from scaffold 128 with the highest association scores from GEMMA (BSLMM analysis on color).
    • runMAPIT.R - This is a wrapper to run MAPIT analyses (in combination with runMAPITExhaustiveSearch.R)
    • runMAPITExhaustiveSearch.R - This is a wrapper to run MAPIT analyses (in combination with runMAPIT.R)
    • mapit_chumashRG.sh - Script used to run MAPIT (using the runMAPIT.R and runMAPITExhaustiveSearch.R wrappers)
  • gwas - This directory contains a set of scripts to run GWA analyses with GenABEL and GEMMA. Further information can be found in the readme file inside the subdirectory.
  • PCA - Contains the R function used to run PCA on genotypes.
    • PCA_genotypes.R - Computes a PCA on a genotype file in bimbam format and colours individuals by the grouping factor provided.
  • phasing - Contains the script used to phase Timema chumash and Timema bartmani GBS data
    • phasing.sh - Bash script running the fastPHASE software
  • genotype calling - Contains the functions used to filter and QC GBS sequence data, align to a reference sequence, build a consensus sequence for each species, call variants, and infer genotypes to be used for GWAS analyses along with the phenotypes. See readme file inside the folder for more information.
  • sequencing_coverage - Contains functions used to compute depth statistics along the reference genome.
    • samtools_pooled_depth.pl - Computes the statistics
    • samtools_pooled_depth_plots.pl - Plots the statistics
    • samtools_pooled_depth_plots_per_scaffold.pl - Plots the statistics for a given scaffold
    • samtools_pooled_depth_plots_zoom.pl - Plots the statistics for a given region within a scaffold
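
The “melanic allele dosage” score described for get_list_snps_latRG.sh in the epistasis folder can be illustrated with a minimal Python sketch. It assumes each genotype is coded as the number of melanic alleles carried at that SNP (0, 1, or 2); this coding and the function name `melanic_allele_dosage` are assumptions made for illustration, and the actual script may work from a different input format.

```python
def melanic_allele_dosage(genotypes):
    """Sum melanic-allele counts across SNPs for one individual.

    `genotypes` holds one value per SNP: 0, 1, or 2 copies of the
    melanic allele (an assumed coding, not taken from the repository).
    """
    for g in genotypes:
        if g not in (0, 1, 2):
            raise ValueError(f"genotype must be 0, 1, or 2, got {g!r}")
    return sum(genotypes)


# With 10 SNPs the score ranges from 0 (homozygous green at every SNP)
# to 20 (homozygous melanic at every SNP).
all_green = melanic_allele_dosage([0] * 10)
all_melanic = melanic_allele_dosage([2] * 10)
```

Under this coding, an individual heterozygous at all 10 SNPs would score 10.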

indel.tar.gz: This compressed directory contains scripts and input data used to fit hidden Markov models to detect the insertion/deletion.
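
This record does not describe the structure of the hidden Markov models, so the following Python sketch only illustrates the general technique and is not the authors' model. It assumes a generic two-state HMM (0 = region absent, 1 = region present) over per-window read depths with Poisson emissions, decoded with the Viterbi algorithm. The function names, state means, and stay probability are all hypothetical.

```python
import math


def poisson_log_pmf(count, mean):
    """Log probability of `count` under a Poisson distribution with `mean`."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)


def viterbi_two_state(depths, state_means=(0.5, 10.0), stay_prob=0.95):
    """Most likely state path (0 = absent, 1 = present) for a depth series.

    Illustrative assumptions: Poisson emissions with the given means,
    a symmetric transition matrix, and a uniform prior on the first state.
    """
    if not depths:
        return []
    log_stay = math.log(stay_prob)
    log_switch = math.log(1.0 - stay_prob)
    states = (0, 1)
    score = [math.log(0.5) + poisson_log_pmf(depths[0], state_means[s])
             for s in states]
    backpointers = []
    for depth in depths[1:]:
        new_score, pointers = [], []
        for s in states:
            # Best previous state for arriving in state s at this window.
            candidates = [score[p] + (log_stay if p == s else log_switch)
                          for p in states]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best_prev)
            new_score.append(candidates[best_prev]
                             + poisson_log_pmf(depth, state_means[s]))
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

For example, `viterbi_two_state([10, 12, 9, 0, 1, 0, 0, 11, 10])` labels the run of near-zero depths as state 0 and the flanking windows as state 1. The scripts in indel.tar.gz should be consulted for the model actually fitted.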

mapping.tar.gz: This compressed directory contains input and output for GWA mapping of color that was used for plots

  • latRG = input and output for lateral RG color
  • latGB = input and output for lateral GB color

PacStruct.tar.gz:  This compressed directory contains scripts, inputs and outputs for the structure model used to identify and visualize haplotype blocks

phylo_anc.tar.gz:  This compressed directory contains scripts and input files for ancestral reconstruction of the deletion.

Funding

European Research Council, Award: NatHisGen R/129639