Data and scripts for the colour analysis from: Gene flow throughout the evolutionary history of a colour polymorphic and generalist clownfish
Data files
May 27, 2024 version files 7.76 GB
- aclarkii_final.vcf.gz
- colour_analysis_data.zip
- README.md
Abstract
Although seemingly homogeneous on the surface, the oceans display high environmental heterogeneity across space and time. Indeed, different soft barriers structure the marine environment, which offers an appealing opportunity to study various evolutionary processes such as population differentiation and speciation. Here, we focus on Amphiprion clarkii (Actinopterygii; Perciformes), the most widespread clownfish species, which also exhibits the highest colour polymorphism. Clownfishes can only disperse during a short pelagic larval phase before their sedentary adult lifestyle, which might limit connectivity among populations, thus facilitating speciation events. Consequently, the taxonomic status of A. clarkii has been under debate. We used whole-genome resequencing data of 67 A. clarkii specimens spread across the Indian and Pacific Oceans to characterise the species' population structure, demographic history, and colour polymorphism. We found that A. clarkii spread from the Indo-Pacific Ocean to the Pacific and Indian Oceans following a stepping-stone dispersal and that gene flow was pervasive throughout its demographic history. Interestingly, colour patterns differed noticeably among the Indonesian populations and the two populations at the extremes of the sampling distribution (i.e. Maldives and New Caledonia), which exhibited more comparable colour patterns despite their geographic and genetic distances. Our study emphasises how whole-genome studies can uncover the intricate evolutionary past of wide-ranging species with diverse phenotypes, shedding light on the complex nature of the species concept paradigm.
README: Colour data and scripts for Amphiprion clarkii
https://doi.org/10.5061/dryad.xwdbrv1nd
The dataset contains the scripts and data for the colour analysis of the related manuscript, as well as the final VCF file used for all the analyses.
Description of the data and file structure
VCF file - aclarkii_final.vcf.gz
Final VCF file used for all the analyses. It contains the genomic information of 67 Amphiprion clarkii specimens from 8 different populations. The file was generated as follows. We trimmed the reads to remove adapter sequences using Cutadapt V2.3 (Martin, 2011). We removed reads shorter than 40 bp and with a Phred quality score below 40 with Sickle V1.33 (Joshi and Fass, 2011). We assessed read quality before and after processing with FastQC V0.11.7 (Andrews, 2010). We mapped the retained reads against the reference genome of A. percula (Lehmann et al., 2018) using BWA-MEM V0.7.17 (Li and Durbin, 2009) and subsequently sorted, indexed, and filtered them according to mapping quality (>30) using SAMtools V1.8 (Li et al., 2009). Then, we assigned all the reads to read groups using Picard Tools V2.20.7 (http://broadinstitute.github.io/picard/) and merged overlapping reads using ATLAS V0.9 (Link et al., 2017). We validated the mapping output using various statistics generated with BamTools V2.4.1 (Barnett et al., 2011). After mapping, we computed genotype likelihoods and estimated the major and minor alleles using the Maximum Likelihood Estimation method with ATLAS V0.9 (Link et al., 2017). We filtered the resulting VCF file with VCFtools V0.1.15 (Danecek et al., 2011) to keep only sites informative for all samples, with a minimum depth of 2 and a minimum Phred quality of 40.
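As a quick sanity check after download, the sample names can be read from the VCF header and counted. The following is a minimal Python sketch, not part of the deposited scripts; the function name `read_vcf_samples` is hypothetical, and the sketch only assumes that the file is a gzip- or bgzip-compressed VCF with a standard `#CHROM` header line.

```python
import gzip


def read_vcf_samples(path):
    """Return the sample names listed on the #CHROM header line of a gzipped VCF.

    Hypothetical helper for illustration; it is not one of the deposited scripts.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine tab-separated columns are fixed VCF fields
                # (CHROM ... FORMAT); every column after them is a sample.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                # Reached the variant records without seeing a header line.
                break
    raise ValueError("no #CHROM header line found in " + str(path))
```

For example, `len(read_vcf_samples("aclarkii_final.vcf.gz"))` should return 67 if the file matches the description above. Only the header is read, so the check does not need to scan the whole file.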
Folder - colour_analysis_data.zip
colour_analysis_data.zip
|- colour_patterns/
| |- RAxML_bestTree.T4_with_root
| |- clarkii_landmarks/
| | |- Indonesia/
| | |- Maldives/
| | |- NewCaledonia/
| |- mask.txt
| |- clarkii_white_balanced_pictures/
| | |- Indonesia/
| | |- Maldives/
| | |- NewCaledonia/
| |- dataset.csv
|- rda/
| |- allchr.ATLAS_majorMinor_clarkii_aki.minQ40minDepth2.minInd.color_indv.no_monophic_sites.cov
| |- allele_count_RDA/
| |- clarkii_country_reorder.pop
| |- dataset.csv
| |- imgPCA12_clarkii.rds
| |- imgPCA34_clarkii.rds
| |- imgPCA56_clarkii.rds
| |- rda_predictors.csv
| |- samples_list_reorder.txt
colour_patterns
- RAxML_bestTree.T4_with_root: phylogenetic tree based on mitochondrial data, generated with RAxML.
- clarkii_landmarks: contains the landmarks for each individual sample. The folder is organised according to the geographic origin of the samples.
- clarkii_white_balanced_pictures: contains the white-balanced pictures for each individual sample. The folder is organised according to the geographic origin of the samples.
- dataset.csv: information on the samples used for the analysis, giving the country and location of each sample as well as its picture ID, sequencing ID and sample ID. The sequencing IDs correspond to the following SRA records: https://www.ncbi.nlm.nih.gov/sra/PRJNA1025355
rda
- allchr.ATLAS_majorMinor_clarkii_aki.minQ40minDepth2.minInd.color_indv.no_monophic_sites.cov: genomic covariance matrix between all individuals, calculated with PCAngsd. The order of the individuals corresponds to the file samples_list_reorder.txt.
- allele_count_RDA: contains the allele counts for each chromosome, computed with VCFtools V0.1.14 and keeping only one SNP every 1 kb. Files with the suffix .012 contain the allele counts, where each line is an individual and each column a SNP: 0 is homozygous for the reference allele, 1 is heterozygous and 2 is homozygous for the alternative allele. Files with the suffix .indv give the order of the individuals in the .012 files (they correspond to the row names of the .012 files). Files with the suffix .pos give the position of each SNP (they correspond to the column names of the .012 files).
- clarkii_country_reorder.pop: two-column text file containing the sample name (ID) and the location (LOC). Locations are New Caledonia (NC), Indonesia - Tulamben (INT), Indonesia - Manado (INM) and Maldives (MDV).
- dataset.csv: information on the samples used for the analysis, giving the country and location of each sample as well as its picture ID, sequencing ID and sample ID. The LOC column is the abbreviation of the locality name of the samples. The sequencing IDs correspond to the following SRA records: https://www.ncbi.nlm.nih.gov/sra/PRJNA1025355
- imgPCA12_clarkii.rds: R object with the saved PCA results for axes 1 and 2 of the colour pattern analysis. More details are available within the R object.
- imgPCA34_clarkii.rds: R object with the saved PCA results for axes 3 and 4 of the colour pattern analysis. More details are available within the R object.
- imgPCA56_clarkii.rds: R object with the saved PCA results for axes 5 and 6 of the colour pattern analysis. More details are available within the R object.
- rda_predictors.csv: text file containing the predictors used for the RDA analysis. ID corresponds to the sequencing ID of the sample, location_id is the short identifier of the sampling locality, location is the numeric code of the locality, picture indicates the related image, latitude and longitude give the coordinates of the samples, genPC1 and genPC2 are the PCA values of axes 1 and 2 based on genomic data, and colPC1-colPC6 are the PCA values of axes 1 to 6 based on colour data.
- samples_list_reorder.txt: a list containing the sequencing IDs of the samples used for the analysis.
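The relationship between the .012, .indv and .pos files in allele_count_RDA can be illustrated with a short loader. This is a minimal Python sketch under stated assumptions, not part of the deposited R scripts: the function name `load_allele_counts` is hypothetical, the three paths are passed explicitly because the exact file names inside the folder are not listed here, and the files are assumed to follow the standard VCFtools `--012` layout (an individual index in the first column of each .012 row, and -1 for missing genotypes).

```python
def load_allele_counts(path_012, path_indv, path_pos):
    """Load one VCFtools --012 file triple as plain Python lists.

    Hypothetical helper for illustration. Returns (individuals, positions,
    genotypes), where genotypes[i][j] is the allele count of individual i
    at SNP j.
    """
    # One individual ID per line; these are the row names of the matrix.
    with open(path_indv) as fh:
        individuals = [line.strip() for line in fh if line.strip()]

    # One SNP per line as "chromosome<TAB>position"; these are the column names.
    positions = []
    with open(path_pos) as fh:
        for line in fh:
            if line.strip():
                chrom, pos = line.split()[:2]
                positions.append((chrom, int(pos)))

    genotypes = []
    with open(path_012) as fh:
        for line in fh:
            fields = line.split()
            if not fields:
                continue
            # Assumption: standard VCFtools output, where the first column is
            # the individual's index and is dropped here. Remaining values are
            # 0, 1 or 2, with -1 marking a missing genotype.
            genotypes.append([int(value) for value in fields[1:]])

    # Check that the three files describe the same matrix.
    if len(genotypes) != len(individuals):
        raise ValueError("row count does not match the .indv file")
    for row in genotypes:
        if len(row) != len(positions):
            raise ValueError("column count does not match the .pos file")
    return individuals, positions, genotypes
```

The dimension checks are there because the three files are only meaningful together: a mismatch usually means files from different chromosomes were paired by mistake.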
Code / Software
Folder - colour_analysis_scripts.zip
This folder contains the R scripts used to run the colour pattern analysis as well as the RDA analysis. R is required to run all the scripts; the scripts were created using R version 4.3.0.
colour_analysis_scripts.zip
|- colour_patterns/
| |- Clarkii_color_pattern.R
| |- Colour_Analyses_Functions.R
|- rda/
| |- RDA_color_analyses_functions.R
| |- RDA_color_analyses.R
colour_patterns
The script Colour_Analyses_Functions.R contains the functions called in the script Clarkii_color_pattern.R. The functions are described within the script. The script Clarkii_color_pattern.R is annotated and consists of the following steps: (1) Procrustes analysis, (2) resolution transformation, (3) image classification, (4) PCA and (5) colour adjacency analysis.
rda
The script RDA_color_analyses_functions.R contains the functions called in the script RDA_color_analyses.R. The functions are described within the script. The script RDA_color_analyses.R is annotated and consists of the following steps: (1) loading data, (2) RDA analysis by chromosome, (3) RDA analysis on all chromosomes, (4) identification of significant SNPs and (5) plotting.