Data from: Island size shapes genomic diversity in a great speciator (Aves: Zosterops)
Data files
Mar 05, 2025 version files (18.58 KB)
- README.md (9.95 KB)
- zosterops_genomes_github_copy_noCode.zip (8.63 KB)
Abstract
Islands have long represented natural laboratories for studying many aspects of ecology and evolutionary biology, from speciation to community assembly. One aspect that has been well documented is the correlation between island size and taxonomic diversity, likely due to decreased complexity and population size on small islands. This same logic can apply to genetic diversity, which should predictably decrease with effective population size. The island size-diversity correlation has received support over the years, but often focuses on simple metrics of genetic diversity. Here, we use Zosterops white-eyes in the Solomon Islands to study the correlation between island size and various metrics related to genetic diversity, including runs of homozygosity and fixation of transposable elements. We find that almost all of these metrics strongly correlate with island size, and in turn with each other. We infer that island size is independently correlated with these different variables, demonstrating that population size impacts genomic metrics of diversity in a variety of ways across temporal and hierarchical scales.
https://doi.org/10.5061/dryad.z8w9ghxqf
This repository contains the directory structure and select input files used to generate estimates of nucleotide diversity and other metrics of genomic variation for correlation with island size from whole genome resequencing of the Solomon Islands white-eye radiation. Scripts are deposited on the associated Zenodo repository: https://datadryad.org/downloads/zenodo_file/3812095.
Please note that the directory structure is based on the GitHub repository and, in most cases, needs the Zenodo scripts to function. This README corresponds to the full repository, not just the Dryad components.
Description of the data and file structure
Run steps in directories to get from raw data to output data tables and some of the figures (or partial figures later edited in Illustrator).
The Zenodo upload contains code and additional files; the Dryad upload contains only the additional files. All can be found here: https://github.com/jdmanthey/zosterops_genomes.
Directories are listed below, with summaries of each script within. Files included in the Dryad upload are marked with an asterisk; the rest are in the Zenodo upload and on GitHub.
01_prep_reference = modify the published Zosterops lateralis genome for use in this project, including annotation of TEs. Numbered files were run as individual steps on a high performance computing center (HPCC), and include comments on when to run the R code.
- 01_satsuma - Bash code for using Satsuma to scaffold the fragmented Z. lateralis genome to the zebra finch chromosomal reference. Includes description of the next steps not in the code.
- rename_filter_satsuma_genome.r - Run after prior script, R code used to rename the Satsuma reference.
- 02_repeatmodeler - Bash code for running RepeatModeler.
- 03_refine_repeatmodeler_output - Bash code for pulling repeat sequences from reference.
- zosterops_filter_repeatmodeler_blast.r - R code for filtering repeat information.
- 04_repeatmasker - One line of bash code for running RepeatMasker.
- 05_index_reference - Bash code for indexing reference.
02_qual_stats = run fastqc and summarize with multiqc on all raw data (run on HPCC)
- 01_quality_check - Short bash script for getting quality information for the raw reads.
03_trim_process_genotype = quality trim with bbduk, samtools to convert to bam, and then GATK for bam processing and genotyping. Contains a popmap as well. Scripts are run on HPCC, and 04 is a note about how to use the R script to make a submission script.
- 01_rename_samples - Bash script for renaming reads.
- rename_genomes_a - Input for renaming reads in prior script.*
- 02_clean_process - Batch script for using BWA to align reads and Samtools/GATK to clean the output alignment files.
- 03_create_genotype_scripts - R script to make batch scripts for genotyping reads with GATK.
- 04_note - Note on the usage of prior script.
- popmap.txt - Simple population map, but just mapping samples to an arbitrary number.*
04_depth = get alignment depth statistics and plots (depth run on HPCC)
- 01_calc_depth_readme.txt - Note on when and how to calculate and visualize depth statistics.
- 02_samtools_depth.txt - Batch script for calculating per-sample depth.
- 03_cut.sh - Batch script for splitting up depth files from prior script.
- popmap.txt - Simple population map, but just mapping samples to an arbitrary number (same as prior folder).*
- 04_plot_coverage.r - R script for plotting depth information.
05_MELT = MELT workflow to call polymorphic transposable elements, contains a README that describes the workflow in depth.
- 01_readme - Note on how to run the MELT analyses.*
- 02_make_refs - Bash code for making MELT reference files.
- 03_process_bams - Bash code for processing bam files for MELT.
- 04_cluster_scripts.r - R script for making batch scripts for running MELT.
- make_fake_gene_bed.r - R script for making a gene file to be used by MELT (run prior to 04).
- 05_process_MELT_output.r - R script for making final output file.
05_process_vcf = filter all vcf files, make fasta files for phylogenomics, window calculations, summarize diversity and ROH. Scripts are numbered in order of use on HPCC.
- 01_initial_filter.sh - Batch script for performing simple initial filters with VCFtools.
- 02_variant_filter.sh - Batch script for filtering variants with a few different schemes.
- 03_zip_index.sh - Batch script to bgzip and tabix index each VCF.
- 04_divide_to_windows.r - R script to make batch scripts for making windows.
- 04b_make_header.sh - Shell script to run one line of code to output a VCF header.
- 05_setup - File describing how to set up popmap for later scripts, both for population designation and sample number (latter in alphabetical order, as in VCF)
- 06_calc_windows.sh - Batch script to run calculate_windows.r script.
- 07_combine_window_calcs.sh - Bash script to combine windowed stats.
- 08_simplify_vcf.sh - Script to simplify the VCFs for later analyses.
- 09_summarize_diversity.r - R script to summarize diversity statistics.
- 10_summarize_roh.r - R script to output ROH summary stats.
- calculate_windows.r - R script to calculate stats for windows using code in window_stat_calculations.r.
- vcf_list.txt - List of VCFs per chromosome.
- window_stat_calculations.r - R script with code to calculate diversity statistics for windows.
- zost_ref.fai - FASTA index file, outlining the length and start positions of each chromosome.*
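As an illustration of how a FASTA index such as zost_ref.fai can drive the windowing step, here is a minimal Python sketch. It is not the repository's 04_divide_to_windows.r. It assumes only the standard samtools .fai layout (tab-separated, sequence name in column 1 and sequence length in column 2). The window size, the contig names, and the lengths in the toy input are made up for the example.

```python
import io

def read_fai(handle):
    """Return [(name, length)] from a samtools-style .fai index."""
    contigs = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        contigs.append((fields[0], int(fields[1])))
    return contigs

def make_windows(contigs, window_size):
    """Yield (name, start, end) windows with 1-based inclusive coordinates."""
    for name, length in contigs:
        for start in range(1, length + 1, window_size):
            # The last window on a contig is truncated at the contig end.
            yield name, start, min(start + window_size - 1, length)

# Toy index text standing in for zost_ref.fai (hypothetical names and lengths).
toy_fai = "chr_a\t250\t7\t60\t61\nchr_b\t90\t270\t60\t61\n"
windows = list(make_windows(read_fai(io.StringIO(toy_fai)), window_size=100))
for window in windows:
    print(*window, sep="\t")
```

To use it on the real file, replace the `io.StringIO` object with `open("zost_ref.fai")` and choose a window size that matches the one used in the repository scripts.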
06_phylo = phylogenomics of windowed fasta files. Scripts are numbered in order of use on HPCC.
- 01_setup.sh - Bash script to get directory set up.
- 02_raxml_array.sh - Batch script for running RAxML as a job array.
- 03_combine_root_trees.r - Bash script to run R for combining tree files into one file.
- 04_summarize_trees.sh - Shell script to generate ASTRAL and maximum clade credibility trees.
- 07_compare_trees.r - R script to compare the summary tree to gene trees and ASTRAL trees.
07_demography = demographic analyses in MSMC, plotting, and calculating pop. sizes. Scripts are numbered out of order but are unmodified for posterity; see the README within the directory for more information.
- 01_msmc_input_files.r - R script to make input files for MSMC.
- 03_msmc_submit_script.r - R script to make batch scripts for running MSMC.
- 02_create_bootstraps.sh - Shell script for making MSMC bootstrap replicates.
- 04_plot_demography.r - R script to plot MSMC output.
- Readme.md - Note on how to run the code (see above, 01 -> 03 -> 02).
08_island_size = summarize and plot relationships of island size and demography and genomic diversity. Output is made from prior steps, but note that this is not the input used for final figures (although relevant values are identical).
- summary.txt - Original R input, not used for the final paper; see zosterops_genomes_appendix_2024.csv (values are the same, see below for details).
- plot_summary.r - R script for making base R plots of diversity data, replaced by 09_plotting_correlations_2024.r for final paper.
09_plotting_correlations_2024.r = updated plotting and analyses in 2024 (by Ethan Gyllenhaal). Script is designed to be run interactively and locally.
zosterops_genomes_appendix_2024.csv - File used for final stats and plots correlating diversity and island parameters. Columns are as follows:
- Order - Index variable.
- Species - Species name for sample, using the same taxonomy as the paper (Howard & Moore).
- Museum # - Museum identifier for the sample, first with museum code then a unique number.
- Sample ID - Additional museum identifier; when different, it is a tissue number or preparator number.
- Island - Island sample is from, if from the Solomons.
- Island Group - Which of four major island groups the sample was from.
- Island Size (km2) - Island size in km^2.
- Island Size (log) - Natural log of prior column, value used as predictor variable.
- # ROH - Number of runs of homozygosity.
- Length ROH - Summed length of runs of homozygosity.
- MeanLength - Mean length of ROH (i.e., length / #)
- obs_het - Observed heterozygosity in the sample.
- het_sd - Standard deviation in heterozygosity across the genome.
- NR TEs - Number of non-reference transposable elements.
- NR Homozygous TEs - Number of homozygous non-reference transposable elements.
- MSMC Recent Pop Size - Recent population size from MSMC analysis.
- MSMC Harmonic Mean Pop Size - Harmonic mean population size from MSMC analysis.
- Raw PE Reads - Number of raw reads for the sample.
- Mean Genome Cov. - Mean depth of coverage per sample.
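Two of the columns above are derived from others: Island Size (log) is the natural log of Island Size (km2), and MeanLength is Length ROH divided by # ROH. The following Python sketch recomputes both as a sanity check. It assumes the CSV header strings match the column names listed above exactly, which may not hold for the actual file. The single data row is invented for illustration.

```python
import csv
import io
import math

def check_derived(row, rel_tol=1e-3):
    """Compare stored derived values with values recomputed from the raw columns."""
    log_ok = math.isclose(
        float(row["Island Size (log)"]),
        math.log(float(row["Island Size (km2)"])),  # natural log
        rel_tol=rel_tol,
    )
    mean_ok = math.isclose(
        float(row["MeanLength"]),
        float(row["Length ROH"]) / float(row["# ROH"]),  # summed length / count
        rel_tol=rel_tol,
    )
    return log_ok, mean_ok

# Toy row with made-up values; real rows come from zosterops_genomes_appendix_2024.csv.
toy_csv = (
    "Island,Island Size (km2),Island Size (log),# ROH,Length ROH,MeanLength\n"
    "Toy Island,100,4.60517,20,5000000,250000\n"
)
rows = list(csv.DictReader(io.StringIO(toy_csv)))
results = [check_derived(row) for row in rows]
print(results)
```

Rows with missing values (for example, samples without an island size) would raise an error in `float()` and need to be skipped before calling `check_derived`.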
Files and variables
File: zosterops_genomes_github_copy_noCode.zip
Description: Zipped directory structure of analyses without scripts included. Directories are described above. This also includes a reduced README outlining the directory structure for the full repository (i.e., including the code on Zenodo). There is also a .csv here, which is used as input for 09_plotting_correlations_2024.r. It includes relevant variables for modeling in the ingroup, and a small subset of information for the outgroup (not used in modeling).
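For readers who want a quick look at the .csv without R, the sketch below computes a Pearson correlation between Island Size (log) and obs_het in Python. It is only an illustration, not the analysis in 09_plotting_correlations_2024.r, whose models are not reproduced here. It assumes the header strings match the column names described above and that rows lacking either value (such as outgroup samples) should be skipped. The toy values are invented.

```python
import csv
import io
import math

def paired_values(rows, x_col, y_col):
    """Collect (x, y) pairs, skipping rows where either value is missing or non-numeric."""
    xs, ys = [], []
    for row in rows:
        try:
            x, y = float(row[x_col]), float(row[y_col])
        except (KeyError, TypeError, ValueError):
            continue
        if math.isnan(x) or math.isnan(y):
            continue
        xs.append(x)
        ys.append(y)
    return xs, ys

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy table with made-up values; the last row mimics a sample with no island size.
toy_csv = (
    "Island,Island Size (log),obs_het\n"
    "A,2.0,0.001\n"
    "B,4.0,0.002\n"
    "C,6.0,0.003\n"
    "Outgroup,,0.004\n"
)
rows = list(csv.DictReader(io.StringIO(toy_csv)))
xs, ys = paired_values(rows, "Island Size (log)", "obs_het")
r = pearson_r(xs, ys)
print(len(xs), round(r, 6))
```

Swapping `obs_het` for another column name from the list above gives the same check for a different diversity metric.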
Code/software
Code is contained in a zipped directory structure of analyses with scripts included. See "Description of the data and file structure" for a summary of the code; more complicated steps have further instructions in their respective folders.
Access information
Other publicly accessible locations of the data:
Data was derived from the following sources:
- All samples were sequenced for this project. All raw data are available on NCBI’s Sequence Read Archive (SRA) under BioProject ID: PRJNA686795.
The dataset is whole-genome resequencing data at moderate depth of Zosterops white-eyes from the Solomon Islands, with one sample per island available. Variants were called on the data to generate input files for a variety of programs, from which multiple statistics relating to genetic diversity were calculated.