Data from: Ancient and modern genomes reveal microsatellites maintain a dynamic equilibrium through deep time

McComish, Bennet 1 ; Charleston, Michael1; Parks, Matthew2; Baroni, Carlo3; Salvatore, Maria Cristina3; Li, Ruiqiang4; Zhang, Guojie5; Millar, Craig6; Holland, Barbara1; Lambert, David2

Published Feb 09, 2024 on Dryad. https://doi.org/10.5061/dryad.7gt3rg2

Abstract

Microsatellites are widely used in population genetics, but their evolutionary dynamics remain poorly understood. It is unclear whether microsatellite loci drift in length over time. This is important because the mutation processes that underlie these important genetic markers are central to the evolutionary models that employ microsatellites. We identify more than 27 million microsatellites using a novel and unique dataset of modern and ancient Adélie penguin genomes along with data from 63 published chordate genomes. We investigate microsatellite evolutionary dynamics over two time scales: one based on Adélie penguin samples dating to approximately 46.5 kya, the other dating to the diversification of chordates more than 500 Mya. We show that the process of microsatellite allele length evolution is at dynamic equilibrium; while there is length polymorphism among individuals, the length distribution for a given locus remains stable. Many microsatellites persist over very long time scales, particularly in exons and regulatory sequences. These often retain length variability, suggesting that they may play a role in maintaining phenotypic variation within populations.

https://doi.org/10.5061/dryad.7gt3rg2

This file explains the code and data that accompany McComish et al (2024) Ancient and modern genomes reveal microsatellites maintain a dynamic equilibrium through deep time. For further information, please contact Bennet McComish at bennet.mccomish@utas.edu.au.

Each step in the analysis is described below, with the names of relevant scripts and archives of output files. Please note that many of the scripts contain hard-coded file paths and will need to be modified to run on a different system. In some cases, example scripts for one species or sample are given, and scripts for other species/samples can be generated by substituting species/sample codes and names as appropriate. Scripts are bundled together in the file code.tar.gz.

Reference genomes, annotations, and pairwise alignments can be found at http://gigadb.org/dataset/101000 (for the 48 bird species) and http://hgdownload.cse.ucsc.edu/downloads.html (for the other 15 vertebrates). Adélie penguin sample reads can be found at https://www.ncbi.nlm.nih.gov/bioproject/210803.

Species codes used in file names follow the convention used in the bird dataset, and are formed by combining the first three letters of the genus and the first two letters of the species name, e.g. PYGAD for Pygoscelis adeliae, HOMSA for Homo sapiens.

1. Generate tables of microsatellite loci for each species

For each whole genome, Tandem Repeats Finder (TRF) is run with permissive parameters. The resulting DAT files are converted to region files for input to RepeatSeq (only used in Adélie penguin samples, see step 16 below) and to BED format (https://genome.ucsc.edu/FAQ/FAQformat.html#format1), combined, filtered into a subset for each period 2-6 with its alignment score threshold, and converted to GFF (https://genome.ucsc.edu/FAQ/FAQformat.html#format3), using awk. Subsets are labelled as p<period>s<score threshold>.

Example bash script:

trf.HOMSA.bash

Input:

A reference genome for each species in (gzipped) FASTA format

Output:

BED files: trf_bed.tar.gz contains five gzipped files in BED format for each of 63 vertebrate species, specifying the genomic locations of microsatellites of periods 2 to 6. File names are of the form <species code>*trf*<subset>.bed.gz
GFF files: trf_gff.tar.gz contains five gzipped files in GFF format for each of 63 vertebrate species, specifying the genomic locations of microsatellites of periods 2 to 6. File names are of the form <species code>*trf*<subset>.gff.gz
Adélie penguin region files: adelie_trf_regions.tar.gz contains five files specifying the genomic locations of microsatellites of periods 2 to 6 in the Adélie penguin reference genome in the format required by RepeatSeq (see https://github.com/adaptivegenome/repeatseq). File names are of the form PYGAD_trf_<subset>.regions

2. Get list of contig lengths for each reference genome

For each whole genome, the script seq_length.py from https://github.com/nextgenusfs/genome_scripts is run to produce a list of contigs with their sequence lengths.

Input:

A reference genome for each species in FASTA format

Output:

A text file <species code>.lengths for each species

3. Add flanking regions to GFF files for use in step 6

R script:

gff_add_flanks.R

Adds 20bp flanking regions to all loci in GFF files from step 1.

Input:

GFF files from step 1
Contig length files from step 2

Output:

Five files in GFF format for each of 63 vertebrate species, specifying the genomic locations of microsatellites of periods 2 to 6. File names are of the form <species code>_trf_<subset>_flanks20.gff.gz

4. Extract CNEE regions from alignments

Python script:

maffilter_run_CNEE_sp.py

Uses MafFilter to extract coordinates for regions in each bird species that align to chicken CNEEs (conserved nonexonic elements – putative regulatory regions). These are then converted to GFF format using an awk command and concatenated into a single file per bird species.

Input:

Pairwise alignments between chicken and each other bird species in MAF format (from http://gigadb.org/dataset/101000)
GFF files specifying locations of CNEEs in each chromosome of the chicken genome

Output:

A file in GFF format for each bird species specifying the genomic locations corresponding to the CNEEs identified in the chicken genome by Lowe et al. (2015). File names are of the form <species code>_CNEE.gff

5. Sort bird microsatellite GFF files by genomic location

Bash scripts:

gff_gene_sort.bash
gff_CDS_sort.bash
gff_CNEE_sort.bash

Extract genes (mRNA) and CDS annotations from bird reference genome annotation GFFs, and intersect these (and the CNEEs from step 4) with microsatellite GFFs from step 1 to sort the microsatellite loci by genomic location.

Input:

Microsatellite GFF files from step 1
Annotation GFF files for each bird species (from http://gigadb.org/dataset/101000)
CNEE GFF files for each bird species from step 4

Output:

trf_gene.tar.gz contains five gzipped files in GFF format for each of 63 vertebrate species, specifying the genomic locations of microsatellites of periods 2 to 6 that overlap gene annotations. File names are of the form <species code>_trf_<subset>_gene.gff.gz
trf_CDS.tar.gz contains five gzipped files in GFF format for each of 63 vertebrate species, specifying the genomic locations of microsatellites of periods 2 to 6 that overlap coding sequence annotations. File names are of the form <species code>_trf_<subset>_CDS.gff.gz
trf_CNEE.tar.gz contains five gzipped files in GFF format for each of 63 vertebrate species, specifying the genomic locations of microsatellites of periods 2 to 6 that align to chicken CNEEs (putative regulatory regions). File names are of the form <species code>_trf_<subset>_CNEE.gff.gz

6. Extract chicken genome coordinates of alignable microsatellite loci

Python script:

maffilter_run.py

Takes positions of microsatellite loci in GFF format (with 20bp flanks) and extracts corresponding positions from the alignment with the chicken genome. For a given species and microsatellite period, generates parameter files for MafFilter and runs for each chromosome of the chicken genome (because there is a MAF alignment file for each chromosome), then combines the results into a single file.

Input:

Pairwise alignments between chicken and each other species in MAF format (bird alignments from http://gigadb.org/dataset/101000; others from http://hgdownload.soe.ucsc.edu/downloads.html#chicken)
GFF files from step 3

Output:

A single file for each species and microsatellite period named <species code>_<subset>_all_coordinates.txt

R script:

maffilter_process.R

For a given species, chicken chromosome, and microsatellite period, takes TRF results in BED format (from step 1) and MafFilter results. For each locus in the MafFilter results, identifies the corresponding locus in the TRF results. Standardises loci to + strand coordinates and reverse complements motifs where necessary. Standardises motifs to lexographically minimal string rotation.

Output:

For each species and microsatellite period, TRF results for alignable loci in the species of interest plus corresponding coordinates in chicken genome. File names are of the form <species code>_<subset>_<chromosome>_chicken.txt. Output file columns are:
- locus: ID for each microsatellite locus
- contig: contig in the species of interest on which the locus is found
- start: start position
- end: end position
- period: period of the microsatellite
- match: percent of matches between adjacent copies (from TRF)
- score: alignment score (from TRF)
- motif
- chicken_chr: chromosome in the chicken genome to which the locus aligns
- chicken_strand: strand to which the locus aligns
- chicken_start: start position of locus in the chicken genome
- chicken_end: end position of locus in the chicken genome
- chicken_motif: motif in the chicken genome (reverse-complement if on the - strand)

7. Combine chicken genome coordinates of alignable microsatellite loci for all species

R script:

combine_loci.R

For each microsatellite period and chicken chromosome, combines tables of alignable microsatellites from all species generated in step 6 (chicken genome coordinates only) and merges or deletes duplicate loci. Sorts by start position.

Input:

Tables generated in step 6

Output:

locus_tables.tar.gz contains five text files for each chromosome of the chicken genome, specifying the genomic locations in the chicken genome of microsatellite loci of periods 2 to 6 that are alignable to the chicken reference genome, from 63 vertebrate species. File names are of the form loci_<subset>_<chromosome>.txt

8. Calculate similarity scores for nearby motifs

Java script:

MotifRead.java

For each microsatellite locus, calculates similarity scores for all loci (in any genome) with start positions within 60bp (in chicken genome coordinates). Outputs scores in ABC format for input to MCL (https://micans.org/mcl/man/mcl.html).

Input:

Locus tables generated in step 7

Output:

abc_files.tar.gz contains five ABC format text files for each chromosome of the chicken genome, specifying similarity scores for pairs of microsatellite loci of periods 2 to 6 that are alignable to the chicken reference genome, from 63 vertebrate species. File names are of the form loci_<subset>_<chromosome>.abc

9. Run MCL

For each chromosome and period, run MCL on similarity scores calculated in step 8 to generate clusters of putative homologous loci.

Input:

ABC format files generated in step 8

Output:

Five MCL format cluster files for each chromosome of the chicken genome. File names are of the form clusters_<subset>_<chromosome>.mcl

10. Convert clusters to presence/absence matrix

Java script:

ClustersToMatrix.java

Sorts loci into a table with a row for each cluster (including singleton loci which didn’t cluster) and a column for each species. Output can be binary (1 = presence / 0 = absence) or contain locus IDs for each microsatellite present.

Input:

For each species and chicken chromosome, a table of microsatellites from all species (from step 7)
MCL cluster files generated in step 9

Output:

matrix_bin.tar.gz contains five text files for each chromosome of the chicken genome, containing presence/absence (0/1) data for microsatellite loci of periods 2 to 6 that are alignable to the chicken reference genome, from 63 vertebrate species. File names are of the form matrix_bin_<subset>_<chromosome>.txt
matrix_id.tar.gz contains five text files for each chromosome of the chicken genome, containing microsatellite locus IDs, where present, for loci of periods 2 to 6 that are alignable to the chicken reference genome, from 63 vertebrate species. File names are of the form matrix_bin_<subset>_<chromosome>.txt

11. Identify regions covered in pairwise alignments

Bash script:

cov_loop.bash

Loops through all pairwise alignments, extracting start and end positions of every block of the chicken genome covered in the alignment.

Input:

Pairwise alignments between chicken and each other species in MAF format

Output:

For each of 63 vertebrate species, a text file per chicken chromosome with two columns giving start position and length of blocks covered in the pairwise alignment between that species and the chicken genome. File names are of the form <chromosome>.GALGA.<species code>.cov.txt

12. Identify ambiguous bases in genome assemblies

Python script:

fasta_find_N.py

Finds coordinates of all ambiguous bases (i.e. any character other than A, C, G, or T) in a fasta file and outputs as GFF. Run for each reference genome fasta file.

Input:

A reference genome for each species in FASTA format

Output:

A gzipped file in GFF format for each species specifying all genomic locations that contain ambiguous bases. File names are of the form <species code>_n.gff.gz

13. Get chicken genome coordinates corresponding to runs of ambiguous bases in each reference genome

Python script:

maffilter_run_N.py

Takes positions of ambiguous bases in GFF format and extracts corresponding positions from the alignment with the chicken genome. For a given species, generates parameter files for MafFilter and runs for each chromosome of the chicken genome (because there is a MAF alignment file for each chromosome).

Input:

Pairwise alignments between chicken and each other species in MAF format
GFF files from step 12

Output:

A single file for each species and chicken chromosome named <species code>_<chromosome>_N_coordinates.txt

R script:

maffilter_process_N.R

For a given species and chicken chromosome, takes MafFilter results with positions of ambiguous bases. Standardises coordinates to + strand where necessary. Outputs a table with start positions and lengths of runs of ambiguous bases aligned to chicken genome.

Output:

For each of 63 vertebrate species, a text file per chicken chromosome with two columns giving start position and length of blocks in the pairwise alignment between that species and the chicken genome that consist of ambiguous bases in the species reference genome. File names are of the form <chromosome>.GALGA.<species code>.ns.txt

14. Re-code loci with missing data

R script:

matrix_missing.R

For each species without a microsatellite at a given locus, checks whether the locus is fully covered by non-ambiguous bases in the pairwise alignment with chicken. If not, changes the entry in the binary presence/absence matrix to ‘?’ for missing data.

Input:

Presence/absence matrices from step 10
Coverage files from step 11
Files from step 13 specifying locations of ambiguous bases

Output:

matrix_missing.tar.gz contains five text files for each chromosome of the chicken genome, containing presence/absence/missingness (0/1/?) data for microsatellite loci of periods 2 to 6 that are alignable to the chicken reference genome, from 63 vertebrate species. File names are of the form matrix_gap_<subset>_<chromosome>.txt

15. Infer ages of loci

R script:

mrca_all.R

For each microsatellite period and chicken chromosome, reads in the binary presence/absence matrix and a dated phylogenetic tree. Finds the most recent common ancestor of the species with a microsatellite present for each locus, and adds the age of that ancestral node to the matrix.

Input:

Tree file Chronogram_all.clades.tre
Presence/absence matrices from step 10

Output:

matrix_age.tar.gz contains five text files for each chromosome of the chicken genome, containing presence/absence (0/1) data for microsatellite loci of periods 2 to 6 that are alignable to the chicken reference genome, from 63 vertebrate species, and a column with the age of the most recent common ancestor for each locus. File names are of the form matrix_age_<subset>_<chromosome>.txt

16. Add locus ages to TRF output

R script:

trf_age.R

Takes inferred locus ages from step 15 and adds them to the TRF output for each species.

Input:

BED files with details of microsatellite loci for each species and subset from step 1
Presence/absence matrices with locus ages from step 15
Matrices with microsatellite locus IDs from step 10

Output:

trf_age.tar.gz contains five gzipped files in BED format for each of 63 vertebrate species, specifying the genomic locations of microsatellites of periods 2 to 6, with an additional column containing the inferred age for each locus. File names are of the form <species code>_trf_<subset>_age.bed.gz

17. Ancestral state reconstruction

Code:

ancestral.R

A collection of R code (not a script!) used to carry out ancestral state reconstruction, count changes on edges, plot various graphs. Sections of code were modified as required and run on subsets of the data for each microsatellite period. Key outputs are Adélie penguin TRF tables with minimum locus ages added (most recent gain on the lineage leading to Adélie penguin).

Input:

Tree file Chronogram_all.clades.tre
Presence/absence/missingness matrices from step 14
Adélie penguin BED files from step 16

Output:

PYGAD_trf_ages.tar.gz contains five files in BED format specifying the genomic locations of microsatellites of periods 2 to 6 in the Adélie penguin reference genome, with additional columns containing ages inferred from the most recent common ancestor and from the most recent gain on the lineage leading to Adélie for each locus. File names are of the form PYGAD_trf_<subset>_ages.bed.gz

18. Align Adélie penguin sample reads to reference genome

For each sample (modern or ancient), reads are aligned to the Adélie reference genome using bowtie2 with the --very-sensitive option. Alignments are converted to BAM, sorted and indexed using samtools.

Example bash script:

bowtie2.AP1.PYGAD.bash

Input:

Paired FASTQ files for each Adélie penguin sample (available from https://www.ncbi.nlm.nih.gov/bioproject/210803)
Bowtie2 index files for the Adélie penguin reference genome

Output:

An indexed sorted BAM file for each Adélie penguin sample. File names are of the form <sample>.PYGAD.sorted.bam, <sample>.PYGAD.sorted.bam.bai

19. Genotype microsatellite loci in Adélie penguin samples

For each sample (modern or ancient), RepeatSeq is run on the BAM alignment, with a region file for each subset of microsatellite loci in the Adélie genome from step 1.

Example bash script:

repeatseq.mod.AP1.p2s22.bash

Input:

Sorted BAM files from step 18
Adélie penguin reference genome in FASTA format
Adélie penguin region files from step 1

Output:

RepeatSeq output in VCF, .repeatseq, and .calls formats (see https://github.com/adaptivegenome/repeatseq) for each Adélie sample

20. Convert calls to tables

Bash script:

calls_table.bash

Converts .calls files to tables using sed and awk, for input into R.

Input:

RepeatSeq .calls output from step 19

Output:

adelie_repeatseq_modern.tar.gz and adelie_repeatseq_ancient.tar.gz contain gzipped RepeatSeq calls tables for the five microsatellite subsets for each modern and ancient Adélie penguin sample, respectively. Columns are:
- contig
- start
- end
- motif_length
- motif
- n1_<sample>
- n2_<sample>
- qual_<sample>
where n1 and n2 are the genotype calls for the two alleles present in the sample and qual is the phred-scaled quality score. File names are of the form <sample>_repeatseq_<subset>_scaf.calls.table

21. Combine genotypes and locus info and calculate statistics

R scripts:

process_repeatseq_<subset>.R

For each period 2-6, read in genotypes for all Adélie samples and combine with locus information including age estimates. Calculate various statistics, categorise loci by genomic location, and produce tables of genotypes for model testing.

Input:

RepeatSeq calls tables from step 20
Adélie penguin TRF BED files with locus ages from step 17
Adélie penguin TRF BED files from step 1 (includes loci that do not align to the chicken genome and are therefore not present in the BED files with locus ages)
Adélie penguin GFF files from step 5 specifying which microsatellite loci are found in genes, coding sequence, and putative regulatory regions

Output:

repeatseq_genotypes.tar.gz contains five .Rdata files (one for each microsatellite subset) with tables of RepeatSeq genotype calls, split by genomic location and microsatellite purity
repeatseq_data.tar.gz contains five .Rdata files (one for each microsatellite subset) with tables of RepeatSeq call statistics, split by genomic location and microsatellite purity

22. Test sample age models using BayesFactor

R scripts:

BF_locus_sampleage_<subset>.R

For each period 2-6, run generalTestBF from the R package BayesFactor to test models including sample age as an explanatory variable on samples of 2000 loci. Sample from the posterior distributions.

Input:

Genotype call .Rdata files from step 21

Output:

BF_locus_sampleage.tar.gz contains five .Rdata files (one for each microsatellite subset) with samples of loci tested, BayesFactor objects, and posterior samples

23. Test locus age models using BayesFactor

R scripts:

BF_<subset>.R

For each period 2-6, run generalTestBF to test models including locus age as an explanatory variable. Sample from the posterior distributions. Please note that these scripts also contain an earlier version of the sample age models that incorrectly included sample as a random effect (which is confounded with sample age).

Input:

Genotype call .Rdata files from step 21

Output:

BF.tar.gz contains five .Rdata files (one for each microsatellite subset) with BayesFactor objects and posterior samples

Data from: Ancient and modern genomes reveal microsatellites maintain a dynamic equilibrium through deep time

Data files

Abstract

README: Data from: Ancient and modern genomes reveal microsatellites maintain a dynamic equilibrium through deep time