The X chromosome of insects likely predates the origin of Class Insecta
Data files
Sep 15, 2023 version files 672.60 KB
-
Damsel.gff
660.12 KB
-
README.md
12.48 KB
Abstract
Sex chromosomes have evolved independently multiple times, but why some are conserved for more than 100 million years whereas others turnover rapidly remains an open question. Here, we examine the homology of sex chromosomes across nine orders of insects, plus the outgroup springtails. We find that the X chromosome is likely homologous across insects and springtails; the only exception is in the Lepidoptera, which has lost the X and now has a ZZ/ZW sex chromosome system. These results suggest the ancestral insect X chromosome has persisted for more than 450 million years – the oldest known sex chromosome to date. Further, we propose that the shrinking of gene content of the Dipteran X chromosome has allowed for a burst of sex-chromosome turnover that is absent from other speciose insect orders.
Data files and pipeline for:
"The X chromosome of insects likely predates Class Insecta"
In Review: Evolution, 2023
#########################################################################################
All data files are publicly available. Table S1 from supplement with this information:
Isoperla grammatica (Plecoptera):
genome: GCA_945910005.1 (link: https://www.ncbi.nlm.nih.gov/assembly/GCA\_945910005.1/)
RNA: ERR9881692 (link: https://www.ncbi.nlm.nih.gov/sra/?term=ERR9881692)
Schistocerca americana (Orthoptera)
genome: GCF_021461395.2 (link: https://www.ncbi.nlm.nih.gov/assembly/GCF\_021461395.2/)
Bombyx mori (Lepidoptera)
genome: GCF_014905235.1 (link: https://www.ncbi.nlm.nih.gov/assembly/GCF\_014905235.1/)
Lucilia cuprina (Diptera)
genome: GCF_022045245.1 (link: https://www.ncbi.nlm.nih.gov/assembly/GCF\_022045245.1/)
Coccinella septempunctata (Coleoptera)
genome: GCF_907165205.1 (link: https://www.ncbi.nlm.nih.gov/assembly/GCF\_907165205.1/)
Chrysoperla carnea (Neuroptera)
genome: GCF_905475395.1 (link: https://www.ncbi.nlm.nih.gov/assembly/GCF\_905475395.1/)
Acyrthosiphon pisum (Hemiptera)
genome: GCF_005508785.2 (link: https://www.ncbi.nlm.nih.gov/assembly/GCF\_005508785.2/)
Sinella curviseta (Collembola)
genome: GCA_004115045.3 (link: https://www.ncbi.nlm.nih.gov/assembly/GCA\_004115045.3/)
RNA: SRR7948082 (link: https://www.ncbi.nlm.nih.gov/sra/?term=SRR7948082)
DNA: SRR7948081 (link: https://www.ncbi.nlm.nih.gov/sra/?term=SRR7948081)
Allacma fusca (Collembola)
genome: GCA_947179485.1 (link: https://www.ncbi.nlm.nih.gov/assembly/GCA\_947179485.1)
RNA: SRR1653171 (link: https://www.ncbi.nlm.nih.gov/sra/?term=SRR1653171)
Ischnura elegans (Odonata)
genome: GCF_921293095.1 (link: https://www.ncbi.nlm.nih.gov/assembly/GCF\_921293095.1/)
#########################################################################################
#########################################################################################
Included example dataset: Damsel.gff (gff for damselfly produced through Genespace (see below) - used as input in various parts of pipeline
column info:
Damsel.chr - chromosome location
start - gene start
end - gene end
Damsel - gene name
strand - strand (+/-)
Damsel.ord - gene order
#########################################################################################
- Downloading Data from SRA
module load SRA-Toolkit/2.11.2
fasterq-dump --threads 4 --split-files - FastQC on Data from SRA
fastqc -f fastq - Trimming Data from SRA
module load trimmomatic/0.39
java -jar trimmomatic-0.39.jar PE sample.fastq sample.fastq sample_1.paired.fq sample_1.unpaired.fq sample_2.paired.fq sample_2.unpaired.fq ILLUMINACLIP:TruSeq2-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36 - Transcriptome Assembly
module load hisat2
module load tophat
module load cufflinks
module load bowtie2/2.4.4
module load python/2.7
module load samtools
module load hisat2
module load stringtie
#Build Genome Index in HISAT2
hisat2-build genome_index
#Aligning reads to genome using HISAT2
hisat2 --phred33 -p 50 --novel-splicesite-outfile hisat2/splicesite.txt -S hisat2/accepted_hits.sam -x genome_index -1 sample_1.paired.fq.gz -2 sample_2.paired.fq.gz
#SamtoBam in Samtools, sorting bamfile
samtools view -bS -o hisat2/accepted_hits.bam hisat2/accepted_hits.sam
samtools sort -o hisat2/accepted_hits.sorted.bam hisat2/accepted_hits.bam
#Make GTF file Stringtie - will be slightly modified as input, renamed to gff for GENESPACE
stringtie hisat2/accepted_hits.sorted.bam -o transcripts.gtf
- Using Transdecoder to extract ORFs, using blast to search for homology in uniprot databse
module load ncbi-blast/2.2.31+
module load R/4.2.2
/Transdecoder5.5/TransDecoder-TransDecoder-v5.5.0/util/gtf_genome_to_cdna_fasta.pl transcripts.gtf genome.fasta > transcripts.TDCL.fasta
/Transdecoder5.5/TransDecoder-TransDecoder-v5.5.0/TransDecoder.LongOrfs -t transcripts.TDCL.fasta
blastp -query transcripts.TDCL.fasta.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.fasta -max_target_seqs 1 -outfmt 6 -evalue 1e-5 -num_threads 16 > blastp.outfmt6
/Transdecoder5.5/TransDecoder-TransDecoder-v5.5.0/TransDecoder.Predict -t transcripts.TDCL.fasta --retain_blastp_hits blastp.outfmt6 - Using SpliceFinder_2.pl to find longest isoform (SpliceFinder_2.pl enclosed in scripts), output will be used as input for GENESPACE and reciprocal blasts
module load blat
perl SpliceFinder_2.pl transcripts.TDCL.fasta.transdecoder.pep - Vim commands to alter the output of the gtf (step 4) and fasta file (step 6) to make it suitable for GENESPACE input
Steps to edit gtf
:%s/gene_id /gene_id=/g
:%s/transcript_id /transcript_id=/g
:%s/"//g
rename transcripts.gtf to transcripts.gff
rename transcripts.TDCL.fasta.transdecoder.pep.long to transcripts.TDCL.fasta.transdecoder.pep.long.faa
#########################################################################################
#########################################################################################
Identifying the X chromosome in S. curviseta
- Aligning reads from SRA to genome
module load bowtie2/2.4.4
bowtie2-build GCA_004115045.3_ASM411504v3_genomic.fna Sinella --threads 2
bowtie2 -x Sinella -1 SRR7948081_1.fastq -2 SRR7948081_2.fastq --end-to-end --sensitive -p 8 -S SRR7948081.sam --threads 8 - Extract unique reads
grep -vw "XS:i" SRR7948081.sam > SRR7948081.unique.sam - Compute coverage in 10k windows
module load soap/coverage
soap.coverage -sam -cvg -i SRR7948081.unique.sam -onlyuniq -p 8 -refsingle GCA_004115045.3_ASM411504v3_genomic.fna -window SRR7948081window 10000 - R script to identify chromosome with lowest coverage, plot coverage per chromosomes (Figure S1)
IdentifyX.Sinella.R
#########################################################################################
#########################################################################################
Preparing files for T. cristinae (pipeline altered since no chromosome-wide genome assembly)
- Download files from: https://github.com/DarrenJParker/Timema\_sex\_chr\_evol\_code/tree/main/data
Tce_b3v08.fasta
Tce_b3v08.max_arth_b2g_droso_b2g.gff
TCE_1000_chr_info_and_counts.txt - Extract AA sequences from gff and genome
Download getAnnoFastaFromJoingenes.py: https://github.com/Gaius-Augustus/Augustus/tree/master
getAnnoFastaFromJoingenes.py -g Tce_b3v08.fasta -o Timema.fasta -3 Tce_b3v08.max_arth_b2g_droso_b2g.gff -s FILTER - Make GFF from sequences maintained in step 2 (lots of sequences removed because of premature stop codons)
grep ">" Timema.fasta.aa > Timema.sequence.txt (then remove > in vim)
python script to make new gff: get.filtered.gff.py (included)
cut -d$'\t' -f1,2,5,6 Timema.Sequence.gff > Filter.txt
Filter.txt - used as input in R scripts to assess homology and compute expectations
Timema.fasta.aa - used as input for reciprocal blast analysis
#########################################################################################
#########################################################################################
Commands to perform reciprocal blasts (identify 1:1 orthologs) between each species and the reference
(commands identical for I. elegans or S. curviseta)
- Blast: Reference Species as Query, Focal as target
module load ncbi-blast/2.2.31+
makeblastdb -in Focal.fa -dbtype prot
blastp -query Reference.fa -db Focal.fa -out blast.out.txt -evalue 1e-4 -num_threads 4 -max_target_seqs 1 -outfmt 6
/trinityrnaseq-Trinity-v2.8.4/util/misc/blast_outfmt6_group_segments.pl blast.out.txt Reference.fa Focal.fa > blast.Ref.Foc.group.txt - Blast: Focal Species as Query, Reference as target
makeblastdb -in Reference.fa -dbtype prot
blastp -query Focal.fa -db Reference.fa -out blast.out.txt -evalue 1e-4 -num_threads 4 -max_target_seqs 1 -outfmt 6
/trinityrnaseq-Trinity-v2.8.4/util/misc/blast_outfmt6_group_segments.pl blast.out.txt Focal.fa Reference.fa > blast.Foc.Ref.group.txt - Obtain reciprocal best hits - requires minimum length of 50%, minimum ID 40% (get_recip_50.py enclosed in scripts - requires python2)
#########################################################################################
#########################################################################################
R Scripts to assess homology between X chromosomes
Inputs required: Focal.gff, Reference.gff, Reciprocal Best hits (recip_matches_min50.txt, output from python script "get_recip_50.py above")
Note that gffs require some light editing before input:
change column chr -> .chr
change column ID ->
change column ord -> .ord
Grasshopper as an example: Grasshopper.chr start end Grasshopper strand Grasshopper.ord
- Using Damselfly as reference
Stats.Damsel.Expectation.Allacma.r
Stats.Damsel.Expectation.Aphid.r
Stats.Damsel.Expectation.Fly.r
Stats.Damsel.Expectation.Grass.r
Stats.Damsel.Expectation.Lacewing.r
Stats.Damsel.Expectation.Ladybug.r
Stats.Damsel.Expectation.Silkworm.r
Stats.Damsel.Expectation.Stonefly.r
Stats.Damsel.Expectation.Sinella.r
Stats.Damsel.Expectation.Timema.r - Using Sinella (Springtail) as reference
Stats.Sinella.Expectation.Allacma.r
Stats.Sinella.Expectation.Aphid.r
Stats.Sinella.Expectation.Fly.r
Stats.Sinella.Expectation.Grass.r
Stats.Sinella.Expectation.Lacewing.r
Stats.Sinella.Expectation.Ladybug.r
Stats.Sinella.Expectation.Silkworm.r
Stats.Sinella.Expectation.Stonefly.r
Stats.Sinella.Expectation.Sinella.r
Stats.Sinella.Expectation.Timema.r
#########################################################################################
#########################################################################################
R Scripts to compute significance using Monte Carlo approach
Inputs required: Focal.gff, Reference.gff, Reciprocal Best hits
(recip_matches_min50.txt, output from python script "get_recip_50.py above")
- Using Damselfly as reference
Stats.Damsel.Allacma.r
Stats.Damsel.Aphid.r
Stats.Damsel.Fly.r
Stats.Damsel.Grass.r
Stats.Damsel.Lacewing.r
Stats.Damsel.Ladybug.r
Stats.Damsel.Silkworm.r
Stats.Damsel.Stonefly.r
Stats.Damsel.Sinella.r
Stats.Damsel.Timema.r - Using Sinella (Springtail) as reference
Stats.Sinella.Allacma.r
Stats.Sinella.Aphid.r
Stats.Sinella.Fly.r
Stats.Sinella.Grass.r
Stats.Sinella.Lacewing.r
Stats.Sinella.Ladybug.r
Stats.Sinella.Silkworm.r
Stats.Sinella.Stonefly.r
Stats.Sinella.Timema.r
#########################################################################################
#########################################################################################
Synteny plots in GENESPACE (v0.94)
- Make synteny plot 4 taxa (Damselfly, Grasshopper, Lacewing, Ladybug)
script: FourSpecies.Synteny.Genespace.R - Make pairwise synteny plots for 8 taxa against Damselfly
script: Pairwise.Synteny.Genespace.R
#########################################################################################
#########################################################################################
Identifying shared genes on the X
Note: this was run with all species with an X (all groups except Silkworm) and again excluding Diptera
Output is data used for Figure 2
- Use peptide sequence output from GENESPACE processing as input for Orthofinder
module load orthofinder
orthofinder -f Orthofinder/ - Use custom python script to extract single-copy orthologs across all species
Input: Orthofinder/Orthogroups/Orthogroups_SingleCopyOrthologues.txt
Input: Orthofinder/Orthogroups/Orthogroups.tsv
Script: GetSingleCopy.py (included in scripts)
Output: All9SC.txt - Use R Script to count number of shared 1:1 orthologs at each node
Input: All9SC.txt (produced above)
Input: GFF files from GENESPACE
Script: GetGenesOnX9taxa.R (included in scripts)
#########################################################################################
This analysis uses publicaly available data. It has been processed through a series of custom scripts (included) and bioinformatics software programs.
- Toups, Melissa A.; Vicoso, Beatriz (2023), The X chromosome of insects predates the origin of Class Insecta, [], Posted-content, https://doi.org/10.1101/2023.04.19.537501
- Toups, Melissa; Vicoso, Beatriz (2023), The X chromosome of insects likely predates the origin of Class Insecta, , Article, https://doi.org/10.5281/zenodo.8138705
- Toups, Melissa A; Vicoso, Beatriz (2023). The X chromosome of insects likely predates the origin of class Insecta. Evolution. https://doi.org/10.1093/evolut/qpad169
