Replaying the evolutionary tape to investigate subgenome dominance in allopolyploid Brassica napus
Bird, Kevin et al. (2021), Replaying the evolutionary tape to investigate subgenome dominance in allopolyploid Brassica napus, Dryad, Dataset, https://doi.org/10.5061/dryad.h18931zjr
Interspecific hybridization and allopolyploidization merge evolutionarily distinct parental genomes (subgenomes) into a single nucleus. A frequent observation is that one subgenome is "dominant” over the other subgenome, having a greater number of retained genes and being more highly expressed. Which subgenome becomes dominantly expressed in allopolyploids remains poorly understood. Here we “replayed the evolutionary tape” with six isogenic resynthesized Brassica napus (rapeseed) allopolyploid lines and investigated subgenome dominance patterns over the first ten generations post merger. We found that the same subgenome was consistently more dominantly expressed in all lines and generations and that >70% of biased gene pairs showed the same dominance patterns across all lines and an in silico hybrid of the parents. Gene network analyses indicated an enrichment for network interactions and several biological functions for the Brassica oleracea derived 'BnC' subgenome biased pairs, but no enrichment was identified for Brassica rapa derived ‘BnA’ subgenome biased pairs. Furthermore, DNA methylation differences between subgenomes mirrored the observed gene expression bias towards the 'BnC' subgenome in all lines and generations. These methylation patterns were consistent with those previously associated with higher expression, but differ from proposed mechanisms from recent conceptual models and with observations in other polyploid systems that exhibit subgenome dominance. Many of these differences in gene expression and methylation were also found when comparing the progenitor genomes, suggesting subgenome dominance is partly related to parental genome differences rather than just a byproduct of allopolyploidization. These findings demonstrate that "replaying the evolutionary tape" in an allopolyploid results in largely repeatable and predictable subgenome expression dominance patterns, partly due to preexisting genetic differences among the parental species.
Homoeologous exchange analysis
Paired end 150bp genomic illumina reads were filtered with Trimmomatic v 0.33 (Bolger et al. 2014) to remove illumina TruSeq3 adapters. Trimmed reads were aligned to the in silico B.napus reference genome with Bowtie2 v.188.8.131.52(Langmead and Salzberg 2012) on default settings with the flag “--very-sensitive-local”. Bam files sorted with bamtools (Barnett et al. 2011) for use in downstream analyses.
MCScan toolkit (Tang et al. 2008) was used to identify syntenic, homologous gene pairs (syntelogs) between Brassica rapa (reference genome R500) and Brassica oleracea (reference genome TO1000; Parkin et al. 2014). In the synthetic polyploid these can be thought of as syntenic homoeologs. Bed files based on chromosome and start/stop position information for each subgenome were generated. For all 18 samples (6 individuals x 3 generations) read depth for the A subgenome (BnA) syntenic homoeologs was determined in Bedtools (Quinlan and Hall 2010) with BedCov using the R500 syntelog bed file and for the C subgenome (BnC) using the TO1000 syntelog bed file. In R v 3.4.1, read depths for each syntenic homoeolog was normalized to reads per million for subgenome of origin and the ratio of reads mapping to a syntenic homoeolog compared to the overall read mapping for a syntenic homoeolog pair was averaged over a window of 50 genes with a step of one gene.
Homoeologous exchanged regions were identified by calculating average read depth for the BnC subgenome along a sliding window of 170 (85 up- and down stream) genes and step size of one. If 10 or more consecutive genes had a read depth within a pre-selected range it was called a homoelogous exchange. Regions 0 ≤ read depth < 0.2 were predicted to be in a 0BnC-to-4BnA ratio, 1BnC-to-3BnA was predicted for 0.2 ≤ read depth < 0.4, 2BnC-to-2BnA was predicted for 0.4 ≤ read depth < 0.6, 3BnC-to-1BnA for read depth between 0.6 ≤ read depth <0.8 and 4BnC-to-0BnA for read depth between 0.8 ≤ read depth < 1.
Raw RNA-seq reads were filtered with Trimmomatic v 0.33 (Bolger et al. 2014) to remove illumina TruSeq3 adapters and mapped to the in silico reference using STAR v 2.6.0 (Dobin et al. 2013) on default settings. Transcripts were quantified in transcripts per million (TPM) from RNAseq alignments using StringTie v 1.3.5 (Pertea et al. 2015). Because the syntelogs in the progenitor genomes are in the subgenomes of the synthetic polyploids, they can be thought of as syntenic homoeologs. To avoid dosage imbalance, only syntenic homoeologs determined to be at a 2:2 dosage balance were analyzed for homoeolog expression bias. Additionally, to remove lowly expressed genes that might be noise, syntenic homoeologs were only kept if the total TPM of the pair was greater than 10. Syntenic homoeolog pairs with Log2 Foldchange greater than 3.5 were called BnC biased, and less than 3.5 were called BnA biased. This cutoff follows the practice of Woodhouse et al. (2014) which used a log FC cutoff of 2 to determine homoeolog expression bias, however to more confidently reduce false positives a higher FC cutoff of 3.5 was used. Because lack of subgenome dominance would follow a normal distribution where deviations from 0 FC is equal in either direction, a Chi-squared goodness of fit test was carried out to test for normality. The R package Upsetr was used to identify and plot syntenic homoeologs shared by all lines for a given generation. For each generation, Arabidopsis thaliana orthologs were identified for genes showing the same subgenome bias in all six lines and the progenitors and were investigated for GO and KEGG pathway enrichment (Ashburner et al. 2000; Kanehisa and Goto 2000) in the STRING PPI network (Szklarczyk et al. 2017) using the online STRING network search application. STRING also calculated and reported average node degree, clustering coefficients, and enrichment for network interactions.
DNA methylation analysis
Whole genome bisulfite sequencing (WGBS) data was mapped to the combined in silico reference genome using methylpy v1.3.8 (Schultz et al. 2015) (see Supplementary Table X); using cutadapt v2.3 (Martin 2011) for adaptor trimming, Bowtie2 v2.3.5 (Langmead and Salzberg 2012) for alignment, and Picard tools v2.20.2 for marking duplicates. The chloroplast genome is unmethylated in plants and was used as an internal control for calculating the non-conversion rate of bisulfite treatment, the percentage of unmethylated sites that fail to be converted to uracil (Lister et al. 2008). Methylpy accounts for this non-conversion in calling methylated sites.
When the parental WGBS data was mapped to the combined genome (TO100 + R500), a small fraction of reads of each sample mapped to the other sub-genome, ~1.3% TO1000 to B. rapa and ~6.1% IMB218 to B. oleracea. We compared results from mapping of the parental data to either the combined genome or their own respective genome. There was little difference in DNA methylation levels or patterns for either parent and therefore concluded that the impact of this mismapping was insignificant. As a further control, we created a ‘mock’ allopolyploid in silico. to serve as The TO1000 data was randomly downsampled to an equal number of read pairs as IMB218. The two datasets were combined and mapped to the combined genome to mimic an in silico allopolyploid. DNA methylation levels in this ‘mock’ allopolyploid were either approximately half-way between the two parents for the whole genome or nearly identical to their respective parent at a sub-genome level. If DNA methylation in the resynthesized lines is simply a combination of both parent’s methylomes, then we expect global DNA methylation in the resynthesized lines to be similar to this combined mock dataset. Deviation from this pattern would indicate global remodeling of DNA methylation.
Genome-wide levels of DNA methylation and DNA methylation metaplots were analyzed as previously described (Niederhuth et al. 2016) using python v3.7.3, Pybedtools v0.8 (Dale et al. 2011), and Bedtools v2.25.0 (Quinlan and Hall 2010). Genome-wide DNA methylation levels were calculated for each sequence context (CG, CHG, and CHH) using the weighted methylation level (Schultz et al. 2012), which accounts for sequencing coverage. For gene metaplots, cytosines from 2 kilobase (kb) upstream, 2 kb downstream and within the gene/TE body were extracted. For gene bodies, only cytosines in coding sequences were used, as the presence of TEs in introns and problems of proper UTR annotation can obscure DNA methylation at start/stop sites and introduce misleadingly high levels of DNA methylation (Niederhuth et al. 2016). Each of these three regions (upstream, gene body, and downstream) were then divided into 20 windows and the weighted methylation level for each window calculated and averaged for all genes. For LTR metaplots, the same analysis was performed, except all cytosines within the LTR body were included. Plot were made in R v3.6.0 (R Core Team, 2013) using ggplot2 (Wickham 2009). All code and original analyzed data and plots are available on Github (https://github.com/niederhuth/Replaying-the-evolutionary-tape-to-investigate-subgenome-dominance).
All intermediate files are present to reproduce figures from the manuscript. BNapusRNASeqPlots.R reads necessary files, processes them, and produces figures. All file paths are standardized with here() package in R. BisfulfiteSeq may be aided by the github repo https://github.com/niederhuth/Replaying-the-evolutionary-tape-to-investigate-subgenome-dominance
National Natural Science Foundation of China, Award: 31471173
National Natural Science Foundation of China, Award: 31871239
Directorate for Biological Sciences, Award: 2029959
National Agricultural Statistics Service
National Science Foundation, Award: 1424871