The genetic basis of cytoplasmic male sterility and fertility restoration in wheat
Data files
Jan 08, 2021 version files 604.53 MB
Abstract
Hybrid wheat varieties give higher yields than conventional lines but are difficult to produce due to a lack of effective control of male fertility in breeding lines. One promising system involves the Rf1 and Rf3 genes that restore fertility of wheat plants carrying Triticum timopheevii-type cytoplasmic male sterility (T-CMS). By genetic mapping and comparative sequence analyses we identified Rf1 and Rf3 candidates that could restore normal pollen production in transgenic wheat plants carrying T-CMS. We show that Rf1 and Rf3 bind to the mitochondrial orf279 transcript and induce cleavage, preventing expression of the CMS trait. The identification of restorer genes in wheat is an important step towards the development of hybrid wheat varieties based on a CMS-Rf system. The characterisation of their mode of action brings new insights into the molecular basis of CMS and fertility restoration in plants.
This dataset includes transcript count and coverage data from two RNA-seq experiments on gene expression in the various male-sterile and male-fertile wheat lines examined in the course of this research.
Methods
Dataset 1 is derived from the NCBI SRA BioProject PRJNA675907, which contains paired-end, random-primed, rRNA-depleted, strand-specific RNA-seq reads from 36 wheat samples (anthers). The raw read data can be obtained from NCBI SRA. RNA was extracted from anthers using the RNeasy Plant Mini Kit (Qiagen, USA) and its quality was estimated on an Agilent 4200 TapeStation (Agilent, USA). The libraries were made with the TruSeq Stranded Total RNA Ribo Zero Samples Prep Kit (Illumina, USA) and sequenced on a NovaSeq 6000 platform (Illumina) with 150 nt paired-end reads at Novogene. Reads were adapter-trimmed with bbduk (parameters ktrim=r k=23 mink=11 hdist=1 tpe tbo ftm=5). Salmon (v1.3.0) was used to assign reads to transcripts and calculate transcripts per million (TPM) values. For nuclear/cytosolic transcripts, the IWGSC 1.1 annotations were used as a reference, but with the Chinese Spring RFL transcripts replaced by captured RFL sequences from the sequenced genotype. For mitochondrial transcripts, annotated coding sequences from the T. timopheevii mitochondrial genome (NC_022714) were used, supplemented with ten T. timopheevii-specific ORFs of over 100 codons.
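For readers unfamiliar with the unit, the following minimal sketch shows how transcripts per million values relate to read counts and effective transcript lengths. It illustrates the standard TPM definition only; it does not reproduce Salmon's own estimation of effective lengths or read assignment. The function name `tpm_from_counts` and the three toy transcripts are hypothetical, while the column names `EffectiveLength`, `NumReads` and `TPM` follow the quant.sf format.

```python
import pandas as pd


def tpm_from_counts(num_reads, effective_length):
    """Standard TPM definition: length-normalised rate, scaled to sum to 1e6."""
    rate = num_reads / effective_length
    return rate / rate.sum() * 1e6


# Toy table (hypothetical values) using quant.sf-style column names.
quant = pd.DataFrame(
    {
        "Name": ["tx_a", "tx_b", "tx_c"],
        "EffectiveLength": [1000.0, 2000.0, 500.0],
        "NumReads": [100.0, 400.0, 50.0],
    }
)
quant["TPM"] = tpm_from_counts(quant["NumReads"], quant["EffectiveLength"])
print(quant)
```

Because TPM values sum to one million within each sample, they are comparable across transcripts of different lengths within that sample.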
Dataset 2 is derived from the NCBI SRA BioProject PRJNA595431, which contains paired-end, random-primed, rRNA-depleted, strand-specific RNA-seq reads from 10 wheat samples (flowering heads) differing by genotype. The raw read data can be obtained from NCBI SRA. RNA was extracted from young spikes using the RNeasy Plant Mini Kit (Qiagen, USA) and its quality was estimated on an Agilent 4200 TapeStation (Agilent, USA). The libraries were made with the TruSeq Stranded Total RNA Ribo Zero Samples Prep Kit (Illumina, USA) and sequenced on a HiSeq 4000 platform (Illumina) with 100 nt paired-end reads at Novogene. Reads were adapter-trimmed with bbduk (parameters ktrim=r k=23 mink=11 hdist=1 tpe tbo ftm=5). Salmon (v1.3.0) was used to assign reads to transcripts and calculate transcripts per million (TPM) values. The IWGSC 1.1 annotations were used as a reference, but with the Chinese Spring RFL transcripts replaced by captured RFL sequences from the sequenced genotype. For analysis of read coverage, adapter-trimmed reads were mapped to the T. timopheevii mitochondrial genome (NC_022714) with bbmap. Multi-mapped reads were distributed randomly between the best-matching sites. rRNA regions were masked because rRNA depletion was inconsistent across samples, and regions identical to plastid DNA were masked to avoid cross-mapped plastid reads. Read coverage was calculated with genomeCoverageBed (Bedtools 2 package) and normalised by dividing by the mean coverage depth outside the masked regions.
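The normalisation step described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the paper: the function name `normalise_coverage`, the 0-based half-open interval convention for masks, and the choice to report masked positions as NaN are all assumptions, and the depth values are invented.

```python
import numpy as np


def normalise_coverage(depth, masked_intervals):
    """Divide per-base depth by the mean depth outside the masked intervals.

    masked_intervals: iterable of (start, end) pairs, 0-based and half-open
    (an assumed convention). Masked positions are returned as NaN.
    """
    depth = np.asarray(depth, dtype=float)
    keep = np.ones(depth.size, dtype=bool)
    for start, end in masked_intervals:
        keep[start:end] = False
    if not keep.any():
        raise ValueError("all positions are masked")
    mean_depth = depth[keep].mean()
    normalised = depth / mean_depth
    normalised[~keep] = np.nan
    return normalised


# Toy example: positions 2-3 stand in for a masked rRNA region.
example_depth = [10, 20, 1000, 1000, 30, 20]
normalised = normalise_coverage(example_depth, [(2, 4)])
print(normalised)  # [0.5 1.  nan nan 1.5 1. ]
```

Excluding the masked positions from the mean keeps highly variable rRNA signal from distorting the scaling factor, so normalised profiles remain comparable between samples.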
Usage notes
For dataset 1, the files included here are:
- experimental_design.xlsx — lists the samples and genotypes
- references — folder of fasta files containing reference transcripts for the respective genotypes (input to Salmon)
- quants — folder of quant.sf files containing nuclear/cytosolic transcript counts (output from Salmon)
- mt_quants — folder of quant.sf files containing mitochondrial transcript counts (output from Salmon)
- rnaseq.ipynb — Jupyter notebook (Python code) to reproduce Fig. 2b and Fig. S2 from the paper using the quants files (requires Python packages pandas, numpy, matplotlib, seaborn, sklearn and diffexpr (https://github.com/wckdouglas/diffexpr))
- mt.ipynb — Jupyter notebook (Python code) to reproduce Figs. 2c and 2d from the paper using the mt_quants files
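As a starting point for working with the quants or mt_quants folders outside the supplied notebooks, the sketch below collects the TPM column of each Salmon output file into one transcripts-by-samples table. The function name `load_tpm_matrix` is hypothetical, and the folder layout is an assumption: the code accepts either one subfolder per sample containing quant.sf, or files named after the sample with a .sf extension. Check it against the actual folder contents before use.

```python
from pathlib import Path

import pandas as pd


def load_tpm_matrix(quants_dir):
    """Return a DataFrame of TPM values (rows: transcripts, columns: samples)."""
    columns = {}
    for path in sorted(Path(quants_dir).rglob("*.sf")):
        # Assumed naming: <sample>/quant.sf or <sample>.sf
        sample = path.parent.name if path.name == "quant.sf" else path.stem
        # quant.sf is tab-separated: Name, Length, EffectiveLength, TPM, NumReads
        table = pd.read_csv(path, sep="\t", index_col="Name")
        columns[sample] = table["TPM"]
    return pd.DataFrame(columns)
```

A call such as `load_tpm_matrix("quants")` would then give a table that can be passed to pandas or seaborn for plotting.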
For dataset 2, the files included here are:
- references — folder with a fasta file of reference transcripts (input to Salmon)
- RNASeq_quants.xlsx — table of read counts extracted from Salmon output
- mt_cov — folder of strand-specific read coverage files (generated by genomeCoverageBed from the bedtools2 package)
- Transgene_TPM.ipynb — Jupyter notebook (Python code) to reproduce Fig. 3c from the paper using the quants files
- mt_coverage.ipynb — Jupyter notebook (Python code) to reproduce Figures 5 and S5 from the paper using the mt_cov files
The source data underlying Figs 2a, 3b-f, 4c-e, 6c, 7b and Supplementary Figs S3b, S4b-e, S6a and S7b-d are provided as a Source Data zip file.