Supporting information of genome biology and evolution of mating type loci in four cereal rust fungi
Cite this dataset
Luo, Zhenyan; Schwessinger, Benjamin (2024). Supporting information of genome biology and evolution of mating type loci in four cereal rust fungi [Dataset]. Dryad. https://doi.org/10.5061/dryad.w0vt4b8zm
Abstract
Sex in animals and some plants is determined by sex chromosomes. In fungi, mate compatibility is determined by mating type (MAT) loci, which share some features with sex chromosomes including recombination suppression around heterozygous loci. Here, we study the MAT loci in fungal pathogens from the order Pucciniales, which cause rust diseases on many economically important plants, including wheat and oats. We show that one of the MAT loci is multiallelic, while the other is biallelic in most cases. The biallelic locus shows strong signs of recombination suppression and genetic deterioration with an increase in the number of transposable elements and gene deserts surrounding the locus. Our findings on the genome biology of MAT loci in four economically important pathogens will improve predictions on potential novel virulent isolates that can lead to large-scale pandemics in agriculture.
This dataset contains data related to the study, including data from the genealogical analysis; alignments of the HD, Pra, STE3.2-1, and mfa genes; dS value matrices; TE annotation files and other TE-related data; and data from the RNA expression analysis.
README: Supporting information of genome biology and evolution of mating type loci in four cereal rust fungi
https://doi.org/10.5061/dryad.w0vt4b8zm
Description of the data and file structure
## DATA & FILE OVERVIEW
1. Structure of the folder
├── Alignments
│ ├── Alignments for Fig 1 Fig 2 S3Fig S4Fig
│ ├── Alignments for S22Fig S23Fig
│ ├── Au-test
│ └── Cds of HD alleles for S21 Fig
├── Other supporting data
│ ├── Beast2
│ ├── Major TE compositions of MAT locus (4 Pt isolates).xlsx
│ ├── Major TE compositions of MAT locus (4 cereal rust species).xlsx
│ ├── RNA Analysis
│ ├── RDP5 output
│ ├── Repeat Classification
│ └── ds value table.xlsx
└── README.txt
Abbreviations:
Pca - Puccinia coronata f. sp. avenae
Pgt - Puccinia graminis f. sp. tritici
Pt - Puccinia triticina
Pst - Puccinia striiformis f. sp. tritici
HD - Homeodomain transcription factor genes
PR - Pheromone receptor
Pra - Pheromone receptor genes (including STE3.2-2 and STE3.2-3)
mfa - Pheromone precursor gene
MFA - Pheromone precursor gene encoding protein
aa - amino acid sequence
nuc - nucleotide sequence
Other Notes:
Numbers after abbreviations of species represent corresponding isolate names.
###########################################################################################################
DATA-SPECIFIC INFORMATION FOR FOLDER: "Alignments/Alignments for Fig 1 Fig 2 S3Fig S4Fig"
File list:
├── MFA_aln.aa
├── Pra_aln.aa.fasta
├── Pra_aln.nucl.fasta
├── STE3.2-1_aln.aa.fasta
├── STE3.2-1_aln.nucl.fasta
├── bE-HD_aln.aa.fasta
├── bE-HD_aln.nucl.fasta
├── bW-HD_aln.aa.fasta
├── bW-HD_aln.nucl.fasta
└── mfa_aln.nucl.fasta
The files above are untrimmed nucleotide or amino acid sequence alignments generated with MACSE. The Pra_aln files include sequences of both STE3.2-2 and STE3.2-3.
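As a quick sanity check before using these alignments, the sketch below parses FASTA text and reports the alignment length and the per-sequence gap fraction. It is a minimal illustration, not part of the authors' pipeline: the function names `read_fasta` and `alignment_summary` and the demo records are hypothetical, and it assumes gaps are written as "-".

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text into a dict mapping record name -> sequence."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def alignment_summary(records):
    """Return (alignment length, {name: gap fraction}) for an alignment."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; not a valid alignment")
    length = lengths.pop()
    gap_fraction = {n: seq.count("-") / length for n, seq in records.items()}
    return length, gap_fraction


# Hypothetical two-sequence alignment standing in for a file such as
# Pra_aln.nucl.fasta; replace StringIO(...) with open(path) for real data.
demo = StringIO(">seqA\nATG---AAA\n>seqB\nATGCCCAAA\n")
aln = read_fasta(demo)
length, gaps = alignment_summary(aln)
```

Because the alignments are untrimmed, a high gap fraction at either end of a sequence is expected and does not by itself indicate a problem.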
###########################################################################################################
DATA-SPECIFIC INFORMATION FOR FOLDER: "Alignments/Alignments for S22Fig S23Fig"
File list:
├── Pca_STE3.2-2_aa.phy
├── Pca_STE3.2-2_nuc.phy
├── Pca_STE3.2-3_aa.phy
├── Pca_STE3.2-3_nuc.phy
├── Pgt_STE3.2-2_aa.phy
├── Pgt_STE3.2-2_nuc.phy
├── Pgt_STE3.2-3_aa.phy
├── Pgt_STE3.2-3_nuc.phy
├── Pgt_mfa1_aa.phy
├── Pgt_mfa2_aa.phy
├── Pgt_mfa3_aa.phy
├── Pst_STE3.2-2_aa.phy
├── Pst_STE3.2-2_nuc.phy
├── Pst_STE3.2-3_aa.phy
├── Pst_STE3.2-3_nuc.phy
├── Pt_STE3.2-2_aa.phy
├── Pt_STE3.2-2_nuc.phy
├── Pt_STE3.2-3_aa.phy
└── Pt_STE3.2-3_nuc.phy
The STE3.2-*_aa.phy files above were used to generate S22 and S23 Figs; the STE3.2-*_nuc.phy files are the predicted coding regions.
###########################################################################################################
DATA-SPECIFIC INFORMATION FOR FOLDER: "Alignments/Au-test"
Each subfolder contains the AU-test performed on one species.
${species_name}_bE-HD_realn.aln in each folder contains the cds of bE-HD2, aligned with MACSE, trimmed with trimAl, then realigned with MACSE in the same way.
${species_name}_bW-HD_realn.aln in each folder contains the cds of bW-HD1, aligned with MACSE, trimmed with trimAl, then realigned with MACSE in the same way.
${species_name}_HD.to_test.trees contains the unique trees of bE-HD2 and bW-HD1 that were used for the AU-test.
Files ending in 'iqtree', '-au-test', or 'trees' are outputs of the iqtree2 AU-test; more information can be found in the github repo.
###########################################################################################################
DATA-SPECIFIC INFORMATION FOR FOLDER: "Alignments/Cds of HD alleles for S17 Fig"
This folder contains the predicted cds of the bE-HD2 and bW-HD1 alleles of Pca, Pgt and Pt. HD alleles of Pst are not included because cds of these HD alleles were not generated in our study.
###########################################################################################################
DATA-SPECIFIC INFORMATION FOR FOLDER: "Beast2"
├── bmodeltest <-------------This folder contains the bmodeltest result used to decide which model was chosen for each gene tree
│ └── bmodeltest.log.log
├── logfiles <-----------This folder contains the run logs of BEAST2
│ ├── HD.log
│ └── PR.log
├── trees <-----------This folder contains the raw tree files generated by BEAST2, which were used as input files for TreeAnnotator
│ ├── PR-realn_trim_macse_int_nuc_PR.trees
│ ├── all_HD-realn_trim_macse_int_nuc_bE-HD_aln.trees
│ ├── all_HD-realn_trim_macse_int_nuc_bW-HD_aln.trees
│ └── all_PR-realn_trim_macse_int_nuc_STE3_2-1_aln.trees
└── xml <--------------This folder contains the output files of BEAUti, which were used for testing the best Bayesian model and reconstructing gene trees
  ├── HD.xml
  └── PR.xml
###########################################################################################################
DATA-SPECIFIC INFORMATION FOR FOLDER: "RDP5 output"
├── Pca_RDP.rdp5
├── Pgt_RDP.rdp5
├── Pst_RDP.rdp5
└── Pt_RDP.rdp5
These RDP project files were generated with RDP5 on Windows 8.1. The 10 kb proximal regions of the HD locus (the region including bE-HD2 and bW-HD1) were aligned and trimmed manually to remove both ends of the alignments. The trimmed alignments were used separately as input files for RDP5 and tested with 7 methods using default settings. These files allow the identified potential recombination events to be checked.
###########################################################################################################
DATA-SPECIFIC INFORMATION FOR FOLDER: "Repeat Classification"
├── Pca203_repet.gff3
├── Pca_12NC29_reclass.csv
├── Pca_12NC29_sim_consensus_classif.csv
├── Pgt210_repet.gff3
├── Pgt_210_reclass.csv
├── Pgt_210_sim_consensus_classif.csv
├── Pst134E_reclass.csv
├── Pst134E_repet.gff3
├── Pst_134E_sim_consensus_classif.csv
├── Pt76_reclass.csv
├── Pt76_repet.gff3
└── Pt_76_sim_consensus_classif.csv
${species_name}_repet.gff3 files are the output of the TEannot (REPET v3.0) pipeline and contain the repeat annotation and the classification of the corresponding TEs.
${species_name}_sim_consensus_classif.csv files contain the classification of each TE family. Each column is described below:
seq_name:TE family names given by the REPET pipeline
length: Length of consensus sequence of this TE family
strand: Strand of this TE family
confused: Whether this TE family has more than two potential classifications. FALSE: No / True: Has more than two potential classifications
class_classif: Whether this TE family belongs to class I or class II.
-NA: Might be potential host gene (PHG) ;
-Unclassified: Undetermined TE family;
-Unclassified class I/II: Only can be classified in class I/II based on structure
order_classif: The order into which this TE family is classified (based on Wicker's classification system).
-NA/Unclassified: Undetermined
Wcode: Wicker's classification code of this TE family in superfamily level.
-NA: Undetermined
sFamily_classif: Super family classification level.
-NA: Undetermined
CI: Numbers of TE classified which related to classification.
-NA: Undetermined
coding: Motifs or hits related to TE classifications.
struct: Structure of the consensus sequence of this TE family.
other: Other profiles related to classification.
----------------------------------------------------------------------------------
${species_name}_reclass.csv files contain TE families reclassified based on the gff3 and classif.csv files above. These reclass.csv files were used for the statistical analysis of TEs at the order rank in the paper.
parent_ids: TE family names given by the REPET pipeline, identical to those in classif.csv
class: Classification of the TE family at the class rank. For a TE family without a determined class in the other two files, the class was assigned based on the 'other' column of the classif.csv file: if a pre-classified TE has more than 70% similarity to the target TE family, the family is classified according to that pre-classified TE.
Order: Same as above, but at the order rank.
-NA: Could not be reclassified under the condition of 70% similarity to any pre-classified TE in RepBase.
Wcode: This column is the same as in classif.csv
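The 70% similarity rule described for the class and Order columns can be sketched as follows. This is only an illustration of the stated rule, not the script used in the study: the function name `reclassify`, the `(similarity, classification)` hit tuples, and the choice to take the most similar hit when several qualify are all assumptions, since the README does not specify how the 'other' column is parsed or how ties are resolved.

```python
# Values treated as "no determined classification" (assumed from the
# column descriptions above).
UNDETERMINED = {None, "", "NA", "Unclassified"}


def reclassify(current, hits, threshold=70.0):
    """Return a classification for one TE family.

    current   -- the existing classification, possibly undetermined
    hits      -- iterable of (percent_similarity, classification) pairs for
                 pre-classified TEs similar to this family (hypothetical
                 structure; the real data come from the 'other' column)
    threshold -- a hit must have MORE than this similarity to count
    """
    if current not in UNDETERMINED:
        return current  # already classified; keep as is
    qualifying = [(sim, cls) for sim, cls in hits if sim > threshold]
    if not qualifying:
        return "NA"  # fails the 70% condition
    # Assumption: when several hits qualify, use the most similar one.
    return max(qualifying)[1]


# Hypothetical examples.
kept = reclassify("I", [(99.0, "II")])
rescued = reclassify("Unclassified", [(65.0, "I"), (82.5, "II")])
failed = reclassify("NA", [(70.0, "I")])
```

The same function applies at the order rank by passing order names instead of class names.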
###########################################################################################################
DATA-SPECIFIC INFORMATION FOR FOLDER: "RNA Analysis"
├── Pca12NC29_cpms.xlsx
├── Pgt210_cpms.xlsx
├── Pst104E_cpms.xlsx
├── Pst87_cpms.xlsx
└── RNA reads mapped.xlsx
${species_name}_cpms.xlsx files are TMM-normalized gene expression matrices (CPM) output by edgeR for each experiment. For more details on how these files were generated, please see the github repo: codes-used-for-mating-type/8. RNA expression of MAT. 'RNA reads mapped.xlsx' contains the counts of RNA reads that could be mapped for each timepoint and each replicate of each experiment.
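For readers unfamiliar with the unit, counts per million (CPM) with a normalization factor can be sketched as below. This is a simplified illustration of the quantity, not edgeR itself: the function name `cpm`, the gene names, and the counts are hypothetical, and the TMM normalization factor is taken as a given input rather than estimated.

```python
def cpm(counts, norm_factor=1.0):
    """Counts per million for one library.

    counts      -- {gene: raw read count} for a single sample
    norm_factor -- library normalization factor (for example a TMM factor);
                   the effective library size is lib_size * norm_factor
    """
    lib_size = sum(counts.values())
    if lib_size == 0:
        raise ValueError("library has no mapped reads")
    effective = lib_size * norm_factor
    return {gene: c / effective * 1e6 for gene, c in counts.items()}


# Hypothetical counts for one sample.
raw = {"geneA": 50, "geneB": 150, "geneC": 800}
plain = cpm(raw)                       # no normalization factor
scaled = cpm(raw, norm_factor=1.25)    # larger effective library size
```

A normalization factor above 1 increases the effective library size and therefore lowers every CPM value for that sample.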
###########################################################################################################
DATA-SPECIFIC INFORMATION FOR FILE: "ds value table.xlsx"
Target / Query : gene names in gene-pairs identified by proteinortho
evalue_ab / evalue_ba & bitscore_ab / bitscore_ba : E-Values and bit scores for both directions A->B and B->A are printed behind each match.
same_strand : whether the paired genes are on the same strand (1) or not (-1)
simscore : similarity score
protein_hamming : the Hamming distance between the protein sequences of these two genes
protein_levenshtein : the Levenshtein distance between the protein sequences of these two genes
cds_hamming : the Hamming distance between the coding sequences of these two genes
cds_levenshtein : the Levenshtein distance between the coding sequences of these two genes
yn00_dS : synonymous substitution rate (dS) of this gene pair
yn00_dN : nonsynonymous substitution rate (dN) of this gene pair
For more details on how this file was generated and how it is used, please check the github repo: codes-used-for-mating-type/4.Investigation of recombination suppression in MAT loci.
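The two distance columns can be understood through the following minimal sketch. It shows the textbook definitions only; how the original table handled sequences of unequal length for the Hamming distance is not stated, so raising an error in that case is an assumption made here.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        # Assumption: the textbook Hamming distance needs equal lengths.
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Hypothetical short sequences.
d_ham = hamming("ATGCCC", "ATGACC")
d_lev = levenshtein("ATG", "ATGAAA")
```

Unlike the Hamming distance, the Levenshtein distance accounts for insertions and deletions, so the two values can differ for gene pairs that contain indels.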
###########################################################################################################
DATA-SPECIFIC INFORMATION FOR FILE: "Major TE compositions of MAT locus (4 cereal rust species)" and "Major TE compositions of MAT locus (4 Pt isolates)"
These two files list the five TE families with the highest coverage in each HD/PR/STE3.2-1 proximal region, per haplotype and per species.
Chromosome: The chromosome containing the target regions; chromosomes are named in 'species-isolates-chromosome number-haplotype' order
TE family name: TE family name, the same as in classif.csv and reclass.csv
Order: Classification of the TE family at the order rank.
Coverage: Coverage of this TE family in the target regions.
For more details on how these files were generated and how they are used, please check the github repo: codes-used-for-mating-type/5. Synteny analysis of MAT loci and flanking regions and codes-used-for-mating-type/7. Comparison of MAT loci among Pt isolates
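Selecting the highest-coverage TE families per chromosome, as summarized in these two files, can be sketched as below. This is an illustration only and not the authors' code: the function name `top_families`, the chromosome labels, the family names, and the coverage values are all hypothetical.

```python
from collections import defaultdict


def top_families(rows, n=5):
    """Return the n TE families with the highest coverage per chromosome.

    rows -- iterable of (chromosome, te_family, coverage) tuples
    """
    by_chrom = defaultdict(list)
    for chrom, family, coverage in rows:
        by_chrom[chrom].append((family, coverage))
    return {
        chrom: sorted(fams, key=lambda fc: fc[1], reverse=True)[:n]
        for chrom, fams in by_chrom.items()
    }


# Hypothetical coverage records for two haplotype chromosomes.
rows = [
    ("chrA", "fam1", 0.10), ("chrA", "fam2", 0.30), ("chrA", "fam3", 0.20),
    ("chrB", "fam1", 0.05), ("chrB", "fam4", 0.40),
]
top2 = top_families(rows, n=2)
```

With n left at its default of 5, the function returns up to five families per chromosome, matching the number reported in the files.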
###########################################################################################################
## Code/Software
Code used to generate the data and figures is stored at https://github.com/ZhenyanLuo/codes-used-for-mating-type/tree/main; version information for the software and packages can be found in each script.
RDP5 project files were generated with RDP5 on Windows 8.1.
BEAUti v.2.7.6 and BEAST v.2.7.6 were used for the genealogical analysis.