Dryad

Malinae481 exonic probe set

Cite this dataset

Dobeš, Christoph; Schmickl, Roswitha; Ufimov, Roman (2021). Malinae481 exonic probe set [Dataset]. Dryad. https://doi.org/10.5061/dryad.j3tx95xc0

Abstract

The subtribe Malinae (Rosaceae) comprises close to 1,000 species and up to 30 genera. It is defined, among other characters, by a derived base chromosome number of x = 17. A parsimonious pattern of chromosome breakage and fusion explains the derivation of the x = 17 karyotype from a polyploidisation event involving two x = 9 genomes. High collinearity between the genetic and physical maps of Pyrus and Malus, together with their identical karyotypes, suggests that the reorganization of the Malinae genome occurred before the divergence of the two genera, and presumably before speciation of the whole subtribe. Because extensive duplicated chromosomal segments dating back to the genome-wide duplication are preserved in the genomes of members of the Malinae, the homology of markers used in phylogenetic studies needs to be assessed. First, we identified single-copy orthologs and relatively recently diverged paralogs in the genome of Malus domestica. Average sequence divergence among orthologs was 3.65% in a genome-wide comparison of Malus and Pyrus. This is about one third of the average sequence divergence of 9.36% between Maleae and Amygdaleae orthologs, and around half of the 8% divergence, based on fourfold degenerate site transversions, between paralogs from the most recent WGD within each of these genomes. Based on this information, we developed a strategy to identify in the Malus genome (1) single-copy loci (presumably located within non-duplicated regions, or hidden paralogs due to differential gene loss) and (2) loci duplicated only once (only one paralog present in the genome, which avoids multi-gene families). We further constrained our locus selection by requiring at least 6% sequence divergence between loci, taken as the minimum sequence divergence between orthologs and paralogs.
Specifically, we blasted 28,695 mRNAs (hereafter queries) in the range of 800–3,000 bp from the Malus domestica ‘Golden Delicious’ (GDDH13) genome v.1.1, downloaded from the Genome Database for Rosaceae, against the Malus domestica GDDH13 genome v.1.1 (hereafter subject) using the nucleotide BLAST search. Default settings were used except for the e-value, which was lowered to 0.00001. In an initial filtering step, we only kept hits exceeding 70 bp and 10% of query length with ≥80% sequence similarity between query and subject. We then assigned the hits (usually corresponding to exons) to loci based on the criterion that the length of introns separating the hits does not exceed 10,000 bp. Queries that showed hits to >6 loci were discarded. In a second, refined filtering step, we only kept loci that fulfilled the following criteria: length cover and sequence similarity of the sum of all hits for a particular locus ≥90%, length of single hits ≥100 bp (in accordance with the bait length of 100 bp), intron length ≤1,200 bp, number of loci per query ≤2, and sequence divergence among loci ≥6%. We then blasted the 1,280 Malus queries that fulfilled our selection criteria against the Pyrus communis Bartlett DH genome v.2.0 and applied the same selection criteria; 616 of these queries fulfilled them. Subject sequences from the Malus (799) and Pyrus (764) genomes, matching the chosen mRNAs and representing full loci, were then extracted, and exon-intron boundaries were inferred from their alignments with the blasted queries. After filtering for identical numbers of loci in both genomes, sequence divergence between exons of single-copy loci in Malus and Pyrus ≤15%, and exon length ≥80 bp, we ended up with 713 loci (481 loci if pairs of paralogous loci are treated as single loci), corresponding to 546 mRNAs. Extracted sequences were collapsed at ≥95% similarity and used for bait design. The final exonic probe set covers 2,008,479 bp in total.
This way, we expect the baits to correspond well to the Malinae genomes and to target both orthologs and paralogs.
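
The initial filtering and locus-assignment steps described above can be illustrated with a minimal Python sketch. It is not the authors' script: the `Hit` class, the function names, the example coordinates, and the choice to group hits per subject sequence and strand are all assumptions made for illustration, and "exceeding 70 bp and 10% of query length" is read here as requiring both conditions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST hit (hypothetical container, not a BLAST API)."""
    query: str
    subject: str
    pident: float   # percent identity between query and subject
    length: int     # alignment length in bp
    sstart: int     # start on subject (1-based)
    send: int       # end on subject (1-based)

    @property
    def strand(self) -> str:
        return "+" if self.sstart <= self.send else "-"

    @property
    def lo(self) -> int:
        return min(self.sstart, self.send)

    @property
    def hi(self) -> int:
        return max(self.sstart, self.send)


def passes_initial_filter(hit, query_len, min_len=70, min_frac=0.10, min_pident=80.0):
    """Keep hits longer than 70 bp AND longer than 10% of the query, with >=80% identity."""
    return (hit.length > min_len
            and hit.length > min_frac * query_len
            and hit.pident >= min_pident)


def group_into_loci(hits, max_intron=10_000):
    """Chain hits of one query into loci; a gap above max_intron starts a new locus."""
    by_key = defaultdict(list)
    for h in hits:
        # Assumption: hits of one locus share subject sequence and strand.
        by_key[(h.query, h.subject, h.strand)].append(h)

    loci = defaultdict(list)  # query -> list of loci, each a list of hits
    for (query, _subject, _strand), group in by_key.items():
        group.sort(key=lambda h: h.lo)
        current = [group[0]]
        for h in group[1:]:
            prev_end = max(x.hi for x in current)
            if h.lo - prev_end - 1 > max_intron:
                loci[query].append(current)
                current = [h]
            else:
                current.append(h)
        loci[query].append(current)
    return dict(loci)


def drop_multicopy(loci, max_loci=6):
    """Discard queries whose hits fall into more than max_loci loci."""
    return {q: l for q, l in loci.items() if len(l) <= max_loci}


# Hypothetical example: one 1,000 bp query with five raw hits.
query_lengths = {"q1": 1000}
raw_hits = [
    Hit("q1", "Chr01", 98.0, 200, 1000, 1199),
    Hit("q1", "Chr01", 97.5, 200, 1500, 1699),
    Hit("q1", "Chr01", 85.0, 200, 50000, 50199),
    Hit("q1", "Chr01", 99.0, 60, 2000, 2059),   # shorter than 70 bp
    Hit("q1", "Chr02", 75.0, 300, 100, 399),    # below 80% identity
]
kept = [h for h in raw_hits if passes_initial_filter(h, query_lengths[h.query])]
loci = drop_multicopy(group_into_loci(kept))
```

In this toy input, the first two hits lie 300 bp apart and are chained into one locus, while the third hit is more than 10,000 bp away and forms a second locus. The refined filter (≥90% cover and similarity, hits ≥100 bp, introns ≤1,200 bp, ≤2 loci per query) could be applied to the resulting loci in the same way with stricter thresholds.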

Funding

FWF Austrian Science Fund, Award: P31512

Czech Science Foundation, Award: 16-15134Y

PRIMUS Research Programme of Charles University, Award: PRIMUS/17/SCI/23
