Data from: Identification and qualification of 500 nuclear, single-copy, orthologous genes for the Eupulmonata (Gastropoda) using transcriptome sequencing and exon capture

Teasdale, Luisa C.; Köhler, Frank; Murray, Kevin D.; O'Hara, Tim; Moussalli, Adnan

Published May 24, 2016 on Dryad. https://doi.org/10.5061/dryad.fn627

Data files

May 24, 2016 version files 466.75 MB

Agalma_best_alignments.tar.gz (756.63 KB)
Agalma_equivalent.tar.gz (653.84 KB)
all_exons_concat_cam.phy (10.78 MB)
Manual_curation_500_genes_seperated_into_exons.tar.gz (2.90 MB)
Manual_curation_500_genes.tar.gz (2.83 MB)
Sphaerospira-Austrochlotitis-120-60-v2_FINAL.fas (2.83 MB)
Trinity_assemblies.tar.gz (446 MB)

Abstract

The qualification of orthology is a significant challenge when developing large, multiloci phylogenetic data sets from assembled transcripts. Transcriptome assemblies have various attributes, such as fragmentation, frameshifts and mis-indexing, which pose problems to automated methods of orthology assessment. Here, we identify a set of orthologous single-copy genes from transcriptome assemblies for the land snails and slugs (Eupulmonata) using a thorough approach to orthology determination involving manual alignment curation, gene tree assessment and sequencing from genomic DNA. We qualified the orthology of 500 nuclear, protein-coding genes from the transcriptome assemblies of 21 eupulmonate species to produce the most complete phylogenetic data matrix for a major molluscan lineage to date, both in terms of taxon and character completeness. Exon capture targeting 490 of the 500 genes (those with at least one exon >120 bp) from 22 species of Australian Camaenidae successfully captured sequences of 2825 exons (representing all targeted genes), with only a 3.7% reduction in the data matrix due to the presence of putative paralogs or pseudogenes. The automated pipeline Agalma retrieved the majority of the manually qualified 500 single-copy gene set and identified a further 375 putative single-copy genes, although it failed to account for fragmented transcripts resulting in lower data matrix completeness when considering the original 500 genes. This could potentially explain the minor inconsistencies we observed in the supported topologies for the 21 eupulmonate species between the manually curated and ‘Agalma-equivalent’ data set (sharing 458 genes). Overall, our study confirms the utility of the 500 gene set to resolve phylogenetic relationships at a range of evolutionary depths and highlights the importance of addressing fragmentation at the homolog alignment stage for probe design.

'Agalma best' alignments

Alignments for a subset of the output of Agalma, run on 21 eupulmonate transcriptomes. The subset comprises the 546 orthologous clusters identified by Agalma in which the orthologous cluster was the only one produced from its homolog cluster and contained sequences for at least 18 taxa. The alignments contain dummy sequences for missing taxa.

Agalma_best_alignments.tar.gz
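
The taxon threshold described above can be checked on an extracted alignment with a few lines of Python. This is a minimal sketch, not the authors' procedure. It assumes that the alignments are in FASTA format and that a dummy sequence consists only of gap or ambiguity characters; neither is stated in this record, so inspect a file first and adjust MISSING_CHARS if needed. All function names are hypothetical.

```python
# Assumption: dummy sequences for missing taxa contain only these characters.
MISSING_CHARS = set("-?NnXx")


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of name -> sequence."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the taxon name.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def is_dummy(seq):
    """True if the sequence holds no real data (under the assumption above)."""
    return all(ch in MISSING_CHARS for ch in seq)


def taxon_occupancy(records):
    """Count the taxa that have a non-dummy sequence in the alignment."""
    return sum(1 for seq in records.values() if not is_dummy(seq))


def passes_threshold(records, min_taxa=18):
    """Check the 'at least 18 taxa' criterion for one alignment."""
    return taxon_occupancy(records) >= min_taxa


if __name__ == "__main__":
    example = ">taxonA\nATGC--\n>taxonB\n------\n>taxonC\nNNNNNN\n"
    records = parse_fasta(example)
    print(taxon_occupancy(records), passes_threshold(records))
```

Counting only non-dummy sequences matters because every alignment contains one row per taxon, so a plain count of rows would always return 21.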

'Agalma equivalent' alignments

Alignments for a subset of the output of Agalma, run on 21 eupulmonate transcriptomes. The subset comprises the 635 orthologous clusters identified by the automated pipeline Agalma that correspond to the 500 nuclear, single-copy, orthologous genes identified by manual curation. The alignments contain dummy sequences for missing taxa.

Agalma_equivalent.tar.gz

Manual curation: 500 gene alignments

The alignments for 500 single-copy, orthologous, nuclear genes across 21 representatives of the eupulmonates. Orthology was assessed through manual curation and gene tree assessment. Each alignment contains a mask in which 'x' denotes regions that were masked out (i.e. removed from further analyses). The alignments contain dummy sequences for missing taxa.

Manual_curation_500_genes.tar.gz
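
To illustrate how the mask could be applied, the sketch below drops every alignment column flagged with 'x'. It rests on assumptions that this record does not confirm: that the mask is stored as an extra row of the alignment, that this row is named "mask" (a hypothetical name), and that it has the same length as the sequences. Check the extracted files and adapt the row name before use.

```python
def apply_mask(records, mask_name="mask", mask_char="x"):
    """Return the alignment with masked columns removed.

    records   : dict of name -> aligned sequence, including the mask row
    mask_name : name of the mask row (hypothetical; adjust to the real files)
    mask_char : character that flags a masked-out column
    """
    mask = records[mask_name]
    # Indices of the columns to keep, i.e. those not flagged in the mask.
    keep = [i for i, ch in enumerate(mask) if ch.lower() != mask_char]
    filtered = {}
    for name, seq in records.items():
        if name == mask_name:
            continue  # the mask row itself is not part of the output
        if len(seq) != len(mask):
            raise ValueError(f"{name}: length differs from the mask")
        filtered[name] = "".join(seq[i] for i in keep)
    return filtered


if __name__ == "__main__":
    alignment = {
        "mask": "--xx--",
        "species1": "ATGCAT",
        "species2": "AT--AT",
    }
    print(apply_mask(alignment))
```

Removing columns by index, rather than replacing them with gaps, keeps all sequences the same length while shortening the alignment to the unmasked regions only.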

Camaenidae alignment

The concatenated alignment of the 2,648 exons that were sequenced from representatives of the family Camaenidae using exon capture. This alignment was used to produce the Camaenidae phylogeny presented in the paper.

all_exons_concat_cam.phy
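
The following sketch shows one way to read a concatenated PHYLIP alignment and report how complete each taxon is. It assumes a relaxed, sequential PHYLIP layout (a header line with the number of taxa and characters, then one line per taxon with the name separated from the sequence by whitespace) and treats gap and ambiguity characters as missing data. The layout of this particular file is not described in this record, so these are assumptions, and the function names are hypothetical.

```python
def parse_relaxed_phylip(text):
    """Parse relaxed sequential PHYLIP text into a dict of name -> sequence."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header: number of taxa, then number of alignment columns.
    ntax, nchar = (int(x) for x in lines[0].split()[:2])
    records = {}
    for ln in lines[1:]:
        name, seq = ln.split(None, 1)
        # Remove any whitespace used to break the sequence into blocks.
        records[name] = "".join(seq.split())
    if len(records) != ntax:
        raise ValueError("number of taxa does not match the header")
    for name, seq in records.items():
        if len(seq) != nchar:
            raise ValueError(f"{name}: length does not match the header")
    return records


def completeness(seq, missing="-?NnXx"):
    """Fraction of positions that are not gaps or ambiguity characters."""
    if not seq:
        return 0.0
    return sum(ch not in missing for ch in seq) / len(seq)


if __name__ == "__main__":
    example = "2 6\nspecies1 ATGC--\nspecies2 ATG CAT\n"
    for name, seq in parse_relaxed_phylip(example).items():
        print(name, round(completeness(seq), 3))
```

If the file turns out to be interleaved rather than sequential, the length check raises an error instead of silently returning truncated sequences.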

Camaenidae_exon_capture_probe_set

This file contains the probes for the Camaenidae exon capture design. These probes target exons from 490 orthologous genes. The probes were designed for use with the MYcroarray MYbaits custom kit, which consists of 120 bp RNA probes.

Sphaerospira-Austrochlotitis-120-60-v2_FINAL.fas

Manual_curation_500_genes_seperated_into_exons

This file contains the alignments for the 500 manually curated genes, separated into one alignment per exon based on the exon boundaries from the Lottia gigantea genome. These alignments include the regions that are masked out in the gene alignments, but the mask itself is not included.

Trinity_assemblies

Transcriptome assemblies for 21 eupulmonate species. The transcriptomes were assembled using the program Trinity.