The genome of the Xingu Scale-backed Antbird (Willisornis vidua nigrigula) reveals lineage-specific adaptations
Mikkelsen, Else K.; Weir, Jason (2020), The genome of the Xingu Scale-backed Antbird (Willisornis vidua nigrigula) reveals lineage-specific adaptations, Dryad, Dataset, https://doi.org/10.5061/dryad.xsj3tx9cq
Antbirds (Thamnophilidae) are a large neotropical family of passerine birds renowned for the ant-following foraging strategies of several members of this clade. The high diversity of antbirds provides ample opportunity for speciation studies; however, these studies can be hindered by the lack of an annotated antbird reference genome. In this study, we produced a high-quality annotated reference genome for the Xingu Scale-backed Antbird (Willisornis vidua nigrigula) using 10X Genomics Chromium linked-reads technology. The assembly is 1.09 Gb, with a scaffold N50 of 12.1 Mb and 17,475 annotated protein-coding genes. We compare the proteome of W. v. nigrigula to those of several other passerines, and produce annotations for two additional antbird genomes in order to identify genes under lineage-specific positive selection and gene families with evidence for significant expansions in antbirds. Several of these genes have functions potentially related to the lineage-specific traits of antbirds, including adaptations for thermoregulation in a humid tropical environment.
This dataset provides the genome assembly and annotation of Willisornis poecilinotus vidua from the manuscript "The genome of the common scale-backed Antbird (Willisornis poecilinotus) reveals lineage-specific adaptations". It contains the following files:
1) GFF format annotations of Willisornis vidua nigrigula, Hypocnemis ochrogyna, and Rhegmatorhina melanosticta
- These files are named "Willisornis_vidua.genome_annotation.gff", "Hypocnemis_ochrogyna.genome_annotation.gff", and "Rhegmatorhina_melanosticta.genome_annotation.gff"
- These GFF files contain annotation information for the locations of protein-coding genes in the genome, as well as locations of repeat-masked sequences.
- They also contain the genomic locations of alignments to known protein-coding genes used for protein prediction during the Maker2 pipeline.
- They also contain functional annotation information listing GO terms and functions predicted for the proteins by InterProScan.
- They also contain the locations of raw ab-initio gene predictions (snap_masked and augustus_masked), which were used to produce the final set of protein predictions.
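Because the GFF files mix final gene models, repeat-masked regions, protein alignments, and raw ab-initio predictions, it can help to tally features by their source and type columns before filtering. The following is a minimal sketch, assuming standard nine-column tab-delimited GFF3. The sample records, scaffold names, and the "maker" source label are hypothetical placeholders; only snap_masked and augustus_masked are named in the description above.

```python
from collections import Counter


def count_features(lines):
    """Count GFF features by (source, type), i.e. columns 2 and 3."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # GFF3 allows an embedded sequence section; stop there
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        counts[(fields[1], fields[2])] += 1
    return counts


# Hypothetical records for illustration only (not taken from the dataset).
sample = [
    "##gff-version 3",
    "scaffold_1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "scaffold_1\tsnap_masked\tmatch\t120\t880\t.\t+\t.\tID=m1",
    "scaffold_1\taugustus_masked\tmatch\t110\t890\t.\t+\t.\tID=m2",
    "scaffold_1\taugustus_masked\tmatch_part\t110\t400\t.\t+\t.\tID=m2p;Parent=m2",
]
counts = count_features(sample)
for (source, feature_type), n in sorted(counts.items()):
    print(source, feature_type, n)
```

To run this on a real file, pass an open file handle, for example count_features(open("Willisornis_vidua.genome_annotation.gff")), and inspect which source labels are actually present before filtering on them.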
2) Sequences of the annotated protein-coding genes, provided in fasta format as both protein (Willisornis.proteins.fasta) and transcript (Willisornis.transcripts.fasta) sequences for Willisornis, Rhegmatorhina, and Hypocnemis.
- Genes for Willisornis vidua nigrigula were named with the prefix "WilPoe" followed by a unique 5-digit number. The gene identifier is followed by the name of the closest BLAST hit to known proteins in the Swiss-Prot or TrEMBL databases.
3) Curated transposable element library in fasta format: "Willisornis_curated_TE_library.fasta"
- The curated elements are named with the prefix "Wilpoe" followed by a 3-digit unique number. This is followed by a "#" symbol, and then the name of the transposable element class, if identified. If identified below the level of class, this identity is given after a "/" symbol.
- The raw uncurated transposable element library is also provided as "Willisornis_uncurated_TE_library.fasta", but it should be used with caution as it contains sequences that were discarded and misclassifications that were corrected during curation.
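The naming scheme for the curated elements (identifier, then "#", then class, then an optional "/" and finer classification) can be split programmatically. This is a minimal sketch under that reading of the description; the example headers and the class names in them are hypothetical, and whether the "#" is present for unclassified elements is not stated, so both cases are handled.

```python
def parse_te_header(header):
    """Split a TE fasta header into (name, te_class, te_subclass).

    Missing parts are returned as None.
    """
    tokens = header.lstrip(">").split()
    if not tokens:
        return None, None, None
    # Text before "#" is the element name; text after it is the classification.
    name, _, classification = tokens[0].partition("#")
    # Within the classification, "/" separates class from finer identity.
    te_class, _, te_subclass = classification.partition("/")
    return name, te_class or None, te_subclass or None


# Hypothetical headers for illustration only.
for example in (">Wilpoe001#LTR/ERVK", ">Wilpoe002#LINE", ">Wilpoe003"):
    print(parse_te_header(example))
```

Grouping the parsed classes with collections.Counter gives a quick summary of the library's composition.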
4) The pseudohaploid reference genome in compressed fasta.gz format: "Willisornis_vidua_nigrigula_JTW1144_700mill.fasta.gz". This is the genome sequence used for annotation and in all analyses.
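As a quick sanity check after download, the assembly size and scaffold N50 reported in the abstract can be recomputed directly from the compressed fasta. The following is a minimal sketch using only the Python standard library; it is not the authors' method, and the function names and the small in-memory demo sequences are hypothetical.

```python
import gzip
import io


def read_lengths(handle):
    """Return the length of every sequence in a fasta text handle."""
    lengths = []
    current = None
    for line in handle:
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line.strip())
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def scaffold_stats(path):
    """Return (total_bases, n50) for a gzip-compressed fasta file."""
    with gzip.open(path, "rt") as handle:
        lengths = read_lengths(handle)
    return sum(lengths), n50(lengths)


# Hypothetical in-memory demo; replace with scaffold_stats("<path to fasta.gz>").
demo = io.StringIO(">scaffold_a\nACGTAC\nGT\n>scaffold_b\nGGCC\n")
lengths = read_lengths(demo)
print(sum(lengths), n50(lengths))
```

Reading line by line keeps memory use low, which matters for an assembly of roughly 1 Gb.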
5) All the code used to run the pipeline, from genome assembly to analysis, is provided in a series of markdown files containing the code and instructions explaining it. Note that the scripts are kept as they were run in this project, so paths to programs and folders are hard-coded and would need to be modified to run on a different computer or dataset.
Steps of the pipeline are split into several pages detailing each step in order (steps cannot be run out of order, as each uses intermediate files from the previous step):
2.1_Repeat Library Construction
Natural Sciences and Engineering Research Council of Canada, Award: 411293437
Natural Sciences and Engineering Research Council of Canada, Award: RGPIN-2016-06538
Natural Sciences and Engineering Research Council of Canada, Award: 492890