README for Resolving a phylogenetic hypothesis for parrots: implications from systematics to conservation, systematic review.

Provost, Kaiya L, Smith, Brian Tilston, and Joseph, Leo.

11 September 2017

This README file describes the files associated with the above publication. For any questions or comments, please contact Kaiya L. Provost at kprovost@amnh.org.

##########################################################################################
#                                          DATA                                          #
##########################################################################################

There are six main sections of the "DATA" portion of this README: "GENBANK DATA", "ALIGNMENTS", "INTRASPECIFIC SAMPLING", "NUCLEOTIDE PARTITIONS", "TREES", and "MAP RASTER FILES".

************************************** GENBANK DATA **************************************

1) "ConcatenatedGbFiles_Parrots_March2017.gb"
This large file contains all of the GenBank data downloaded for use in this publication. It includes all species of parrots as well as four outgroup species. It was downloaded in March 2017.

*************************************** ALIGNMENTS ***************************************

This part of the package contains 15 fasta files. They are found in the zip file "Subset_XX_Genes_100bp.fasta Alignment Files.zip". These are:

1) "Subset_01_Genes_100bp.fasta"
2) "Subset_02_Genes_100bp.fasta"
3) "Subset_03_Genes_100bp.fasta"
4) "Subset_04_Genes_100bp.fasta"
5) "Subset_05_Genes_100bp.fasta"
6) "Subset_06_Genes_100bp.fasta"
7) "Subset_07_Genes_100bp.fasta"
8) "Subset_08_Genes_100bp.fasta"
9) "Subset_09_Genes_100bp.fasta"
10) "Subset_10_Genes_100bp.fasta"
11) "Subset_11_Genes_100bp.fasta"
12) "Subset_12_Genes_100bp.fasta"
13) "Subset_13_Genes_100bp.fasta"
14) "Subset_14_Genes_100bp.fasta"
15) "Subset_15_Genes_100bp.fasta"

These files are all fasta-format alignments of the full supermatrix used in the publication, with "Subset_01_Genes_100bp.fasta" being the complete supermatrix.
They represent the 15 gene-subset alignments described in the text. In each subset, species were only retained if they had data for a threshold number of genes, ranging from 1 to 15 genes. For instance, in "Subset_10_Genes_100bp.fasta", species needed to have data for at least 10 genes to be retained. The fasta files are set up such that for each species, each line represents a different gene. In addition, any alignment positions at which no species has data (i.e., all are "N" or "-") are removed.

********************************* INTRASPECIFIC SAMPLING *********************************

This section contains one CSV file:

1) "Intraspecific_Genetic_Sampling_Citations.csv"
This file includes a broad list of parrot species and whether or not they have been studied at the intraspecific level. Included are the references used to make this determination. This data was used in part to make Figure 5.

********************************** NUCLEOTIDE PARTITIONS *********************************

This section contains two subsections: "CONFIG FILES" and "PARTITION FILES".

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ CONFIG FILES ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

This part of the package contains 15 config files. They are located in the zip file "RunPartitionFinder_Subset_XX_Genes_rclusterf.cfg Config Files.zip".
These are:

1) "RunPartitionFinder_Subset_01_Genes_rclusterf.cfg"
2) "RunPartitionFinder_Subset_02_Genes_rclusterf.cfg"
3) "RunPartitionFinder_Subset_03_Genes_rclusterf.cfg"
4) "RunPartitionFinder_Subset_04_Genes_rclusterf.cfg"
5) "RunPartitionFinder_Subset_05_Genes_rclusterf.cfg"
6) "RunPartitionFinder_Subset_06_Genes_rclusterf.cfg"
7) "RunPartitionFinder_Subset_07_Genes_rclusterf.cfg"
8) "RunPartitionFinder_Subset_08_Genes_rclusterf.cfg"
9) "RunPartitionFinder_Subset_09_Genes_rclusterf.cfg"
10) "RunPartitionFinder_Subset_10_Genes_rclusterf.cfg"
11) "RunPartitionFinder_Subset_11_Genes_rclusterf.cfg"
12) "RunPartitionFinder_Subset_12_Genes_rclusterf.cfg"
13) "RunPartitionFinder_Subset_13_Genes_rclusterf.cfg"
14) "RunPartitionFinder_Subset_14_Genes_rclusterf.cfg"
15) "RunPartitionFinder_Subset_15_Genes_rclusterf.cfg"

These are all config files for PartitionFinder2, one each for the 15 gene-threshold subsets made (see "ALIGNMENTS" section). They are set up to run the rclusterf algorithm.

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ PARTITION FILES ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

There are 15 partition files in this part of the package. They are located in the zip file "best_scheme_Subset_XX_Genes_rcluserf.part Partition files.zip".
These are:

1) "best_scheme_Subset_01_Genes_rclusterf.part"
2) "best_scheme_Subset_02_Genes_rclusterf.part"
3) "best_scheme_Subset_03_Genes_rclusterf.part"
4) "best_scheme_Subset_04_Genes_rclusterf.part"
5) "best_scheme_Subset_05_Genes_rclusterf.part"
6) "best_scheme_Subset_06_Genes_rclusterf.part"
7) "best_scheme_Subset_07_Genes_rclusterf.part"
8) "best_scheme_Subset_08_Genes_rclusterf.part"
9) "best_scheme_Subset_09_Genes_rclusterf.part"
10) "best_scheme_Subset_10_Genes_rclusterf.part"
11) "best_scheme_Subset_11_Genes_rclusterf.part"
12) "best_scheme_Subset_12_Genes_rclusterf.part"
13) "best_scheme_Subset_13_Genes_rclusterf.part"
14) "best_scheme_Subset_14_Genes_rclusterf.part"
15) "best_scheme_Subset_15_Genes_rclusterf.part"

These are the partitions that result when PartitionFinder2 is run on the associated phylip alignments with the associated config files (see "CONFIG FILES" section). Phylip files are not provided but can be easily converted from the files in the "ALIGNMENTS" section.

***************************************** TREES ******************************************

There are two newick files included in this package:

1) "RAxML_AllSubsets_BestTrees_WithBootstraps_100bp_NotParitioned.newick"
This is the original non-partitioned newick. It is the maximum likelihood tree found by RAxML, with support values representing the result from 100 bootstraps.

2) "RAxML_AllSubsets_BestTrees_WithBootstraps_100bp_Partitioned.newick"
This is the partitioned version of #1.

************************************ MAP RASTER FILES ************************************

This part of the package contains multiple raster files in ASCII format. They are found in the zip file "COMBINED_Parrots_XXXX.asc ASCII raster files.zip". Four of these files were used in the main text:

1) "COMBINED_Parrots_NoWithinSpeciesSampling_NotLeastConcern.asc"
This gives the number of species per cell that are both not Least Concern on the IUCN scale and also have not had within-species genetic sampling.
This is associated with Figure 4D.

2) "COMBINED_Parrots_Proportion_GenbankSampled.asc"
This gives the proportion of species per cell that are sampled on GenBank. This is associated with Figure 4B.

3) "COMBINED_Parrots_Proportion_WithinSpeciesSampling.asc"
This gives the proportion of species per cell that have within-species sampling. This is associated with Figure 4C.

4) "COMBINED_Parrots_SpeciesRichness.asc"
This gives the number of parrot species per cell. This is associated with Figure 4A.

The rest of these ASCII files were not used in the main text:

5) "COMBINED_Parrots_GenbankSampled.asc"
This is the number of species per cell that are sampled on GenBank.

6) "COMBINED_Parrots_GenbankSampled_LeastConcern.asc"
This is the number of species sampled on GenBank that are also Least Concern according to the IUCN.

7) "COMBINED_Parrots_GenbankSampled_NotLeastConcern.asc"
This is the converse of #6, the number of species sampled on GenBank that are not Least Concern according to the IUCN.

8) "COMBINED_Parrots_Proportion_GenbankSampled_LeastConcern.asc"
This is as in #6, but the proportion relative to #6 and #7.

9) "COMBINED_Parrots_Proportion_GenbankSampled_NotLeastConcern.asc"
This is as in #7, but the proportion relative to #6 and #7.

10) "COMBINED_Parrots_NotGenbankSampled_LeastConcern.asc"
This is the number of species lacking sampling on GenBank that are Least Concern.

11) "COMBINED_Parrots_NotGenbankSampled_NotLeastConcern.asc"
This is the converse of #10, the number of species lacking sampling on GenBank that are not Least Concern.

12) "COMBINED_Parrots_GenbankSampled_IUCN_Median.asc"
This is the median IUCN rating per cell of GenBank sampled species. IUCN status is calculated such that Least Concern is 1 and Critically Endangered is 5, with Near Threatened, Vulnerable, and Endangered at 2, 3, and 4, respectively.

13) "COMBINED_Parrots_NotGenbankSampled_IUCNMedian.asc"
This is as in #12, except it is the median IUCN of species not sampled on GenBank.
14) "COMBINED_Parrots_GenbankSampled_IUCN_Mean.asc"
This is as in #12, except the mean is calculated rather than the median.

15) "COMBINED_Parrots_WithinSpeciesSampling.asc"
This is the number of species with within-species sampling per cell.

16) "COMBINED_Parrots_WithinSpeciesSampling_LeastConcern.asc"
This is the number of species with within-species sampling that are also Least Concern per cell.

17) "COMBINED_Parrots_WithinSpeciesSampling_NotLeastConcern.asc"
This is the converse of #16, species with within-species sampling that are not Least Concern, per cell.

18) "COMBINED_Parrots_NoWithinSpeciesSampling_LeastConcern.asc"
This is the number of species without within-species sampling that are also Least Concern, per cell.

##########################################################################################
#                                        SCRIPTS                                         #
##########################################################################################

The "SCRIPTS" section of this README contains three subsections: "GENBANK PIPELINE", "MAP SCRIPTS" and "FIGURE MAKING SCRIPTS".

************************************ GENBANK PIPELINE ************************************

These bash scripts, and their associated Python subscripts, form the core of the workflow for downloading GenBank data and turning it into alignments. These are found in the zip file "Genbank Pipeline Main Scripts.zip". For details on individual subscripts, see the section "PIPELINE SUBSCRIPTS". See also the subsection "MISCELLANEOUS GENBANK SUBSCRIPTS".

1) "RunGenbankPipeline_1.sh"
This downloads files from GenBank, concatenates the resulting .gb files, converts them to fasta format, segregates sequences by individual genes, and removes genes that are not of interest (microsatellites, misc features, gaps, tRNAs, unknown loci, etc.). After running this script, the user must manually go through the genes and rename known homologs (e.g., CYTB and CYTOCHROMEB). Renamed genes must follow the format of CYTB_1.fa, CYTB_2.fa, etc.
2) "RunGenbankPipeline_2.sh"
This combines the homologs from the previous steps, removes genes that don't meet a threshold of reads, renames taxa, removes individuals that don't meet a threshold of base pairs, chooses one individual per species, removes genes that don't meet a threshold number of species, and checks for paralogy before aligning. After aligning, alignments should be assessed before proceeding.

3) "RunGenbankPipeline_3.sh"
This calculates the presence-absence of genes per species, concatenates by gene into a supermatrix, calculates the missing sequence per species, and creates gene subsets.

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ PIPELINE SUBSCRIPTS ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

This section of the package includes all of the Python subscripts called by "RunGenbankPipeline_1.sh", "RunGenbankPipeline_2.sh", and "RunGenbankPipeline_3.sh". They are found in the zip file "Genbank Pipeline Subscripts.zip".

First, five scripts are called by "RunGenbankPipeline_1.sh", in this order:

1) "genbankPull.py"
This script takes a query to GenBank and downloads all matches. The authors do not recommend using this script, as without batching (see next entry) some sequences are systematically skipped for unknown reasons.

2) "genbankPull_batch_spp.py"
This is the same as genbankPull_batch.py, except that it downloads GenBank sequences in batches by genera and requires a CSV file of target genera.

3) "concatenateGenbankFiles.py"
This script takes multiple GenBank formatted files (i.e., ending in .gb) and concatenates them into one single large file.

4) "gb2fasta_bygene_betterDelimiters.py"
This converts GenBank formatted files to fasta files. It also adds a flag to the name indicating what gene it represents.

5) "deMultiGene_afterGb2fasta.py"
This takes the fasta files made in the previous step and partitions them by gene, separating unique genes into their own fasta file.
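
As a rough illustration of the by-gene splitting step described for "deMultiGene_afterGb2fasta.py", the sketch below groups fasta records by a gene tag in each header. It is not the packaged script. The header layout (">ACCESSION|GENE|Species name"), the "|" delimiter, and the function name split_fasta_by_gene are all assumptions made for this example; the real scripts may use a different delimiter and field order.

```python
from collections import defaultdict


def split_fasta_by_gene(fasta_text, delimiter="|", gene_field=1):
    """Group fasta records by the gene tag embedded in each header.

    ASSUMPTION: headers look like ">ACCESSION|GENE|Species name".
    This layout is hypothetical and chosen only for illustration.
    Returns a dict mapping upper-cased gene name -> list of (header, sequence).
    """
    records_by_gene = defaultdict(list)
    header, seq_lines = None, []

    # A trailing sentinel header (">") forces the final record to be stored.
    for line in fasta_text.splitlines() + [">"]:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                gene = header[1:].split(delimiter)[gene_field].strip().upper()
                records_by_gene[gene].append((header, "".join(seq_lines)))
            header, seq_lines = line, []
        else:
            # Sequences may be wrapped over several lines; collect them all.
            seq_lines.append(line)

    return dict(records_by_gene)
```

Each key of the returned dictionary could then be written to its own file (for example GENE.fa). Note that this sketch does not merge differently named homologs such as CYTB and CYTOCHROMEB; in the pipeline that renaming is a manual step.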

Second, nine scripts are called by "RunGenbankPipeline_2.sh", in this order:

6) "combineHomologousGenes.py"
This combines fasta files with the same gene prefix (e.g., GENE_1.fa with GENE_2.fa) into a single fasta file. This is needed when genes have multiple names and spellings (e.g., CYTB and CYTOCHROMEB both being converted to CYTB.fa).

7) "removeDupFasta.py"
This script goes through fasta files and deletes duplicate entries.

8) "renameMultinameTaxa_fix.py"
This script, which is optional, will rename taxa within fasta files. It requires a CSV file with two columns in the format of NameToBeReplaced,ReplacementName.

9) "fastaNameShortener_newdelimiter.py"
This script shortens the names of taxa to only accession number and species. It removes subspecies epithets as well.

10) "chooseBestOfSpecies.py"
This script chooses, for each gene, the one individual of each species that has the most complete sequence (i.e., the largest number of base pairs that are not "-" or "N"). In the case of a tie, the first individual in the file is chosen. It removes individuals whose sequences are shorter than a certain threshold of base pairs (as outlined in "RunGenbankPipeline_2.sh").

11) "removeTooManyOutgroups.py"
This script parses fasta files and prunes out loci that have fewer than or equal to some number of non-outgroup taxa. It defaults to 3 (i.e., genes must have at least 4 non-outgroup taxa to be retained). It requires a CSV file specifying outgroups.

12) "addRevComplement.py"
This script adds the reverse complement of each sequence to a fasta file.

13) "muscleTree.py"
This script uses Muscle to check for reverse complementation and extract sequences that are in-phase with each other. It exports a tree of the sequences. If all sequences are homologous, two clades should form and a midpoint-root should separate them. Each clade will be the reverse-complement of the other clade if taken all together.
Any genes that fail this pipeline are assumed to have issues with paralogy and should be manually corrected, if possible.

14) "muscleAlign.py"
This script uses Muscle to create an aligned fasta file.

Last, four scripts are called by "RunGenbankPipeline_3.sh", in this order:

15) "presenceAbsenceMatrixGenes.py"
This script generates a CSV file summarizing whether a species has a gene. It is presence-only and not easy to read. For a human-readable presence-absence table, see below.

16) "presenceAbsenceMatrixGenes_pivot.py"
This script takes the results of presenceAbsenceMatrixGenes.py and makes it a human-readable presence-absence table by means of a pivot table.

17) "concatenateSpecies.py"
This script concatenates alignments by species into a supermatrix. Species that lack a gene have the entire length of that gene coded as missing. It also calculates the missing base pairs per species and exports it. Lastly, it creates a supermatrix with only the species name as an identifier ("shortnames").

18) "trimMissing.py"
This program subsets the full fasta supermatrix so that only species that have at least some threshold number of genes are included. It requires a number for a maximum threshold (i.e., if you give it 4 it will create subsets where species must have 1 gene, 2 genes, 3 genes, 4 genes).

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ MISCELLANEOUS GENBANK SUBSCRIPTS ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Two miscellaneous Python scripts are used outside of the GenBank pipeline:

1) "extractReferencesFromGenbank.py"
This Python script takes a GenBank-formatted file (e.g., "ConcatenatedGbFiles_Parrots_March2017.gb") and extracts all of the references from it to a separate document. Lines 13 and 14 must be edited to point toward the folder the GenBank file is in, and the name of the GenBank file itself.

2) "getUniqueGbAccession.py"
This Python script takes a folder of fasta-formatted files generated at the end of the RunGenbankPipeline.sh pipeline.
It extracts all unique GenBank accession numbers. Line 27 must be edited to point toward the folder the fasta files are in.

*************************************** MAP SCRIPTS **************************************

This part of the package contains three R scripts, which are found in the "Map Making Scripts.zip" zip file:

1) "layerAsciiFiles_IUCN.R"
This calculates means, medians, and variances between multiple raster files. It is specifically designed to work with raster files representing IUCN status, hence the name, but can be used with any such data. IUCN status here was represented by a 1-5 scale where 1 represented Least Concern species and 5 represented Critically Endangered species, with Near Threatened, Vulnerable, and Endangered species as 2, 3, and 4 respectively. No maps from this script were used in the main text.

2) "layerAsciiFiles.R"
This sums multiple raster files cell-by-cell. This was used to calculate species richness per cell and the number of species that were not Least Concern and also not sampled within-species.

3) "rasterDifferences.R"
This calculates proportions of species in State 1 vs State 2, for instance sampled vs non-sampled, per cell.

********************************** FIGURE MAKING SCRIPTS *********************************

This part of the package contains five R scripts devoted to making the figures in the main text and supplementary material. They are found in the "Figure Making Scripts.zip" zip file. These are:

1) "Figure1_Figure2_parrotDataPlots.R"
2) "Figure3_FigureS4_compareNodesBetweenTrees.R"
3) "Figure5_traitsOnTrees-bts.R"
4) "FigureS1_FigureS2_makeTreeWithRaxmlBoots.R"
5) "FigureS3_missingDataPerTaxon.R"

The name of each script indicates which figures it makes (e.g., "Figure3_FigureS4_compareNodesBetweenTrees.R" makes both Figure 3 and Supplementary Figure 4 when run).
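
For readers who want to inspect the ASCII rasters without R, the cell-by-cell operations described under "MAP SCRIPTS" (summing layers, then taking a per-cell proportion) can be sketched in plain Python as below. This is not a translation of the packaged R scripts. It assumes the .asc files follow the common ESRI ASCII grid layout (header lines such as ncols, nrows and NODATA_value, followed by rows of numbers), that every input grid has the same dimensions, and that no-data cells count as zero when summing; how the R scripts treat no-data cells is not stated in this README. The function names are hypothetical.

```python
def read_ascii_grid(text):
    """Parse ESRI-style ASCII grid text into (header dict, list of float rows).

    ASSUMPTION: header lines start with a keyword (ncols, nrows, ...) and
    data lines start with a number.
    """
    header, rows = {}, []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0][0].isalpha():
            header[parts[0].lower()] = float(parts[1])
        else:
            rows.append([float(value) for value in parts])
    return header, rows


def sum_grids(grids, nodata=-9999.0):
    """Sum equally sized grids cell by cell, counting no-data cells as 0."""
    n_rows, n_cols = len(grids[0]), len(grids[0][0])
    total = [[0.0] * n_cols for _ in range(n_rows)]
    for grid in grids:
        for r in range(n_rows):
            for c in range(n_cols):
                if grid[r][c] != nodata:
                    total[r][c] += grid[r][c]
    return total


def proportion_grid(numerator, denominator):
    """Per-cell numerator / denominator; None where the denominator is 0."""
    return [
        [(n / d if d else None) for n, d in zip(num_row, den_row)]
        for num_row, den_row in zip(numerator, denominator)
    ]
```

For example, summing one presence/absence grid per species would give a richness-style count per cell, and dividing a "sampled" count grid by that richness grid would give a per-cell proportion. Cells with no species come back as None rather than a number.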