Many Drosophila genomes have been sequenced and assembled recently, and many more genome sequencing projects are in progress. However, Drosophila harbor bacterial, fungal, and protozoan symbionts, and DNA from these symbionts may be isolated along with host DNA when Drosophila genomes are sequenced. Here, we assess how much sequence comes from these symbionts and whether this contamination affected how the Drosophila genomes were assembled. We find raw sequence from bacterial symbionts and from humans in the Drosophila genome sequence traces we analyzed. Surprisingly, the four most common contaminant species were shared among the Drosophila genomes. However, we find no evidence of bacterial sequences in two published Drosophila genome assemblies.
newcompseq
There is more information in the Pipeline Readme, but here is information specific to this program: 1) BLAST all the individual 454 reads from the genome against the available Drosophila genome assemblies. In my case, I BLASTed each of four files of genomic reads, one from each of four Drosophila species, against the genomes of Drosophila pseudoobscura pseudoobscura and Drosophila persimilis.
The program I used for this analysis is called newcompseq.plx. I wanted to run it in parallel, so I had to rename some of the temporary files that it opens to keep simultaneous runs from overwriting each other. The 454 reads that do not align to either genome are the interesting ones. These interesting sequences are written to the designated output file.
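The filtering idea in this step can be sketched in a few lines. This is only an illustration in Python, not the code in newcompseq.plx (which is Perl). It assumes the BLAST results are available in tabular form with the read ID in the first column, and the function names are hypothetical.

```python
def read_ids_with_hits(tabular_lines):
    """Collect the read IDs that have at least one hit.

    Assumes tab-separated BLAST output with the query (read) ID in the
    first column; blank lines and '#' comment lines are skipped.
    """
    ids = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ids.add(line.split("\t")[0])
    return ids


def unaligned_reads(all_read_ids, *hit_tables):
    """Return the reads that hit none of the reference genomes.

    Each element of hit_tables is the BLAST output against one genome.
    Input order of the reads is preserved.
    """
    aligned = set()
    for table in hit_tables:
        aligned |= read_ids_with_hits(table)
    return [rid for rid in all_read_ids if rid not in aligned]
```

Passing one hit table per reference genome keeps only the reads that failed to align to every one of them, which is the set this step calls interesting.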
bogotana
2) I BLASTed all the interesting sequences output by newcompseq.plx against NCBI's database to see what these strange sequences align to.
The program I used for this step was specific to the species being analyzed. For example, the strange sequences in the Drosophila miranda set of 454 reads were BLASTed against NCBI using a program called miranda.pl.
miranda
2) I BLASTed all the interesting sequences output by newcompseq.plx against NCBI's database to see what these strange sequences align to.
The program I used for this step was specific to the species being analyzed. For example, the strange sequences in the Drosophila miranda set of 454 reads were BLASTed against NCBI using a program called miranda.pl.
per
2) I BLASTed all the interesting sequences output by newcompseq.plx against NCBI's database to see what these strange sequences align to.
The program I used for this step was specific to the species being analyzed. For example, the strange sequences in the Drosophila miranda set of 454 reads were BLASTed against NCBI using a program called miranda.pl.
pseudo
2) I BLASTed all the interesting sequences output by newcompseq.plx against NCBI's database to see what these strange sequences align to.
The program I used for this step was specific to the species being analyzed. For example, the strange sequences in the Drosophila miranda set of 454 reads were BLASTed against NCBI using a program called miranda.pl.
BlastStat
3) The messy BLAST output is then parsed and analyzed by a program called BlastStat.pl. This program reports statistics on what the strange sequences aligned to. It isolates the sequences that aligned to anything other than Drosophila or human and writes the output to a file ending in .stat. These strange sequences are then placed in another file, ending in .weird, for further analysis. Note that the .weird file was actually made by hand: I deleted the summary statistics at the top of each .stat file, leaving only the sequences.
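As a rough illustration of the tallying and filtering described in this step, here is a minimal Python sketch. It is not BlastStat.pl itself. It assumes the BLAST output has already been reduced to (read ID, organism name) pairs, and the names `EXPECTED_PREFIXES`, `is_expected`, and `split_hits` are hypothetical.

```python
from collections import Counter

# Assumption: an organism counts as "expected" if its name starts with
# one of these prefixes (case-insensitive).
EXPECTED_PREFIXES = ("drosophila", "homo sapiens")


def is_expected(organism):
    """True for hits to Drosophila or human."""
    name = organism.lower()
    return any(name.startswith(prefix) for prefix in EXPECTED_PREFIXES)


def split_hits(best_hits):
    """Tally hits per organism and pull out the unexpected ones.

    best_hits is a list of (read_id, organism) pairs, one per read.
    Returns (counts, weird): counts covers every organism, and weird
    holds only the pairs that hit neither Drosophila nor human.
    """
    counts = Counter(organism for _, organism in best_hits)
    weird = [(rid, org) for rid, org in best_hits if not is_expected(org)]
    return counts, weird
```

In this sketch the statistics and the weird sequences come back as two separate values, so nothing has to be trimmed off by hand afterwards.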
compWeird
4) The .weird file is then used to BLAST all the weird sequences against one another to make sure there are no duplicates within my data set. The program I wrote for this comparison is called compWeird.plx. It takes longer to finish than you might expect, so leave plenty of time for this step. compWeird.plx produces a file ending in .single that identifies which sequences are duplicates and which are not.
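One way to turn all-against-all matches into groups of duplicates is sketched below in Python. This is not how compWeird.plx is known to work, and the .single format is not reproduced here. The sketch assumes the self-BLAST has already been reduced to pairs of read IDs judged to be duplicates, and it groups them with a small union-find structure; `duplicate_groups` is a hypothetical name.

```python
def duplicate_groups(read_ids, duplicate_pairs):
    """Group reads that are linked, directly or indirectly, as duplicates.

    read_ids: every read in the data set.
    duplicate_pairs: (id_a, id_b) pairs judged to be duplicates.
    Returns a list of groups; a read with no duplicates forms a group
    of one.
    """
    parent = {rid: rid for rid in read_ids}

    def find(x):
        # Follow parent links to the group representative, shortening
        # the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for rid in read_ids:
        groups.setdefault(find(rid), []).append(rid)
    return list(groups.values())
```

Grouping transitively means that if read a matches b and b matches c, all three land in one group even when a and c were never reported as a pair.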
dupEliminate
5) To eliminate the duplicates, I used a program called dupEliminate.pl. This program finds each set of duplicates and chooses the longest sequence in the set. That longest sequence is then written to a file ending in .noDupSeqs.
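The keep-the-longest rule can be illustrated with a short Python sketch. It is not dupEliminate.pl. It assumes the duplicates have already been collected into groups of read IDs and that the sequences are held in a dictionary keyed by read ID; `longest_per_group` is a hypothetical name.

```python
def longest_per_group(groups, sequences):
    """Keep one representative per duplicate group: the longest sequence.

    groups: list of lists of read IDs (a singleton list for a read with
    no duplicates).
    sequences: dict mapping read ID to its sequence string.
    When lengths tie, the read listed first in the group is kept.
    """
    kept = {}
    for group in groups:
        best = max(group, key=lambda rid: len(sequences[rid]))
        kept[best] = sequences[best]
    return kept
```

For example, with sequences `{"a": "ACGT", "b": "ACGTAC", "c": "GG"}` and groups `[["a", "b"], ["c"]]`, the result keeps reads b and c.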
weirdStat
6) All the species that the strange sequences aligned to are then summarized by a program called weirdStat.pl, which produces a file ending in .speciesSummary. This program reports things like which species were found most often and which were identified only once or twice.
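A summary of this kind can be sketched in Python as follows. This is not weirdStat.pl and does not reproduce the .speciesSummary format. It assumes one organism name per deduplicated strange sequence, and the parameters `top_n` and `rare_max` are hypothetical knobs chosen for the illustration.

```python
from collections import Counter


def species_summary(organisms, top_n=4, rare_max=2):
    """Summarize which organisms the strange sequences aligned to.

    organisms: one organism name per strange sequence.
    Returns (most_common, rare): most_common is a list of (name, count)
    pairs for the top_n most frequent organisms, and rare is a sorted
    list of names seen rare_max times or fewer.
    """
    counts = Counter(organisms)
    most_common = counts.most_common(top_n)
    rare = sorted(name for name, n in counts.items() if n <= rare_max)
    return most_common, rare
```

Running the same summary on each species' data set makes it easy to compare the most frequent organisms across the genomes.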