The Mobile Elements

Wednesday, October 22, 2014

The surprisingly vacant world of usable tools to predict prokaryotic promoters

A few days ago, I needed to find (predicted) promoter regions in a particular closed bacterial genome. Naively, I thought I'd just waltz into the Internet for Bioinformaticians-Who-Just-Want-To-Get-Shit-Done, find several-to-many tools to accomplish the task, and waltz out.

Ah, silly me.

BDGP. DNF. Berkeley Drosophila Genome Project. "This server runs the 1999 NNPP version 2.2 (March 1999) of the promoter predictor." Um... Also, eukaryotic.

BPROM. FAIL. My full genome file caused BPROM to return a server error. Tried to run 1/5 of the genome and it hung for 5 minutes before I gave up.

ClassProm. FAIL. Interface available, will accept a file upload, results no longer available.

Dragon Genome Explorer. Link Rot.

PePPER. FAIL. Pasted in my genome file and gene coordinates. Run started and immediately reset to the landing page.

PromAn. Link Rot.

PromBase. OK. Fortunately, my genome is in their database of pre-calculated promoters. Unfortunately, I have to download the entire (500 MB) database, and apparently it is hosted on a very slow server (ETA 1-4 hours from my university connection).

PromScan. FAIL. Submitted my genome file, ended up at a 404 Error.

PPP. OK. Requires intergenic region sequences extracted from the genome. Actually yielded a results file, although the data format is horrific.

Virtual Footprint. FAIL. Unable to configure the input settings properly. Seems to require a separate run for each gene of interest, problematic for analyzing an entire genome.

There were a few additional tools that apparently no longer exist, and it would not surprise me if I've missed others. But so far, I'm batting about .150, and not a good .150 either. I'm now faced with a one-off scrape of the PromBase database, or a hefty amount of very specific file conversion scripting for PPP.

Either way, it really cuts into my time saving the world from marauding microbial mobile elements.
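For the PPP route, the first chore is pulling intergenic regions out of the genome. Here is a minimal sketch of that step, not the script I actually ran. It assumes gene coordinates are already available as 1-based, inclusive (start, end) pairs and that the genome is one sequence string; the function names `intergenic_regions` and `extract_sequences` are made up for illustration. It ignores strand and treats the chromosome as linear, so on a circular genome the first and last regions are really one region split at the origin.

```python
def intergenic_regions(genome_length, genes, min_length=1):
    """Return 1-based inclusive (start, end) gaps not covered by any gene.

    genes: iterable of (start, end) pairs, 1-based inclusive, start <= end.
    Overlapping or nested genes are handled by tracking the furthest
    gene end seen so far.
    """
    regions = []
    prev_end = 0  # furthest position covered by a gene so far
    for start, end in sorted(genes):
        gap = start - prev_end - 1
        if gap >= min_length:
            regions.append((prev_end + 1, start - 1))
        prev_end = max(prev_end, end)
    # Trailing region after the last gene (linear coordinates assumed).
    if genome_length - prev_end >= min_length:
        regions.append((prev_end + 1, genome_length))
    return regions


def extract_sequences(sequence, regions):
    """Slice each 1-based inclusive region out of the genome string."""
    return [(start, end, sequence[start - 1:end]) for start, end in regions]


if __name__ == "__main__":
    toy_genome = "ACGT" * 25  # 100 bp stand-in for a real genome
    toy_genes = [(10, 20), (15, 30), (50, 60)]
    for start, end, seq in extract_sequences(
        toy_genome, intergenic_regions(len(toy_genome), toy_genes)
    ):
        print(start, end, len(seq))
```

Raising `min_length` is a cheap way to skip gaps too short to hold a promoter; what cutoff makes sense depends on the organism and the downstream tool.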

Saturday, September 6, 2014

PLoS Resources on Ebola: The Flipboard Choice

Today PLoS (Public Library of Science) announced the creation of a resource that pulls together all Ebola-related articles published in their suite of seven peer-reviewed journals. PLoS articles are and always have been freely available to everyone, so it's not really news that their Ebola research is, well, freely available. Other publishers, traditionally not OA, have generously created their own open access Ebola collections (Science and Science Translational Medicine, for example). This is laudable to a certain degree, but smacks of ambulance chasing. Influenza? TB? Coronavirus? None rise to the level of importance necessary to open the vaunted annals of research?

Anyway, I'm not here to rant about how disease research funded by US government money should be OA. Except I just did, and it should.

I was surprised and intrigued by PLoS' decision to release their Ebola collection on Flipboard. I consider myself a savvy user of social media in science, but I confess I did not see this coming. Is Flipboard a popular tool for disseminating information in science now? Or is PLoS breaking new ground with this?

Wednesday, August 27, 2014

NCBI and the shiny new FTP site

On Tuesday (August 26, 2014), NCBI announced the launch of their revamped NCBI Genomes FTP site, billing it as a "single entry point to access sequence and annotation content of both GenBank and RefSeq genomes data." If you deal with Web interfaces to FTP sites at all, you probably won't be surprised by its overall look-and-feel; Figure 1 shows a screenshot of the parent index:


Figure 1. Parent index of the NCBI Genomes FTP site.

If I drill down into the Bacteria folder, I can access the data files for Rickettsia felis URRWXCal2 (Figure 2) in short order:


Figure 2. Data directory for Rickettsia felis str. URRWXCal2.

This is nice. Not flashy, mind you - kind of a retro, early aughts feel to it - but nice. Data access is always worth several bonus points.

Improved, but still lacking

I'll use the NCBI Genomes FTP site if I know exactly what genome data I want, and if I only want a few files. But two things detract from the general usefulness of the new site:

tl;dr: The current file paths are ok for human browsing, but inconvenient for bioinformatic access. Every subdirectory of Bacteria/ is named with a species name (with underscores replacing spaces), followed by a uid, and the actual data files are labeled with nucleotide accession (NC) numbers. A bit of searching revealed that the uid is actually the BioProject ID, as distinct from the BioProject accession. This seems an odd choice - if a uid is really necessary in the directory name, why not use the taxon id? I'd wager taxon ids are more widely recognized than BioProject IDs. This isn't an issue when browsing via a Web browser, obviously: just ignore the uid. But it would simplify access to the data via wget or curl, e.g., as part of an automated script, if the uids were something readily identifiable and fetchable. I have the same minor complaint about the use of species names as subdirectories - how do I know which alternative spelling of the species name is used? Again, names are nice for browsing, but they can interfere with bioinformatic access. In order to construct the correct path to the actual data, you need to know, or know how to get, the appropriate species name, the BioProject ID, and the NC number. Sigh.
tl;dr: The current directory structure is ok for downloading a few specific files, but really unhelpful for higher-level tasks. My main complaint with the new FTP site is the inability to easily access different logical, and very common, groupings of genomes. Example the First: taxonomy. If I want to download all genomes for the Elusimicrobia, I need to know the exact names of all species therein, including any possible Candidatus, uncultured, or generically labeled genomes (Figure 3). Then I need to download each genome individually. Ouch. Of course, there are bioinformatic solutions to save the day - I could fetch the taxon id for Elusimicrobia, then query for all dependent taxa. But that brings me back to my first complaint, and we've come full circle.

Figure 3. The bottom of the NCBI Genomes FTP Site's Bacteria directory.
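To make the first complaint concrete, here is a rough sketch of what a script has to do before it can hand a URL to wget or curl. Everything specific in it is an assumption on my part: the base URL, the `_uid` separator between species name and BioProject ID, the `.fna` extension, and the function name `genome_file_url` are illustrative guesses at the layout described above, and the example values are placeholders, not a real genome.

```python
# Assumed base location of the Bacteria directory; check the live site.
BASE_URL = "ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria"


def genome_file_url(species, bioproject_id, accession, ext="fna", base=BASE_URL):
    """Build a download URL from the three things you must already know.

    species:       species/strain name as spelled in the directory listing
    bioproject_id: the numeric BioProject ID (the "uid"), not the accession
    accession:     nucleotide accession (NC) number of the replicon
    The "<name>_uid<id>" pattern is an assumption about the naming scheme.
    """
    # Whitespace in the species name becomes underscores.
    directory = "_".join(species.split()) + "_uid" + str(bioproject_id)
    return f"{base}/{directory}/{accession}.{ext}"


if __name__ == "__main__":
    # Placeholder values only; none of these identify a real record.
    print(genome_file_url("Example species str. X1", 12345, "NC_000000"))
```

The point is the function signature rather than the string formatting: three separate identifiers, each looked up from a different place, just to name one file. A path keyed on the taxon id alone would shrink that to one lookup.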

Still A Better Way: PATRIC

Despite improvements in the NCBI Genomes FTP Site, I still prefer the PATRIC Genome Download Tool (Figure 4) for my pure data access needs. It gives me quick access to taxonomic groups of genomes, as well as different data file types and annotation schemes (PATRIC/RAST and RefSeq). And it even compresses the download for faster transfer.

Figure 4. Downloading RefSeq annotations for both Elusimicrobia genomes from PATRIC.

For all its warts, PATRIC is still an easy way to access my genome data. (Full disclosure: I have been employed on the PATRIC contract and contributed to other parts of the PATRIC Web site.)