The Mobile Elements: August 2014

On Tuesday (August 26, 2014), NCBI announced the launch of their revamped NCBI Genomes FTP site, billing it as a "single entry point to access sequence and annotation content of both GenBank and RefSeq genomes data." If you deal with Web interfaces to FTP sites at all, you probably won't be surprised by its overall look-and-feel; Figure 1 shows a screenshot of the parent index:

Figure 1. Parent index of the NCBI Genomes FTP site.

If I drill down into the Bacteria folder, I can access the data files for Rickettsia felis URRWXCal2 (Figure 2) in short order:

Figure 2. Data directory for Rickettsia felis str. URRWXCal2.

This is nice. Not flashy, mind you - it has kind of a retro, early-aughts feel to it - but nice. Data access is always worth several bonus points.

Improved, but still lacking

I'll use the NCBI Genomes FTP site if I know exactly what genome data I want, and if I only want a few files. But two things detract from the general usefulness of the new site:

tl;dr: The current file paths are OK for human browsing, but inconvenient for bioinformatic access. Every subdirectory of Bacteria/ is named with a species name (with underscores replacing whitespace), followed by a uid, and the actual data files are labeled with Nucleotide Accession (NC) numbers. A bit of searching revealed that the uid is actually the BioProject ID, as distinct from the BioProject Accession. This seems an odd choice - if a uid is really necessary in the directory name, why not use the taxon ID? I'd wager taxon IDs are more widely recognized than BioProject IDs. This isn't an issue when browsing via a Web browser, obviously: just ignore the uid. But it would simplify access to the data via wget or curl, e.g., as part of an automated script, if the uids were something readily identifiable and fetchable. I have the same minor complaint about the use of species names as subdirectories - how do I know which alternative spelling of the species is used? Again, names are nice for browsing, but they can interfere with bioinformatic access. In order to construct the correct path to the actual data, you need to know, or know how to get, the appropriate species name, the BioProject ID, and the NC number. Sigh.
tl;dr: The current directory structure is OK for downloading a few specific files, but really unhelpful for higher-level tasks. My main complaint with the new FTP site is the inability to easily access different logical, and very common, groupings of genomes. Example the First: taxonomy. If I want to download all genomes for the Elusimicrobia, I need to know the exact names of all species therein, including any possible Candidatus, uncultured, or generically labeled genomes (Figure 3). Then I need to download each genome individually. Ouch. Of course, there are bioinformatic solutions to save the day - I could fetch the taxon ID for Elusimicrobia, then query for all dependent taxa. But that brings me back to my first complaint, and we've come full circle.
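
To make the first complaint concrete, here is a minimal Python sketch of what a script has to do before it can hand a URL to wget or curl. It is only an illustration: the base URL, the "_uid" separator, the ".fna" extension, and the function name genome_file_url are my assumptions about the layout, and the uid and accession in the usage line are placeholders, not real identifiers.

```python
from urllib.parse import quote

# Assumed base URL for the Bacteria directory; adjust to the real site layout.
BASE_URL = "ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria"


def genome_file_url(species_name, bioproject_id, accession,
                    extension="fna", base_url=BASE_URL):
    """Build the URL of one data file from the three pieces the path requires."""
    # Directory name: species name with whitespace collapsed to underscores,
    # followed by the uid (the "_uid<ID>" form is an assumption).
    directory = "_".join(species_name.split()) + f"_uid{bioproject_id}"
    # Data files are labeled with the NC number; the extension is assumed.
    return f"{base_url}/{quote(directory)}/{quote(accession)}.{extension}"


if __name__ == "__main__":
    # Placeholder uid and accession, for illustration only.
    print(genome_file_url("Rickettsia felis URRWXCal2", 12345, "NC_000000"))
```

Note that all three arguments have to be looked up somewhere else first, and the species name has to match the spelling used on the site exactly. That lookup is the inconvenient part, not the string formatting.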

Figure 3. The bottom of the NCBI Genomes FTP Site's Bacteria directory.
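
The "query for all dependent taxa" workaround mentioned above boils down to walking a taxonomy tree. The sketch below assumes you already have a child-to-parent mapping of taxon IDs in memory (for example, parsed from a taxonomy dump); the function name descendant_taxa and the toy IDs are hypothetical, and it does not talk to any NCBI service.

```python
from collections import defaultdict, deque


def descendant_taxa(parent_of, root):
    """Return every taxon ID below `root`, given a child -> parent mapping."""
    # Invert the mapping so each parent lists its direct children.
    children = defaultdict(list)
    for child, parent in parent_of.items():
        if child != parent:  # skip a root node that is its own parent
            children[parent].append(child)

    # Breadth-first walk down from the requested taxon.
    found = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            found.append(child)
            queue.append(child)
    return found


if __name__ == "__main__":
    # Toy taxonomy with made-up IDs: 10 is the group of interest.
    parent_of = {1: 1, 10: 1, 11: 10, 12: 10, 13: 11, 20: 1}
    print(sorted(descendant_taxa(parent_of, 10)))
```

Even with the list of descendant taxon IDs in hand, each one still has to be mapped to a species name, a BioProject ID, and an NC number before a download path can be built, which is the circle described above.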

Still A Better Way: PATRIC

Despite improvements in the NCBI Genomes FTP Site, I still prefer the PATRIC Genome Download Tool (Figure 4) for my pure data access needs. It gives me quick access to taxonomic groups of genomes, as well as different data file types and annotation schemes (PATRIC/RAST and RefSeq). And it even compresses the download for faster transfer.

Figure 4. Downloading RefSeq annotations for both Elusimicrobia genomes from PATRIC.

For all its warts, PATRIC is still an easy way to access my genome data. (Full disclosure: I have been employed on the PATRIC contract and contributed to other parts of the PATRIC Web site.)

The Mobile Elements

Wednesday, August 27, 2014

NCBI and the shiny new FTP site
