Haven’t updated in a while

I haven’t updated the DB in a while, mostly because I noticed a bug in my code while checking for updates. Yes, once again, NCBI changed the format of their data, forcing me to change the way I parse it.

Finally fixed that today, and found that 4.5K of the genomes had been updated. It’ll take a while to get caught up. There were also 300 new genomes, so I am doing those first, then the updates.

Once more unto the breach!


All caught up!

Amazingly, the Codon Usage Bias Database is all caught up. This hasn’t been the case in years. It is now up to date with NCBI’s Microbial Complete Genomes Database.

To understand why this is so unusual, let’s examine the source of genome data for my database. The download process begins with NCBI’s list of Microbial Genomes (Bacterial and Archaeal). I identify all complete genomes and pull their annotated files to my server.
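The filtering step can be sketched roughly like this. This is a minimal Python sketch, not the database's actual pipeline: it assumes NCBI's tab-separated assembly_summary.txt layout, with a '#'-prefixed header line naming columns such as assembly_level and ftp_path. The sample data below is simplified to just those columns.

```python
# Hedged sketch: select complete genomes from an NCBI-style assembly summary.
# Assumption: tab-separated lines, '#' comment lines, with the column names
# given in the last comment line before the data rows.

def complete_genome_urls(summary_text):
    """Return ftp_path values for rows whose assembly_level is 'Complete Genome'."""
    urls = []
    header = None
    for line in summary_text.splitlines():
        if line.startswith("#"):
            # The comment line immediately before the data holds the column names.
            header = line.lstrip("# ").split("\t")
            continue
        fields = dict(zip(header, line.split("\t")))
        if fields.get("assembly_level") == "Complete Genome":
            urls.append(fields.get("ftp_path"))
    return urls

# Simplified sample: the real file has many more columns.
sample = (
    "#assembly_accession\tassembly_level\tftp_path\n"
    "GCA_000005845.2\tComplete Genome\t"
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/GCA_000005845.2\n"
    "GCA_000001234.1\tContig\t"
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/234/GCA_000001234.1\n"
)
print(complete_genome_urls(sample))
```

The annotated files for each selected genome would then be fetched from the returned paths; how the real pipeline does that fetch and parse is not shown here.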


There are now 5534 complete genomes in the CUB-DB. More are actually listed on NCBI, but some have frame shifts that make codon identification impossible, or have too few ribosomal protein-coding genes, which makes many of the CUB computations impossible.

So why is it so difficult to stay up to date? New genomes are added all the time. I must run all of the algorithms against each new genome, and some of them (in particular the GA and mSCCI algorithms: you know, mine) take hours per genome to compute. Add to this that NCBI routinely updates already existing genomes, and you can see that it is a never-ending battle to compute bias levels for the new genomes while staying current on those that have been modified.

But as of this morning, I am all caught up. It’ll probably last about a nanosecond!