Finding the genome sequences
I downloaded all the genome sequences for Rhizobium (taxid 379) from NCBI. I used a very handy script called genbank_get_genomes_by_taxon.py that is distributed as part of the pyani package (http://widdowquinn.github.io/pyani/). When I first tried this on 12 February 2020 there were 682 genomes. By 24 April there were 799. Since then, there have been a few more, but my first analyses use the 24 April set.
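For anyone wanting to repeat the download, here is a minimal sketch of how the command could be assembled from Python. It is not a record of the exact command I ran. The flags (-o, -t, --email, -v, -l) follow the usage example in the pyani documentation, so check them against the script's --help for your installed version. The output directory, log file name and email address are placeholders.

```python
import shlex


def build_download_command(taxid, outdir, email, logfile=None):
    """Assemble an argument list for pyani's genome download script.

    Flag names are taken from the pyani usage example and should be
    verified with `genbank_get_genomes_by_taxon.py --help`.
    """
    cmd = [
        "genbank_get_genomes_by_taxon.py",
        "-o", outdir,        # directory to write the downloaded genomes to
        "-t", str(taxid),    # NCBI taxon ID (379 = Rhizobium)
        "--email", email,    # contact address passed to NCBI Entrez
    ]
    if logfile is not None:
        cmd += ["-v", "-l", logfile]  # verbose output plus a log file
    return cmd


# Placeholder values: only the taxid comes from the post.
cmd = build_download_command(
    379, "rhizobium_genomes", "my.email@my.domain", "rhizobium_dl.log"
)
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) once pyani is installed. Because the database keeps growing, it is worth noting the date of each run alongside the output.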
Why did I search for Rhizobium and not just Rhizobium leguminosarum? Because the taxonomic labels on database entries cannot be trusted. If we want to be sure of finding all the genomes in the R. leguminosarum cluster (Rlc), we don’t want to miss those labelled (quite legitimately) R. laguerreae or just Rhizobium sp. On the other hand, there are many genomes that are labelled R. leguminosarum although they are way outside the Rlc. These labels may reflect changes in taxonomy, or the strains may simply have been misidentified. I assumed that authors would at least get the genus right, although this turned out not to be completely true. The download included an outlier labelled Rhizobium tropici NFR14 (GCA_00300175) that is actually a Bradyrhizobium!
Of the other 798 downloaded genomes, 388 were described as R. leguminosarum and 231 as Rhizobium sp. The remaining 179 were assigned to 66 other named Rhizobium species, except for one strain that was mysteriously labelled “Pseudomonas sp. SLBN-2”, although the GenBank taxonomy correctly placed it in Rhizobium.
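A tally like this is easy to script once the organism names have been pulled out of the download. The sketch below is one way to do it, not the method I actually used. The function names and category strings are my own, and the example names are invented for illustration.

```python
from collections import Counter


def label_category(organism):
    """Classify an organism name by its genus and species label."""
    words = organism.split()
    if len(words) < 2 or words[0] != "Rhizobium":
        return "other genus label"
    if words[1] == "leguminosarum":
        return "R. leguminosarum"
    if words[1] == "sp.":
        return "Rhizobium sp."
    return "other named Rhizobium species"


def tally_labels(organisms):
    """Count how many organism names fall into each category."""
    return Counter(label_category(name) for name in organisms)


# Invented example names (strain identifiers are hypothetical).
example = [
    "Rhizobium leguminosarum strain A",
    "Rhizobium leguminosarum bv. viciae strain B",
    "Rhizobium sp. strain C",
    "Rhizobium laguerreae strain D",
    "Rhizobium tropici strain E",
    "Pseudomonas sp. SLBN-2",
]

counts = tally_labels(example)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

Note that this only looks at the label, which is exactly the thing that cannot be trusted. It tells us what the submitters called their strains, not what the strains are.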
This is an extraordinary wealth of information, particularly for R. leguminosarum, so we should be able to do some serious genome-based taxonomy. First, we need to check whether the genome sequences are consistent with the taxonomic names they have been given. I’ll tackle that in the next post, and we’ll get a first hint as to whether R. leguminosarum is a definable group within the genus.