Wednesday, August 12, 2020

Rhizobium leguminosarum 3

A preliminary genome-based phylogeny

The usual way to make a genome-based phylogeny is to select a set of representative genes, concatenate their sequences, and apply standard phylogenetic methods. The genes used should, if possible, be (a) present in all the genomes, (b) present in a single copy per genome, and (c) unlikely to have a history of horizontal gene transfer. There need to be enough genes that the consensus view will outvote any anomalous behaviour of individual genes. Many different sets of genes have been proposed for this purpose and used in the literature. Here are a few.

(1)   In our paper that defined chromids, Peter Harrison found 305 genes that were shared by all chromid-bearing bacteria (Harrison et al. 2010, http://dx.doi.org/10.1016/j.tim.2009.12.010), and we used that set in our first study of Rlc genome diversity (Kumar et al. 2015, http://dx.doi.org/10.1098/rsob.140133) and for a phylogeny of the 196 genomes in our recent work (Cavassim et al. 2020, https://doi.org/10.1099/mgen.0.000351).

(2)   Parks et al. (2018, http://dx.doi.org/10.1038/nbt.4229) used 120 ubiquitous single-copy proteins for a taxonomy covering 94,759 genomes across the whole domain of Bacteria. They set up the Genome Taxonomy Database (GTDB http://gtdb.ecogenomic.org/, described in a recent article by Parks et al., 2020, https://www.nature.com/articles/s41587-020-0501-8), which is interesting because it proposes taxon boundaries based on normalised phylogenetic depth, and recognises a number of “species” within the Rlc. At some point, we will need to check how their classification maps onto the genospecies structure of the Rlc. Their “species” boundaries may prove useful in deciding on ambiguous cases.

(3)   Na et al. (2018, http://dx.doi.org/10.1007/s12275-018-8014-6) defined a set of 92 genes as their up-to-date bacterial core gene set (UBCG) and provide tools for using them (https://www.ezbiocloud.net/tools/ubcg).

There is undoubtedly considerable overlap among these sets, and many other comparable sets have been proposed in the literature. Any reasonably large set of core genes is likely to give a similar phylogeny, so the choice is probably not critical, but it would be good to try more than one.
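To make criteria (a) and (b) concrete, here is a minimal Python sketch of how a candidate gene set could be screened against a collection of genomes. It assumes you already have a table of copy numbers per gene per genome (for example, from a homology search). The function name `single_copy_core` and the toy data are hypothetical, chosen only for illustration, and criterion (c), horizontal transfer, is not addressed at all.

```python
def single_copy_core(copy_counts):
    """Return genes found in exactly one copy in every genome.

    copy_counts: dict mapping genome name -> dict of gene name -> copy number.
    A gene missing from a genome's dict is treated as having zero copies.
    """
    genomes = list(copy_counts)
    all_genes = set()
    for counts in copy_counts.values():
        all_genes.update(counts)
    return sorted(
        gene
        for gene in all_genes
        if all(copy_counts[genome].get(gene, 0) == 1 for genome in genomes)
    )


# Toy data (hypothetical): nodD is duplicated in one genome and absent from another.
toy_counts = {
    "genomeA": {"recA": 1, "rpoB": 1, "nodD": 2},
    "genomeB": {"recA": 1, "rpoB": 1, "nodD": 1},
    "genomeC": {"recA": 1, "rpoB": 1},
}

core = single_copy_core(toy_counts)
print(core)
```

In this toy case only recA and rpoB pass, because nodD fails both the presence test and the single-copy test.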


For a first try, I have used the 120 Parks proteins. However, I used the DNA sequences rather than the amino acid sequences, because we want to resolve differences within a single genus, not to construct a tree for all bacteria. I first checked for recent uploads and found that there were now 834 Rhizobium genomes (on 25/07/2020). Of these, 797 genome assemblies had all 120 genes, and this is the phylogeny I got:

 

The clade on a yellow background is the genus Rhizobium, as currently defined. I have indicated three major clades within it: tropici, gallicum and leguminosarum/etli. On a green background is the Rhizobium leguminosarum complex (Rlc). The great news is that this is very clearly defined, on a long branch (arrowed), so there is no ambiguity about which strains belong to it and which do not.
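For readers who want to try something similar, the step behind a tree like this (keep only the assemblies that have every marker gene, then join the per-gene alignments end to end) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline actually used here: it assumes each gene has already been aligned separately, and the function name `concatenate_markers` and the toy sequences are hypothetical.

```python
def concatenate_markers(alignments, gene_order):
    """Build a concatenated alignment from per-gene alignments.

    alignments: dict mapping gene name -> dict of genome name -> aligned sequence.
    gene_order: list of gene names giving the order of concatenation.
    Genomes lacking any of the genes are dropped.
    """
    # Each gene's alignment must have a single, consistent length.
    for gene in gene_order:
        lengths = {len(seq) for seq in alignments[gene].values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences for {gene} are not aligned to one length")

    # Keep only genomes that are present in every gene alignment.
    complete = set.intersection(*(set(alignments[gene]) for gene in gene_order))

    return {
        genome: "".join(alignments[gene][genome] for gene in gene_order)
        for genome in sorted(complete)
    }


# Toy data (hypothetical): strainC lacks gene2, so it is excluded.
toy_alignments = {
    "gene1": {"strainA": "ATG-CA", "strainB": "ATGGCA", "strainC": "ATGACA"},
    "gene2": {"strainA": "TTGA", "strainB": "TTGC"},
}

supermatrix = concatenate_markers(toy_alignments, ["gene1", "gene2"])
for genome, sequence in supermatrix.items():
    print(genome, sequence)
```

The resulting dictionary can be written out as FASTA and passed to whichever tree-building program you prefer.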

From now on, our main focus is going to be on the Rlc, but it is worth spending a little time looking at the rest of this tree. The Rlc has a very clear sister taxon, R. anhuiense, with 11 genomes in a tight group. There are many species and a great deal of complexity in the rest of the genus Rhizobium that will need to be sorted out at some future date. At the moment, the assemblies carry the taxon names assigned to them in GenBank, and many of these are clearly incorrect. Many genomes labelled “R. leguminosarum” fall outside the Rlc, in other parts of the leguminosarum/etli clade. For example, CF307 is R. anhuiense, WSM2012 is R. hidalgonense, WSM2304 is R. acidisoli, and CCGM1 is R. phaseoli.

The outgroups in black and grey are other genera in the Rhizobium/Agrobacterium group, but they are very incomplete because they include only the genomes that are mislabelled “Rhizobium” in GenBank! Many more genomes are available under their proper genus names. The exception is Allorhizobium, which NCBI has decided does not exist, so its genomes are all called Rhizobium. The genus names are as assigned in GTDB. The temporary label g__Rhizobium_A is for a genus that does not currently have a name, but includes straminoryzae, rhizoryzae and pseudoryzae (though not oryzae, which is in Neorhizobium).

All these genera need some sorting out now that many genomes are available, but we must not get distracted from the Rlc. In the next post, I’ll consider whether Rhizobium leguminosarum is a species or a species complex.

