Tuesday, September 8, 2020

Rhizobium leguminosarum 15

Colouring in the phylogeny

 

Now it is time to put together all the fragments of the Rlc phylogeny that we have seen in recent posts, and see the whole picture. Here it is.


Each of the 18 coloured sections is one of the potential genospecies we have defined, and there are just 7 strains that do not fit into any of these. To orient you, the genospecies A-E are (moving clockwise round the circle):

gsC: light blue

gsD: light green

gsE: mid green

gsA: pink

gsB: orange

The F-clade (with 5 genospecies) is in shades of brown.

The outgroup, R. anhuiense, is dark grey.

The other genospecies can be identified by comparing this tree to those in the previous posts. It is the same phylogeny.

 

I made this tree with iTOL on the web (https://itol.embl.de/). It is the first time I have used it, and it seems potentially powerful, but I have a lot to learn. I tried to export a legend for the colours, but this function did not seem to work, even though I selected the option. The colours were assigned in a hurry; they are entirely arbitrary and certainly not the final choice.
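One possible route to a legend is to supply the colours as an annotation dataset that carries its own legend, rather than relying on the export option. The sketch below is not how the tree above was coloured. It assumes iTOL's colour strip template (`DATASET_COLORSTRIP`, with the `LEGEND_TITLE`, `LEGEND_SHAPES`, `LEGEND_COLORS` and `LEGEND_LABELS` fields) as I understand it from the iTOL help pages, and the strain names and hex colours are placeholders invented for illustration.

```python
# Hypothetical strain-to-genospecies assignments; real tip labels would go here.
strain_to_gs = {
    "strain_001": "gsA",
    "strain_002": "gsB",
    "strain_003": "gsC",
    "strain_004": "gsC",
}

# Placeholder hex colours, loosely matching the colour names used in this post.
gs_colours = {
    "gsA": "#ff99cc",  # pink
    "gsB": "#ff9900",  # orange
    "gsC": "#99ccff",  # light blue
}


def colourstrip_dataset(strain_to_gs, gs_colours, label="Genospecies"):
    """Build the text of a tab-separated colour strip dataset with a legend."""
    groups = sorted(gs_colours)
    lines = [
        "DATASET_COLORSTRIP",
        "SEPARATOR TAB",
        f"DATASET_LABEL\t{label}",
        "COLOR\t#000000",
        f"LEGEND_TITLE\t{label}",
        # One shape code per legend entry (1 is assumed to mean a square).
        "LEGEND_SHAPES\t" + "\t".join("1" for _ in groups),
        "LEGEND_COLORS\t" + "\t".join(gs_colours[g] for g in groups),
        "LEGEND_LABELS\t" + "\t".join(groups),
        "DATA",
    ]
    # One row per tip: tip label, colour, and the genospecies as a label.
    for strain, gs in sorted(strain_to_gs.items()):
        lines.append(f"{strain}\t{gs_colours[gs]}\t{gs}")
    return "\n".join(lines) + "\n"


dataset_text = colourstrip_dataset(strain_to_gs, gs_colours)
print(dataset_text)
```

Saved to a text file, output like this should be loadable onto an uploaded tree as a dataset, provided the tip labels match exactly. Keeping the colours in a single dictionary also makes it easy to change the arbitrary choices later.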

The genospecies vary in the number of genomes that they cover, from 170 strains in gsC to just 2 each in gsJ, gsP and gsS. Of course, this is influenced by sampling bias and may not reflect the relative sizes of the total populations of each species in the world. The 18 genospecies are clades on well-supported branches that are generally fairly long relative to those within the genospecies, which is good because it means that they have well-defined boundaries. It is true, though, that the genospecies vary in apparent ‘depth’. Genospecies C starts closer to the common ancestor than the other genospecies – one could argue that it should be split up to make it more comparable with the others, though the ANI values do not justify this. If we accept that branch lengths on the phylogeny reflect differences in evolutionary rate, it appears that gsC is evolving relatively slowly and the F-clade faster, so a given ANI value reflects more evolutionary time for gsC than for the F-clade. Using ANI as a criterion means basing species on the amount of sequence divergence, rather than on the length of time needed to reach that divergence. I think it can be argued that this is a reasonable choice. On the other hand, the Genome Taxonomy Database (GTDB, http://gtdb.ecogenomic.org/) normalises for differences in evolutionary rate and requires the boundaries of each taxonomic level to fall within a certain band of relative distance from the root of the tree to the branch tips. We will consider how GTDB divides up the Rlc in a future post.
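The point about rate and time can be made concrete with a toy calculation. This is a minimal sketch under a strict-clock assumption, treating (100 - ANI)/100 as substitutions per site and ignoring multiple hits. The rates and the ANI value are made up for illustration and are not estimates for gsC, the F-clade or any real lineage.

```python
def time_to_reach(ani_percent, subs_per_site_per_unit_time):
    """Time for two lineages to diverge to a given ANI under a simple clock.

    Divergence accumulates along both lineages, hence the factor of 2.
    """
    divergence = (100.0 - ani_percent) / 100.0
    return divergence / (2.0 * subs_per_site_per_unit_time)


# Made-up rates: the "fast" clade evolves twice as quickly as the "slow" one.
slow_rate = 1e-9
fast_rate = 2e-9

ani = 96.0  # an arbitrary example value, not a proposed species boundary

t_slow = time_to_reach(ani, slow_rate)
t_fast = time_to_reach(ani, fast_rate)

print(f"slow clade: {t_slow:.3g} time units to reach {ani}% ANI")
print(f"fast clade: {t_fast:.3g} time units to reach {ani}% ANI")
print(f"ratio slow/fast: {t_slow / t_fast:.1f}")
```

With these numbers the slow clade needs twice as long to reach the same ANI, which is the sense in which a fixed ANI criterion measures divergence rather than time.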

 

Eighteen genospecies is a lot for people to get used to. Of course, we could amalgamate some with their neighbours to create a smaller number of genospecies, but this would create units in which some pairwise distances are greater than is usually considered appropriate for members of the same species. We will consider the ANI metrics in the next post.
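The cost of amalgamation can be sketched with a small check on pairwise values. This is only an illustration: the strain names and ANI values are invented, and the 95% cutoff is an assumed figure for the example, not one taken from this analysis.

```python
from itertools import combinations

# Toy pairwise ANI values (%) between hypothetical strains; not real data.
ani = {
    frozenset(("s1", "s2")): 98.5,
    frozenset(("s1", "s3")): 96.4,
    frozenset(("s2", "s3")): 96.1,
    frozenset(("s1", "s4")): 94.2,
    frozenset(("s2", "s4")): 94.0,
    frozenset(("s3", "s4")): 95.3,
}

SPECIES_ANI_THRESHOLD = 95.0  # assumed cutoff, for illustration only


def min_pairwise_ani(strains, ani):
    """Lowest ANI among all pairs of strains in the group."""
    return min(ani[frozenset(pair)] for pair in combinations(strains, 2))


def is_acceptable(strains, ani, threshold=SPECIES_ANI_THRESHOLD):
    """True if every pair in the group is at or above the threshold."""
    return min_pairwise_ani(strains, ani) >= threshold


group_x = ["s1", "s2"]
group_y = ["s3", "s4"]
merged = group_x + group_y

for name, group in [("x", group_x), ("y", group_y), ("x+y", merged)]:
    print(name, min_pairwise_ani(group, ani), is_acceptable(group, ani))
```

Each toy group passes on its own, but the merged group fails because some pairs drawn from opposite groups fall below the cutoff, even though other cross-group pairs are well above it.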
