Friday, September 11, 2020

Rhizobium leguminosarum 16

 

Average Nucleotide Identity

 

So far, I have only shown ANI values using selected reference strains. Now that we have some potential genospecies that look reasonable in the phylogeny, it is time to see how well they are supported by ANI. I set fastANI running on my trusty iMac to calculate all the ANIs between the 424 distinct genomes in the Rlc. Almost 12 hours later, it came back with 179776 numbers (424 × 424: one for every ordered pair of genomes, including each genome against itself). Here they are:

 

 



I was not able to get the strains in the same order as in the phylogeny presented last time, but the order is based on the same phylogeny (it is the order of strains in the Newick file that describes the phylogeny). Yellow and red colours indicate ANI > 96%, blue colours are ANI < 95%, while values in the range 95-96% are greenish. The genospecies that we defined in the phylogeny stand out as orange-red squares in this ANI plot. I have marked the larger ones – the rest are there, but too small to label at this scale. Strains generally have low (blue) ANI with members of other genospecies. The F-clade looks a little less well resolved, as it also did in the phylogeny. Rhizobium anhuiense is very clearly an outgroup, dark blue with all the Rlc strains.
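For anyone who wants to build this kind of plot, here is a minimal sketch of the bookkeeping, not the script I actually used. It assumes the fastANI output is the usual tab-separated text (query, reference, ANI, then fragment counts) and that the genome names in it match the leaf labels in the Newick file, which in practice usually means trimming file paths first. The function names and colour-band labels are made up for the illustration.

```python
import re


def newick_leaf_order(newick):
    """Return leaf names in the order they appear in a Newick string.

    Assumes unquoted leaf names: a leaf name directly follows '(' or ',',
    whereas internal node labels and support values follow ')'.
    """
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


def ani_matrix(fastani_text, order):
    """Build a square ANI matrix with rows and columns in the given order.

    Pairs that are absent from the output stay as None (fastANI does not
    report pairs whose ANI is far below about 80%).
    """
    index = {name: i for i, name in enumerate(order)}
    size = len(order)
    matrix = [[None] * size for _ in range(size)]
    for line in fastani_text.strip().splitlines():
        query, reference, ani = line.split("\t")[:3]
        if query in index and reference in index:
            matrix[index[query]][index[reference]] = float(ani)
    return matrix


def colour_band(ani):
    """Map an ANI value to the three colour bands described in the text."""
    if ani > 96.0:
        return "yellow-red"
    if ani >= 95.0:
        return "greenish"
    return "blue"


# Toy example with invented strain names and values.
order = newick_leaf_order("((A:0.1,B:0.2):0.05,(C:0.1,D:0.3):0.2);")
toy_output = "A\tA\t100\t10\t10\nA\tB\t97.5\t9\t10\nC\tA\t93.2\t8\t10\n"
matrix = ani_matrix(toy_output, order)
```

The matrix rows can then be passed to any heat-map routine, with the colour scale arranged so that the 95% and 96% boundaries fall where the bands change.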

 

We found that members of a genospecies generally had ANI of 96% or above with the representative strain. Here is the same set of ANI values, but shown with a threshold at 96%:

 

 


 

 

 

Now the genospecies are very clear – they are solid red squares on a clean background. Two strains just above gsB are an exception: they exceed 96% ANI with about half the gsB strains. These are WSM1455 and WSM1481 (gsJ), which we had already noted as very close to gsB (Rhizobium leguminosarum 11). There are also two strains in the F-clade that have ANI > 96% to all members of both gsQ and gsR, although these genospecies are otherwise distinct. We saw this issue earlier, in Rhizobium leguminosarum 13: “we have already assigned SPF2A11 and HP3 to gsQ. In fact, they have ANI > 96 to the reference strains of both clade Q and clade R”. The phylogeny places them in gsQ, but it would be good to know how robust this is, and why these strains have such high ANI with two sets of strains that are otherwise distinct.
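Exceptions like these can be found without staring at the plot. The sketch below is one possible way to do it, assuming the ANI values are held in a dictionary keyed by (query, reference) and that each strain already has a genospecies label from the phylogeny. The function name, strain names and values are all invented for the example.

```python
def cross_group_hits(ani, groups, threshold=96.0):
    """Find strains that reach the threshold with members of another group.

    ani:    dict mapping (query, reference) -> ANI value
    groups: dict mapping strain -> genospecies label
    Returns {strain: {other_label: fraction of that group's members
    with ANI >= threshold}}. Missing pairs count as below the threshold.
    """
    members = {}
    for strain, label in groups.items():
        members.setdefault(label, []).append(strain)

    hits = {}
    for strain, own_label in groups.items():
        for label, strains in members.items():
            if label == own_label:
                continue
            n_high = sum(
                1 for other in strains
                if ani.get((strain, other), 0.0) >= threshold
            )
            if n_high:
                hits.setdefault(strain, {})[label] = n_high / len(strains)
    return hits


# Toy data: strain "x" is labelled Q but is also close to every R strain.
toy_groups = {"q1": "Q", "x": "Q", "r1": "R", "r2": "R"}
toy_ani = {
    ("x", "r1"): 96.4, ("x", "r2"): 96.1,
    ("q1", "r1"): 94.8, ("q1", "r2"): 94.9,
    ("r1", "x"): 96.3, ("r2", "x"): 95.9,
}
toy_hits = cross_group_hits(toy_ani, toy_groups)
```

A fraction of 1.0 corresponds to a strain that matches all members of another genospecies, and a fraction near 0.5 to one that matches about half of them.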
 
The threshold of 96% ANI was chosen because it represented natural gaps in the data, but it is at the top end of the range (95-96%) that taxonomists usually consider to be appropriate for separating species. What would it look like if we set a threshold of 95% instead? Here we are:

 

 

 

 

 

This looks a lot messier. There is partial overlap between gsD and gsE. Genospecies B has swallowed gsJ, gsK and gsI (R. indicum), but with some internal gaps. The F-clade has coalesced into a single group, but with a lot of missing internal points. There are a few more red dots scattered on the background. The other small genospecies are still very distinct. If we adopted this lower threshold, we could reduce the number of genospecies that we defined within the Rlc, though there would still be at least ten, and we would create numerous ambiguities and anomalies. It looks to me as though ANI > 96% gives a much clearer picture that reflects some real structure in the data, and we just have to accept that there are 18+ genospecies in the Rlc.
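The effect of moving the threshold can be illustrated with a small sketch. Note that this is not how the genospecies were defined (they come from the phylogeny): here I simply assume that two strains are linked whenever their ANI reaches the threshold and collect the linked strains into groups (single linkage), then measure how many of the pairs inside a group actually reach the threshold. Function names, strain names and values are invented.

```python
from itertools import combinations


def linked_groups(strains, ani, threshold):
    """Group strains connected by any chain of pairs with ANI >= threshold."""
    parent = {s: s for s in strains}

    def find(s):
        # Follow parent links to the group root, shortening the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (a, b), value in ani.items():
        if value >= threshold and a in parent and b in parent:
            parent[find(a)] = find(b)

    groups = {}
    for s in strains:
        groups.setdefault(find(s), set()).add(s)
    return sorted(groups.values(), key=lambda g: sorted(g))


def completeness(group, ani, threshold):
    """Fraction of within-group pairs that reach the threshold."""
    pairs = list(combinations(sorted(group), 2))
    if not pairs:
        return 1.0
    high = sum(
        1 for a, b in pairs
        if ani.get((a, b), ani.get((b, a), 0.0)) >= threshold
    )
    return high / len(pairs)


# Toy data: j1 is close to b1 but not quite to b2.
toy_strains = ["b1", "b2", "j1", "k1"]
toy_ani = {
    ("b1", "b2"): 97.2, ("b1", "j1"): 95.6,
    ("b2", "j1"): 94.7, ("j1", "k1"): 94.1,
}
groups_96 = linked_groups(toy_strains, toy_ani, 96.0)
groups_95 = linked_groups(toy_strains, toy_ani, 95.0)
merged_fill = completeness(groups_95[0], toy_ani, 95.0)
```

In the toy data, dropping the threshold from 96 to 95 merges j1 into the b group, but only two of the three pairs in the merged group reach 95%, which is the kind of internal gap visible in the plot.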

