Wednesday, September 30, 2020

Rhizobium leguminosarum 19

Some more information

 

Many thanks to everyone who responded so quickly to my request for information on the country and host of origin, and especially to Marta Maluk who not only dealt with her own JHI strains but with many others as well. We now have a fairly complete list, and the few remaining gaps are not too important. The Google Sheet is still here, but if you have some changes to suggest, please let me know directly, because I have already downloaded the current state of the spreadsheet and may not notice any further changes on the Google Sheet. My main aim was to get a sense of whether some genospecies were confined to certain regions or hosts. For those genospecies with many strains, this is not generally true, apart from gsA, which only includes clover symbionts so far, though from various locations.

I have searched the genomes for matches to NodD, NodA and NodC sequences representing the three symbiovars viciae, trifolii and phaseoli. This is a useful complement to documentation of host of origin. There are a few isolates that appear to have lost their symbiosis genes in cultivation between isolation and genome sequencing. This is something that has been observed before – it seems that not all symbiosis plasmids are fully stable in culture.
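As a rough illustration of how such a nod-gene search can be turned into a symbiovar call, here is a minimal Python sketch. It assumes the search results have already been reduced to a table of best hits; the strain names, identity values, the 90% cut-off and the function name `call_symbiovar` are all hypothetical and are not taken from the analysis described above.

```python
from collections import defaultdict

# Hypothetical hit table: (genome, gene, symbiovar of the query, percent identity).
# All values are invented for illustration only.
hits = [
    ("strainX", "NodD", "viciae", 99.1),
    ("strainX", "NodD", "trifolii", 81.0),
    ("strainX", "NodA", "viciae", 98.5),
    ("strainX", "NodC", "viciae", 97.9),
    ("strainY", "NodA", "trifolii", 99.4),
    ("strainY", "NodC", "trifolii", 98.8),
]

MIN_IDENTITY = 90.0  # assumed cut-off, not a value from the post


def call_symbiovar(genome, hits, min_identity=MIN_IDENTITY):
    """Return the symbiovar supported by the most nod genes, or None."""
    best = {}  # gene -> (identity, symbiovar) of the best hit for this genome
    for g, gene, symbiovar, identity in hits:
        if g == genome and identity >= min_identity:
            if gene not in best or identity > best[gene][0]:
                best[gene] = (identity, symbiovar)
    votes = defaultdict(int)
    for _, symbiovar in best.values():
        votes[symbiovar] += 1
    if not votes:
        # No nod genes found: the strain may have lost its symbiosis genes.
        return None
    return max(votes, key=votes.get)


for strain in ("strainX", "strainY", "strainZ"):
    print(strain, call_symbiovar(strain, hits))
```

A strain with no hits at all (here `strainZ`) comes back as `None`, which is how isolates that have lost their symbiosis genes would show up in this scheme.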

 

Here is the phylogeny with the addition of the symbiovar data from this nod-gene search, and the strain names have been added, too.

 



I have checked the species assigned to all those strains that are included in the GTDB (http://gtdb.ecogenomic.org/). Some of the more recent accessions are not there yet, but there is good agreement for those that are. GTDB divides the Rlc into ten species plus two single-strain ‘species’, lumping together some closely related species and borderline unique strains that I have argued for keeping separate. For example, they place the whole F-clade in s__Rhizobium laguerreae. There are no direct conflicts between the two schemes, though. Here is the equivalence table.

 

Genospecies                               GTDB_species
anhuiense                                 s__Rhizobium anhuiense
L                                         s__Rhizobium leguminosarum_D
M                                         s__Rhizobium leguminosarum_I
C                                         s__Rhizobium leguminosarum_C
D + CC278f + Norway                       s__Rhizobium leguminosarum_K
E                                         s__Rhizobium leguminosarum
H                                         s__Rhizobium leguminosarum_J
A                                         s__Rhizobium leguminosarum_E
WYCCWR10014                               s__Rhizobium sp001657485
Tri-43                                    s__Rhizobium leguminosarum_M
G                                         not represented
S                                         not represented
I                                         not represented
Q, WSM1689, CCBAU10279, R, P, O, N        s__Rhizobium laguerreae
Vaf12                                     s__Rhizobium sp005860925
K, J, B                                   s__Rhizobium leguminosarum_L
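For anyone who wants to work with this equivalence table programmatically, here is a small Python sketch. The mapping is copied from the table; the variable names are my own, and `None` stands for "not represented". Grouping by GTDB species shows where GTDB lumps together several of our units.

```python
from collections import defaultdict

# Genospecies (or single strain) -> GTDB species, copied from the table.
# None means the unit is not represented in GTDB yet.
gtdb = {
    "anhuiense": "s__Rhizobium anhuiense",
    "L": "s__Rhizobium leguminosarum_D",
    "M": "s__Rhizobium leguminosarum_I",
    "C": "s__Rhizobium leguminosarum_C",
    "D": "s__Rhizobium leguminosarum_K",
    "CC278f": "s__Rhizobium leguminosarum_K",
    "Norway": "s__Rhizobium leguminosarum_K",
    "E": "s__Rhizobium leguminosarum",
    "H": "s__Rhizobium leguminosarum_J",
    "A": "s__Rhizobium leguminosarum_E",
    "WYCCWR10014": "s__Rhizobium sp001657485",
    "Tri-43": "s__Rhizobium leguminosarum_M",
    "G": None,
    "S": None,
    "I": None,
    "Q": "s__Rhizobium laguerreae",
    "WSM1689": "s__Rhizobium laguerreae",
    "CCBAU10279": "s__Rhizobium laguerreae",
    "R": "s__Rhizobium laguerreae",
    "P": "s__Rhizobium laguerreae",
    "O": "s__Rhizobium laguerreae",
    "N": "s__Rhizobium laguerreae",
    "Vaf12": "s__Rhizobium sp005860925",
    "K": "s__Rhizobium leguminosarum_L",
    "J": "s__Rhizobium leguminosarum_L",
    "B": "s__Rhizobium leguminosarum_L",
}

# Invert the mapping: GTDB species -> list of our units placed in it.
by_gtdb = defaultdict(list)
for unit, species in gtdb.items():
    if species is not None:
        by_gtdb[species].append(unit)

# GTDB species that combine more than one of our units.
lumped = {species: units for species, units in by_gtdb.items() if len(units) > 1}
for species, units in lumped.items():
    print(species, "<-", ", ".join(units))
```

Because each of our units falls in exactly one GTDB species, the relationship is many-to-one, which is what "no direct conflicts" amounts to here.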

 

Their taxonomy includes three further species that sound as though they ought to be in the Rlc but are actually more distant. Their s__Rhizobium leguminosarum_G covers WSM2297, which is somewhere close to R. hidalgonense. Their s__Rhizobium leguminosarum_A is for OV483, which is so far away that it is not even in the leguminosarum-etli clade. Their s__Rhizobium sophorae is actually R. sophoriradices – an unfortunate mistake that arose because the first version of the R. sophorae genome was not from the right strain.

 

I can also bring you, hot off the press, my summary figure of the ANI evidence for the 18 genospecies. I have included some of these individual plots in earlier posts, but now we have all 18 plots, in glorious colour. Each plot shows, in rank order, the ANI values for all 440 strains against the reference strain for that genospecies. Larger symbols indicate strains that belong to the genospecies in question, and the colours match the genospecies throughout. It took a few hours of battling with the intricacies of Seaborn FacetGrid to get to this point, but I think the result is pretty.
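The data behind each panel is simple to prepare. Here is a minimal Python sketch of the rank-ordering step for one reference strain; the strain names and ANI values are invented, the function name `rank_order` is hypothetical, and the plotting itself is left out.

```python
def rank_order(ani_to_ref, members):
    """Sort strains by decreasing ANI to one reference strain.

    ani_to_ref: dict mapping strain -> ANI (%) against the reference
    members: set of strains assigned to the reference's genospecies
    Returns a list of (rank, strain, ani, is_member) tuples, rank starting at 1.
    """
    ordered = sorted(ani_to_ref.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, strain, ani, strain in members)
        for rank, (strain, ani) in enumerate(ordered, start=1)
    ]


# Invented values for illustration only.
ani = {"ref": 100.0, "s1": 98.2, "s2": 97.5, "s3": 94.1, "s4": 93.0}
points = rank_order(ani, members={"ref", "s1", "s2"})

for rank, strain, value, is_member in points:
    marker = "large" if is_member else "small"
    print(rank, strain, value, marker)

# The drop in ANI between the last member and the first non-member.
gap = points[2][2] - points[3][2]
```

One such list per reference strain would give one panel of the figure, with the `is_member` flag deciding the symbol size.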





By the way, the figures in this blog are PNG files that you can download and save (using the right-click menu) so that you can take a closer look at them.

 

That’s all for now.

Thursday, September 24, 2020

Rhizobium leguminosarum 18

No comment

 

Last week, I asked my reader(s) for comments on what I had done so far, and ideas for further analysis. So far, I have received no response. Zero. It seems that nobody else is interested in defining the Rlc, and all my readers have deserted me. It may be a single-author publication, after all.

 

There are some small tasks that I will need help with, such as providing the country of origin and isolation host for every strain – something that the people who submitted the genomes are best placed to do. I have created a Google Sheet here that you can add the information to. If that doesn’t work, or demands that you create a Google account, just let me know and I will email you the file. Suggestions for more sophisticated analyses are also welcome.

 

Meanwhile, I have refined the list of genomes to incorporate the new ones and eliminate duplicates and erroneous genomes that do not correspond to the strain. That leaves 440 genomes altogether: 429 are in the Rlc and 11 are in the R. anhuiense outgroup. I have repeated the analyses using this final set. I used the colours defined by Cavassim et al. 2020 (https://doi.org/10.1099/mgen.0.000351) for genospecies A to E, and chose colours for the 13 new genospecies. I worked out how to get the ANI plot in the same order as the phylogeny, and to add keys for the genospecies colours. Here are the results.

 

 

 

 

 


Fig: Phylogeny based on 120 core genes.

 


Fig: Pairwise ANI values for all genomes in the Rlc and R. anhuiense, showing genospecies assignment.

 

 

 

 

 


Fig: ANI values, as in previous figure, but showing values > 96% in black, 95-96% in grey.

Tuesday, September 15, 2020

Rhizobium leguminosarum 17

Questions for you

 

So far, I have identified the Rhizobium leguminosarum species complex (Rlc) as a clearly-defined cluster with over 400 genomes that can be split into 18 putative genospecies plus 7 single strains that have no close relatives. I used a phylogeny of 120 core genes made with FastTree, and Average Nucleotide Identity values based on whole genomes calculated with fastANI. What else should we do to make a convincing and useful description of the Rlc? The aim is to define a set of well-supported genospecies that others can readily assign new strains to, and to set clear criteria for defining additional genospecies in the future.

 

1.     Should we make a phylogeny using a different phylogenetic method, or a different set of core genes? If so, which?

2.     Should we calculate pairwise genome similarity using a different metric, or different software to calculate ANI?

3.     Should we look at all the non-core genes, to identify sets of genospecies-specific genes?

4.     Should we look at recombination rates, to see whether these are higher within than between species? If so, how?

5.     Should we look at plasmid distributions?

6.     Does “species complex” convey the right level of divergence to describe the Rlc? How is the term “species complex” used for other groups of species, and how closely related are the species within them?

7.     What about the single strains with no close relatives? Are they just the first known members of additional genospecies, or are they some kind of short-lived ‘hybrid’ between species, or are they genomes that were not well assembled for some reason? How can we tell?

8.     What other questions do we need to answer?

 

The results so far are based on the genomes available from NCBI on 25 July 2020. I have kept an eye on new releases, and there have been an additional 30 genomes labelled “R. leguminosarum”. I have checked them by fastANI, and 19 are in the Rlc, in genospecies A, B, C and E, so I will add them to the final analyses. The other 11 are outside the Rlc, so we can add them to the list of mislabelled strains and forget about them. Here is the list.

 

Genome file                                            Assignment
R._leguminosarum_DSM_106839_GCF_014202125.1.fna        E
R._leguminosarum_DSM_30141_GCF_014138565.1.fna         E
R._leguminosarum_RCAM0610_GCA_014189555.1.fna          E
R._leguminosarum_RCAM0626_GCA_014189575.1.fna          C
R._leguminosarum_RCAM1365_GCA_014189635.1.fna          A
R._leguminosarum_RCAM2802_GCA_014189655.1.fna          C
R._leguminosarum_SEMIA_4011_GCF_014205785.1.fna        not in Rlc
R._leguminosarum_SEMIA_4016_GCF_014200035.1.fna        not in Rlc
R._leguminosarum_SEMIA_4022_GCF_014200055.1.fna        not in Rlc
R._leguminosarum_SEMIA_4024_GCF_014200075.1.fna        not in Rlc
R._leguminosarum_SEMIA_4025_GCF_014207035.1.fna        not in Rlc
R._leguminosarum_SEMIA_415_GCF_014197955.1.fna         not in Rlc
R._leguminosarum_SEMIA_416_GCF_014197975.1.fna         E
R._leguminosarum_SEMIA_421_GCF_014198005.1.fna         not in Rlc
R._leguminosarum_SEMIA_422_GCF_014198335.1.fna         not in Rlc
R._leguminosarum_SEMIA_430_GCF_014198015.1.fna         not in Rlc
R._leguminosarum_SEMIA_445_GCF_014198115.1.fna         E
R._leguminosarum_SEMIA_449_GCF_014198095.1.fna         E
R._leguminosarum_SEMIA_459_GCF_014198415.1.fna         E
R._leguminosarum_SEMIA_460_GCF_014138515.1.fna         E
R._leguminosarum_SEMIA_463_GCF_014198545.1.fna         E
R._leguminosarum_SEMIA_475_GCF_014198665.1.fna         B
R._leguminosarum_SEMIA_481_GCF_014198655.1.fna         E
R._leguminosarum_SEMIA_482_GCF_014198705.1.fna         not in Rlc
R._leguminosarum_SEMIA_483_GCF_014198695.1.fna         E
R._leguminosarum_SEMIA_485_GCF_014198735.1.fna         E
R._leguminosarum_SEMIA_488_GCF_014206965.1.fna         E
R._leguminosarum_SEMIA_491_GCF_014198795.1.fna         not in Rlc
R._leguminosarum_SEMIA_498_GCF_014198195.1.fna         E
R._leguminosarum_SEMIA_499_GCF_014198835.1.fna         E
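As a quick check on the totals quoted above (19 in the Rlc, 11 outside), here is a short Python tally of the list. The assignments are copied from the list; the dictionary keys are the strain names taken from the file names, and the variable names are my own.

```python
from collections import Counter

# Strain -> assignment, copied from the list of 30 new genomes.
assignments = {
    "DSM_106839": "E", "DSM_30141": "E", "RCAM0610": "E", "RCAM0626": "C",
    "RCAM1365": "A", "RCAM2802": "C",
    "SEMIA_4011": "not in Rlc", "SEMIA_4016": "not in Rlc",
    "SEMIA_4022": "not in Rlc", "SEMIA_4024": "not in Rlc",
    "SEMIA_4025": "not in Rlc", "SEMIA_415": "not in Rlc",
    "SEMIA_416": "E", "SEMIA_421": "not in Rlc", "SEMIA_422": "not in Rlc",
    "SEMIA_430": "not in Rlc", "SEMIA_445": "E", "SEMIA_449": "E",
    "SEMIA_459": "E", "SEMIA_460": "E", "SEMIA_463": "E", "SEMIA_475": "B",
    "SEMIA_481": "E", "SEMIA_482": "not in Rlc", "SEMIA_483": "E",
    "SEMIA_485": "E", "SEMIA_488": "E", "SEMIA_491": "not in Rlc",
    "SEMIA_498": "E", "SEMIA_499": "E",
}

counts = Counter(assignments.values())
in_rlc = sum(n for label, n in counts.items() if label != "not in Rlc")

print("total genomes:", len(assignments))
print("in Rlc:", in_rlc)
print("not in Rlc:", counts["not in Rlc"])
for label in ("A", "B", "C", "E"):
    print("gs" + label, counts[label])
```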

 

 

There is also a new R. laguerreae, but it is just another version of the type strain under a different name. There is a strain, R. sp. WYCCWR11317, that is a new member of gsS. There is a corrected UPM1135. If anybody knows of other new accessions within the Rlc, or is aware of important new genomes that are just about to be made public, please let me know.

 

I hope I still have some readers out there to answer these questions, because this is a project that is important for the whole community of researchers who study R. leguminosarum and its relatives, and I would like to create a publication that will have wide support. I look forward to being overwhelmed by all your comments!

Friday, September 11, 2020

Rhizobium leguminosarum 16

 

Average Nucleotide Identity

 

So far, I have only shown ANI values using selected reference strains. Now that we have some potential genospecies that look reasonable in the phylogeny, it is time to see how well they are supported by ANI. I set fastANI running on my trusty iMac to calculate all the ANIs between the 424 distinct genomes in the Rlc. Almost 12 hours later, it came back with 179776 numbers. Here they are:
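The count follows from comparing every genome with every genome, including itself: 424 × 424 = 179776. For readers who want to handle this kind of output themselves, here is a minimal Python sketch that reads fastANI's tab-separated results (query, reference, ANI, then two fragment counts) into a dictionary. The function name `read_fastani` and the two example lines are hypothetical. As far as I know, fastANI reports nothing for pairs whose ANI is far below 80%, so in other datasets the output can have fewer lines than the full square.

```python
import csv
import io


def read_fastani(handle):
    """Read fastANI tab-separated output into {(query, reference): ANI}."""
    ani = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        query, reference, value = row[0], row[1], float(row[2])
        ani[(query, reference)] = value
    return ani


# All-versus-all comparison of 424 genomes, self comparisons included.
n_genomes = 424
expected_pairs = n_genomes * n_genomes
print(expected_pairs)

# Two invented output lines, standing in for a real results file.
example = io.StringIO(
    "a.fna\ta.fna\t100\t1500\t1500\n"
    "a.fna\tb.fna\t96.4\t1400\t1500\n"
)
pairs = read_fastani(example)
print(pairs[("a.fna", "b.fna")])
```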

 

 



I was not able to get the strains in the same order as in the phylogeny presented last time, but the order is based on the same phylogeny (it is the order of strains in the Newick file that describes the phylogeny). Yellow and red colours indicate ANI > 96%, blue colours are ANI < 95%, while values in the range 95-96% are greenish. You can see that the genospecies that we defined in the phylogeny stand out as orange-red squares in this ANI plot. I have marked the larger ones – the rest are there, but too small to label at this scale. You can see that strains generally have low (blue) ANI with members of other genospecies. You can also see that the F-clade looks a little less well resolved, as it also did in the phylogeny. Rhizobium anhuiense is very clearly an outgroup, dark blue with all the Rlc strains.
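Taking the strain order from the Newick file can be sketched in a few lines of Python. This is only an illustration under simple assumptions: leaf names are unquoted and contain no brackets, commas, colons or semicolons; the tree and ANI values are invented; the function names are hypothetical.

```python
import re


def newick_leaf_order(newick):
    """Return leaf names in the order they appear in a Newick string.

    A leaf name is taken to be any label that directly follows '(' or ','.
    Labels after ')' are internal node labels or support values and are skipped.
    """
    names = (m.strip() for m in re.findall(r"[(,]\s*([^(),:;]+)", newick))
    return [name for name in names if name]


def ordered_matrix(ani, order):
    """Arrange {(query, reference): ANI} as rows and columns in the given order."""
    return [[ani.get((q, r)) for r in order] for q in order]


# Invented tree with support values on the internal nodes.
tree = "((A:0.1,B:0.2)0.98:0.3,(C:0.1,D:0.1)1.00:0.2);"
order = newick_leaf_order(tree)
print(order)

# Invented ANI values; missing pairs come back as None.
ani = {("A", "A"): 100.0, ("A", "B"): 97.0, ("B", "A"): 97.1, ("B", "B"): 100.0}
matrix = ordered_matrix(ani, order[:2])
print(matrix)
```

Note that this gives the order in which leaves are written in the file, which need not match the order in which a tree viewer draws them after it has rotated or ladderised the branches.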

 

We found that members of a genospecies generally had ANI of 96% or above with the representative strain. Here is the same set of ANI values, but shown with a threshold at 96%:

 

 


 

 

 

Now the genospecies are very clear – they are solid red squares on a clean background. Two strains just above gsB are an exception: they exceed 96% ANI with about half the gsB strains. These are WSM1455 and WSM1481 (gsJ), which we had already noted as very close to gsB (Rhizobium leguminosarum 11). There are also two strains in the F-clade that have ANI>96% to all members of both gsQ and gsR, although these genospecies are otherwise distinct. We saw this issue earlier, in Rhizobium leguminosarum 13: “we have already assigned SPF2A11 and HP3 to gsQ. In fact, they have ANI > 96 to the reference strains of both clade Q and clade R”. The phylogeny places them in gsQ, but it would be good to know how robust this is, and why these strains have such high ANI with two sets of strains that are otherwise distinct.
 
The threshold of 96% ANI was chosen because it represented natural gaps in the data, but it is at the top end of the range (95-96%) that taxonomists usually consider to be appropriate for separating species. What would it look like if we set a threshold of 95% instead? Here we are:

 

 

 

 

 

This looks a lot messier. There is partial overlap between gsD and gsE. Genospecies B has swallowed gsJ, gsK and gsI (R. indicum), but with some internal gaps. The F-clade has coalesced into a single group, but with a lot of missing internal points. There are a few more red dots scattered on the background. The other small genospecies are still very distinct. If we adopted this lower threshold, we could reduce the number of genospecies that we defined within the Rlc, though there would still be at least ten, and we would create numerous ambiguities and anomalies. It looks to me as though ANI>96% gives a much clearer picture that reflects some real structure in the data, and we just have to accept that there are 18+ genospecies in the Rlc.
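The effect of moving the threshold can be illustrated with a toy example in Python. This sketch groups strains by single linkage: any two strains with ANI at or above the threshold end up in the same group. That is my own simplification for illustration, not the way the genospecies were actually defined (they come from the phylogeny, checked against ANI), and the strain names and values are invented.

```python
def clusters(strains, ani, threshold):
    """Group strains so that any pair with ANI >= threshold shares a group."""
    parent = {s: s for s in strains}

    def find(s):
        # Follow parent links to the group representative, shortening the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (a, b), value in ani.items():
        if value >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in strains:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(group) for group in groups.values())


# Invented values: b1 and b2 are close, j1 is borderline, x1 is distant.
strains = ["b1", "b2", "j1", "x1"]
ani = {
    ("b1", "b2"): 97.5,
    ("b1", "j1"): 95.6,
    ("b2", "j1"): 95.4,
    ("b1", "x1"): 93.0,
    ("b2", "x1"): 92.8,
    ("j1", "x1"): 93.1,
}

print(clusters(strains, ani, 96.0))
print(clusters(strains, ani, 95.0))
```

At 96% the borderline strain stays on its own; at 95% it is absorbed by its neighbours, which is the kind of merging seen in the plot for gsB and gsJ.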


Tuesday, September 8, 2020

Rhizobium leguminosarum 15

Colouring in the phylogeny

 

Now it is time to put together all the fragments of the Rlc phylogeny that we have seen in recent posts, and see the whole picture. Here it is.

 


 

 

Each of the 18 coloured sections is one of the potential genospecies we have defined, and there are just 7 strains that do not fit into any of these. To orient you, the genospecies A-E are (moving clockwise round the circle):

gsC: light blue

gsD: light green

gsE: mid green

gsA: pink

gsB: orange

The F-clade (with 5 genospecies) is in shades of brown.

The outgroup, R. anhuiense, is dark grey.

The other genospecies can be identified by comparing this tree to those in the previous posts. It is the same phylogeny.

 

I made this tree with iTOL on the web (https://itol.embl.de/). It is the first time I have used this, but it seems potentially powerful. I have a lot to learn. I tried to export a legend for the colours, but this function did not seem to work, despite selecting the option. The colours were assigned in a hurry and are entirely arbitrary and certainly not the final choice.

The genospecies vary in the number of genomes that they cover, from 170 strains in gsC to just 2 in gsJ, gsP and gsS. Of course, this is influenced by sampling bias and may not reflect the relative sizes of the total populations of each species in the world. The 18 genospecies are clades on branches that are well supported and are generally fairly long relative to those within the genospecies, which is good as it means that they have well-defined boundaries. It is true, though, that the genospecies vary in apparent ‘depth’. Genospecies C starts closer to the common ancestor than other genospecies – one could argue that it should be split up to make it more comparable with the others, though the ANI values do not justify this. If we accept that branch lengths on the phylogeny reflect differences in evolutionary rate, it appears that gsC is evolving relatively slowly, and the F-clade is faster, so a given ANI value reflects more evolutionary time for gsC than for the F-clade. Using ANI as a criterion means basing species on the amount of sequence divergence, rather than on the length of time needed to reach that divergence. I think it can be argued that this is a reasonable choice. On the other hand, the Genome Taxonomy Database (GTDB, http://gtdb.ecogenomic.org/) normalises for differences in evolutionary rate and requires the boundaries of each taxonomic level to fall within a certain band of relative distance from the root of the tree to the branch tips. We will consider how GTDB divides up the Rlc in a future post.

 

Eighteen genospecies is a lot for people to get used to. Of course, we could amalgamate some with their neighbours to create a smaller number of genospecies, but this would create units in which some pairwise distances are greater than is usually considered appropriate for members of the same species. We will consider the ANI metrics in the next post.