Taxonomy of rhizobia and agrobacteria: 2020

Monday, December 28, 2020

Rhizobium leguminosarum 22

The manuscript has been submitted

The manuscript "Defining the Rhizobium leguminosarum species complex" was submitted to Genes on 11 December. It is available as a preprint at https://www.preprints.org/manuscript/202012.0297/v1, with the DOI 10.20944/preprints202012.0297.v1. Many thanks to all my coauthors, without whose efforts this project would not have been possible. Now we have to wait for reviews, but in the meantime your comments will be very welcome.

Thursday, October 15, 2020

Rhizobium leguminosarum 21

16S: the full story

The 16S ribosomal RNA sequences of the type strains of Rhizobium laguerreae, R. sophorae, R. ruizarguesonis and R. indicum are all identical to that of the type strain of R. leguminosarum. Even the type strain of the sister taxon R. anhuiense has the same sequence. From this, it would be reasonable to guess that all members of the Rlc had this sequence, but the truth is very different. In fact, I found 18 distinct 16S sequences among the available genomes – though these certainly do not correspond to the 18 genospecies. That does not include a further 5 variants that were only found in a single strain and differed by a single nucleotide from a common variant, which I discounted on the grounds that they might be sequencing errors. There were also three genome assemblies that had no 16S sequence, and three more in which it was incomplete – clearly these are errors in the assembly, since 16S is essential.
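For anyone who wants to repeat this kind of tally, here is a minimal Python sketch of the filtering rule described above. It is not the script used for the analysis. It assumes the 16S sequences have already been extracted and aligned to the same length. The function names, the `min_common` cutoff for what counts as a "common" variant, and the toy sequences are all hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatched columns between two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def tally_variants(seqs_by_strain, min_common=2):
    """Split distinct sequences into accepted variants and suspect singletons.

    A singleton (seen in one strain only) is flagged as a possible sequencing
    error when it differs by exactly one nucleotide from a variant seen in at
    least `min_common` strains. `min_common` is an assumed cutoff.
    """
    counts = Counter(seqs_by_strain.values())
    common = [seq for seq, n in counts.items() if n >= min_common]
    accepted, suspect = {}, {}
    for seq, n in counts.items():
        if n == 1 and any(hamming(seq, c) == 1 for c in common):
            suspect[seq] = n
        else:
            accepted[seq] = n
    return accepted, suspect


# Toy data: made-up 8-base "alignments", not real 16S sequences.
toy = {
    "s1": "ACGTACGT", "s2": "ACGTACGT", "s3": "ACGTACGT",
    "s4": "ACGTACGA",  # singleton, one mismatch from a common variant
    "s5": "TCGAACGT", "s6": "TCGAACGT",
    "s7": "GGGTACCT",  # singleton, several mismatches from every common variant
}
accepted, suspect = tally_variants(toy)
print(len(accepted), "accepted variants;", len(suspect), "suspect singletons")
```

A singleton that is several nucleotides away from every common variant is kept, because it is less likely to be a simple sequencing error.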

The 'type' sequence is certainly the predominant one, found in 286 of the 440 genomes, but there are three places in the 16S that have significant levels of polymorphism within the Rlc. Kumar et al. (2015, http://dx.doi.org/10.1098/rsob.140133) found a single polymorphic site in their sample (position 1069 in their numbering, 1151 in my alignment, which includes the IVS). They found this was T in gsA and gsB, C or A in gsC, A in gsD, C in gsE. With a much larger set of genomes, this remains broadly true, though the picture is less clear-cut and the fourth possible nucleotide, G, is also found. The type strains have the C variant. This nucleotide is in a loop, so is not paired in the 16S rRNA secondary structure. The second polymorphism is in a stem, so involves a complementary pair of nucleotides at positions 1023 and 1036 in the alignment. These are T and A in the type sequence, but C and G in all members of gsR (R. laguerreae) except, ironically, the type strain FB206. The C:G variant is also common in other F-clade genospecies, as well as in all gsM strains and one gsL.
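A per-genospecies tally at a single alignment column, like the one summarised above for position 1151, can be sketched as follows. This is only an illustration under stated assumptions: the alignment is a dictionary of equal-length strings, positions are 1-based alignment columns, and the function name, strain names and five-base toy sequences are invented.

```python
from collections import Counter, defaultdict


def column_by_group(alignment, groups, position):
    """Count the nucleotide at a 1-based alignment column within each group."""
    tallies = defaultdict(Counter)
    for strain, seq in alignment.items():
        tallies[groups[strain]][seq[position - 1]] += 1
    return {group: dict(counts) for group, counts in tallies.items()}


# Toy alignment (made up): column 3 is the polymorphic site.
toy_alignment = {"a1": "ACTGA", "a2": "ACTGA", "c1": "ACCGA", "c2": "ACAGA"}
toy_groups = {"a1": "gsA", "a2": "gsA", "c1": "gsC", "c2": "gsC"}

site = column_by_group(toy_alignment, toy_groups, 3)
print(site)
```

Calling the same function for two columns, such as 1023 and 1036, and comparing the results strain by strain would show whether a stem pair changes in a complementary way.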

The third polymorphism is the long intervening sequence that I discussed in the last post. After publishing that post, I located the reference that had slipped my mind. It is a nice paper from Raúl Rivas’s group in Salamanca, published last year (Flores-Félix et al. 2019, https://doi.org/10.1016/j.syapm.2018.10.009). They found the extra sequence in a number of strains, including three of the eleven genomes that I have just rediscovered it in, and have a very nice discussion of this. If I understand the paper correctly, they found that the IVS is excised in the RNA and the molecule is rejoined – it does not remain split as I imagined. The paper also refers to the literature on IVS in rRNA genes, and reminded me that the first published report in rhizobia (in what is now R. leucaenae) was by Anne Willems and Dave Collins back in 1993 (https://doi.org/10.1099/00207713-43-2-305). I decided that I did not have enough material to write a paper about the IVS I had found in R. leguminosarum in 1991, so I just submitted the sequence to GenBank in 1994 (accession U09271). The 11 genomes that have the IVS are all in the F-clade, but they are not a monophyletic group. Two of the strains have a single nucleotide variant within the IVS, but these strains are not neighbours.

The variation I have just described accounts for 9 of the 18 variants I claimed at the start. The other 9 involve a variety of other locations in the sequence, but occur only in one or two strains each.

I have tried to capture the 16S variation by adding to the phylogeny. Maybe the result is rather complex, but I hope it is more informative than just showing 18 arbitrary symbols for the variants.

Next week, I plan to start writing all this up as a manuscript, so I may not have new analyses to share with you. If anyone wants to try their own analyses (whether or not for potential inclusion in the manuscript), I can provide a link to a folder with all 440 genome sequences.

Thursday, October 8, 2020

Rhizobium leguminosarum 20

A 16S flashback

In November 1991, Helen Downer and I were sequencing 16S genes of rhizobia. We used a recently-invented process called PCR (Saiki et al. 1988 http://dx.doi.org/10.1126/science.239.4839.487) and primers Y1 and Y2 that I had designed to amplify the first part of the gene (Young et al. 1991 http://dx.doi.org/10.1128/jb.173.7.2271-2277.1991). Then we sequenced the products by hand using big gels, X-ray film and 32P radioisotope. The PCR product was normally 308-312 bp, but we were intrigued by one pea-nodulating strain, SP18, that gave a much longer product. When we sequenced it, we found that the extra DNA was in a region that was normally conserved. The first stem-loop in the secondary structure of Rhizobium 16S rRNA usually looks like this (taken from my 1991 lab book):

The CCCC….GGGG stem is found in most Rhizobium and in Sinorhizobium. The GCAA loop is even more conserved in most Alphaproteobacteria, but instead of GCAA, strain SP18 had:

TCCTTCAAGCAAGCTTGAAG-ATTTTTATCCTTGGAAAGGAAGATCAAGAAGAGCTTCTAAGAAGCTTTCTTGATGGA

A few months later, I left the John Innes Centre for the University of York and got involved in new projects, so I never published this strange sequence. Last week, I started to look at conservation of the 16S sequence in the 429 Rlc genomes, but was motivated to dig out my old lab records because I saw a similar ‘extra’ sequence in a few genomes. In fact, not just similar, but identical, apart from an additional ‘G’ where I have shown ‘-’ in the SP18 sequence (almost certainly, this was an error in our manual sequence, which was based on a single read). There are 11 genomes with the extra sequence; they are all in genospecies O, P and Q, but not all genomes in these genospecies have it.

The first 16 bases of this long ‘loop’ sequence are complementary to the last 16 (except a couple of ‘bulges’), so would be expected to extend the stem structure, but what kind of secondary structure would be adopted by the rest of the sequence is unclear. This is what I got when I sent the sequence to an RNA structure prediction site (http://rna.urmc.rochester.edu/RNAstructureWeb/):

The red part at the bottom is the conserved stem shown in the previous figure; the rest of the structure is speculative.
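The complementarity between the two ends of the SP18 insert can be checked with a few lines of Python. This is a naive, ungapped comparison at the DNA level, not a structure prediction: it ignores bulges and the G:U wobble pairs that are possible in RNA. The variable names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


# SP18 insert as given above; '-' marks where the genomes have an extra 'G'.
SP18_INSERT = (
    "TCCTTCAAGCAAGCTTGAAG-ATTTTTATCCTTGGAAAGGAAGATCAAGAAGAGCTTCTAAGAAGCTTTCTTGATGGA"
)

seq = SP18_INSERT.replace("-", "")
head = seq[:16]                            # first 16 bases of the insert
tail_rc = reverse_complement(seq[-16:])    # last 16 bases, read as the opposite strand
matches = sum(a == b for a, b in zip(head, tail_rc))

print(head)
print(tail_rc)
print(f"{matches}/16 positions are Watson-Crick complementary without gaps")
```

Positions where the two printed lines differ are candidates for the unpaired nucleotides in the extended stem.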

I am hoping that you, my readers, can help me here. I think I have seen publications fairly recently that have described similar ‘long’ sequences in this location of 16S, but I cannot remember where. Can someone point us to relevant papers? I also have a suspicion that the 16S rRNA may be cleaved within this sequence and exist as two disconnected strands within the ribosome, but I can’t remember whether someone else showed that or it was our own unpublished observation of an unexpected pattern of rRNA bands in nucleic acid preps.

All this is something of a digression. I just wanted to record the 16S sequences of all the strains because this is something that taxonomists like to look at, and I thought the result was going to be boring and uninformative. It turns out that there is more 16S sequence variation than I expected. There are also a few genome assemblies with broken 16S sequences or no 16S at all (!), and it is taking me a while to sort those out, so the ‘boring’ consideration of 16S variation will have to wait until the next post.

Wednesday, September 30, 2020

Rhizobium leguminosarum 19

Some more information

Many thanks to everyone who responded so quickly to my request for information on the country and host of origin, and especially to Marta Maluk who not only dealt with her own JHI strains but with many others as well. We now have a fairly complete list, and the few remaining gaps are not too important. The Google Sheet is still here, but if you have some changes to suggest, please let me know directly, because I have already downloaded the current state of the spreadsheet and may not notice any further changes on the Google Sheet. My main aim was to get a sense of whether some genospecies were confined to certain regions or hosts. For those genospecies with many strains, this is not generally true, apart from gsA, which only includes clover symbionts so far, though from various locations.

I have searched the genomes for matches to NodD, NodA and NodC sequences representing the three symbiovars viciae, trifolii and phaseoli. This is a useful complement to documentation of host of origin. There are a few isolates that appear to have lost their symbiosis genes in cultivation between isolation and genome sequencing. This is something that has been observed before – it seems that not all symbiosis plasmids are fully stable in culture.

Here is the phylogeny with the addition of the symbiovar data from this nod-gene search, and the strain names have been added, too.

I have checked the species assigned to all those strains that are included in the GTDB (http://gtdb.ecogenomic.org/). Some of the more recent accessions are not there yet, but there is good agreement for those that are. GTDB divides the Rlc into ten species plus two single-strain ‘species’, lumping together some closely related species and some borderline unique strains that I have argued for keeping separate. For example, they place the whole F-clade in s__Rhizobium laguerreae. There are no direct conflicts between the two schemes, though. Here is the equivalence table.

Genospecies	GTDB_species
anhuiense	s__Rhizobium anhuiense
L	s__Rhizobium leguminosarum_D
M	s__Rhizobium leguminosarum_I
C	s__Rhizobium leguminosarum_C
D + CC278f + Norway	s__Rhizobium leguminosarum_K
E	s__Rhizobium leguminosarum
H	s__Rhizobium leguminosarum_J
A	s__Rhizobium leguminosarum_E
WYCCWR10014	s__Rhizobium sp001657485
Tri-43	s__Rhizobium leguminosarum_M
G	not represented
S	not represented
I	not represented
Q, WSM1689, CCBAU10279, R, P, O, N	s__Rhizobium laguerreae
Vaf12	s__Rhizobium sp005860925
K, J, B	s__Rhizobium leguminosarum_L

Their taxonomy includes three further species that sound as though they ought to be in the Rlc but are actually more distant. Their s__Rhizobium leguminosarum_G covers WSM2297, which is somewhere close to R. hidalgonense. Their s__Rhizobium leguminosarum_A is for OV483, which is so far away that it is not even in the leguminosarum-etli clade. Their s__Rhizobium sophorae is actually R. sophoriradices – an unfortunate mistake that arose because the first version of the R. sophorae genome was not from the right strain.

I can also bring you, hot off the press, my summary figure of the ANI evidence for the 18 genospecies. I have included some of these individual plots in earlier posts, but now we have all 18 plots, in glorious colour. Each plot shows, in rank order, the ANI values for all 440 strains against the reference strain for that genospecies. Larger symbols indicate strains that belong to the genospecies in question, and the colours match the genospecies throughout. It took a few hours of battling with the intricacies of Seaborn FacetGrid to get to this point, but I think the result is pretty.
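The data behind one such panel can be prepared along these lines. This is a rough sketch rather than the plotting code used for the figure: it only builds the rank-ordered table for one reference strain and leaves the drawing to whatever plotting library you prefer. The function name, strain names and ANI values are made up for illustration.

```python
def ranked_ani(ani_to_reference, genospecies_of, target):
    """Return (rank, strain, ani, is_member) tuples, highest ANI first.

    `is_member` marks strains assigned to the target genospecies, which is
    what the larger symbols in the figure indicate.
    """
    ordered = sorted(ani_to_reference.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (rank, strain, ani, genospecies_of[strain] == target)
        for rank, (strain, ani) in enumerate(ordered, start=1)
    ]


# Toy values (invented): ANI of five strains against a gsA reference strain.
toy_ani = {"refA": 100.0, "a2": 98.7, "b1": 94.1, "a3": 97.9, "c1": 93.5}
toy_gs = {"refA": "A", "a2": "A", "a3": "A", "b1": "B", "c1": "C"}

rows = ranked_ani(toy_ani, toy_gs, "A")
for rank, strain, ani, is_member in rows:
    print(rank, strain, ani, "member" if is_member else "other")
```

Repeating this for each reference strain gives one table per panel, with rank on the x-axis and ANI on the y-axis.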

By the way, the figures in this blog are PNG files that you can download and save (using the right-click menu) so that you can take a closer look at them.

That’s all for now.

Thursday, September 24, 2020

Rhizobium leguminosarum 18

No comment

Last week, I asked my reader(s) for comments on what I had done so far, and ideas for further analysis. So far, I have received no response. Zero. It seems that nobody else is interested in defining the Rlc, and all my readers have deserted me. It may be a single-author publication, after all.

There are some small tasks that I will need help with, such as providing the country of origin and isolation host for every strain – something that the people who submitted the genomes are best placed to do. I have created a Google Sheet here that you can add the information to. If that doesn’t work, or demands that you create a Google account, just let me know and I will email you the file. Suggestions for more sophisticated analyses are also welcome.

Meanwhile, I have refined the list of genomes to incorporate the new ones and eliminate duplicates and erroneous genomes that do not correspond to the strain. That leaves 440 genomes altogether: 429 are in the Rlc and 11 are in the R. anhuiense outgroup. I have repeated the analyses using this final set. I used the colours defined by Cavassim et al. 2020 (https://doi.org/10.1099/mgen.0.000351) for genospecies A to E, and chose colours for the 13 new genospecies. I worked out how to get the ANI plot in the same order as the phylogeny, and to add keys for the genospecies colours. Here are the results.

Fig: Phylogeny based on 120 core genes.

Fig: Pairwise ANI values for all genomes in the Rlc and R. anhuiense, showing genospecies assignment.

Fig: ANI values, as in previous figure, but showing values > 96% in black, 95-96% in grey.

Tuesday, September 15, 2020

Rhizobium leguminosarum 17

Questions for you

So far, I have identified the Rhizobium leguminosarum species complex (Rlc) as a clearly defined cluster with over 400 genomes that can be split into 18 putative genospecies plus 7 single strains that have no close relatives. I used a phylogeny of 120 core genes made with FastTree, and Average Nucleotide Identity values based on whole genomes calculated with fastANI. What else should we do to make a convincing and useful description of the Rlc? The aim is to define a set of well-supported genospecies that others can readily assign new strains to, and to set clear criteria for defining additional genospecies in the future.

1. Should we make a phylogeny using a different phylogenetic method, or a different set of core genes? If so, which?

2. Should we calculate pairwise genome similarity using a different metric, or different software to calculate ANI?

3. Should we look at all the non-core genes, to identify sets of genospecies-specific genes?

4. Should we look at recombination rates, to see whether these are higher within than between species? If so, how?

5. Should we look at plasmid distributions?

6. Does “species complex” convey the right level of divergence to describe the Rlc? How is the term “species complex” used for other groups of species, and how closely related are the species within them?

7. What about the single strains with no close relatives? Are they just the first known members of additional genospecies, or are they some kind of short-lived ‘hybrid’ between species, or are they genomes that were not well assembled for some reason? How can we tell?

8. What other questions do we need to answer?

The results so far are based on the genomes available from NCBI on 25 July 2020. I have kept an eye on new releases, and there have been an additional 30 genomes labelled “R. leguminosarum”. I have checked them by fastANI, and 19 are in the Rlc, in genospecies A, B, C and E, so I will add them to the final analyses. The other 11 are outside the Rlc, so we can add them to the list of mislabelled strains and forget about them. Here is the list.

R._leguminosarum_DSM_106839_GCF_014202125.1.fna	E
R._leguminosarum_DSM_30141_GCF_014138565.1.fna	E
R._leguminosarum_RCAM0610_GCA_014189555.1.fna	E
R._leguminosarum_RCAM0626_GCA_014189575.1.fna	C
R._leguminosarum_RCAM1365_GCA_014189635.1.fna	A
R._leguminosarum_RCAM2802_GCA_014189655.1.fna	C
R._leguminosarum_SEMIA_4011_GCF_014205785.1.fna	not in Rlc
R._leguminosarum_SEMIA_4016_GCF_014200035.1.fna	not in Rlc
R._leguminosarum_SEMIA_4022_GCF_014200055.1.fna	not in Rlc
R._leguminosarum_SEMIA_4024_GCF_014200075.1.fna	not in Rlc
R._leguminosarum_SEMIA_4025_GCF_014207035.1.fna	not in Rlc
R._leguminosarum_SEMIA_415_GCF_014197955.1.fna	not in Rlc
R._leguminosarum_SEMIA_416_GCF_014197975.1.fna	E
R._leguminosarum_SEMIA_421_GCF_014198005.1.fna	not in Rlc
R._leguminosarum_SEMIA_422_GCF_014198335.1.fna	not in Rlc
R._leguminosarum_SEMIA_430_GCF_014198015.1.fna	not in Rlc
R._leguminosarum_SEMIA_445_GCF_014198115.1.fna	E
R._leguminosarum_SEMIA_449_GCF_014198095.1.fna	E
R._leguminosarum_SEMIA_459_GCF_014198415.1.fna	E
R._leguminosarum_SEMIA_460_GCF_014138515.1.fna	E
R._leguminosarum_SEMIA_463_GCF_014198545.1.fna	E
R._leguminosarum_SEMIA_475_GCF_014198665.1.fna	B
R._leguminosarum_SEMIA_481_GCF_014198655.1.fna	E
R._leguminosarum_SEMIA_482_GCF_014198705.1.fna	not in Rlc
R._leguminosarum_SEMIA_483_GCF_014198695.1.fna	E
R._leguminosarum_SEMIA_485_GCF_014198735.1.fna	E
R._leguminosarum_SEMIA_488_GCF_014206965.1.fna	E
R._leguminosarum_SEMIA_491_GCF_014198795.1.fna	not in Rlc
R._leguminosarum_SEMIA_498_GCF_014198195.1.fna	E
R._leguminosarum_SEMIA_499_GCF_014198835.1.fna	E
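A check of this kind can be scripted from fastANI's tab-separated output, in which the first three columns are the query genome, the reference genome and the ANI value. The sketch below is one possible way to do it, not a record of how the list above was produced. The 96% cutoff is an assumption borrowed from the ANI figure in the previous post, and the file names, ANI values and the "unassigned" label are invented.

```python
def best_hits(fastani_lines):
    """Map each query genome to its (reference, ANI) pair with the highest ANI."""
    best = {}
    for line in fastani_lines:
        if not line.strip():
            continue
        query, reference, ani = line.rstrip("\n").split("\t")[:3]
        ani = float(ani)
        if query not in best or ani > best[query][1]:
            best[query] = (reference, ani)
    return best


def assign(best, ref_genospecies, threshold=96.0):
    """Give each query its best reference's genospecies if ANI >= threshold.

    The threshold is an assumed cutoff; queries below it are left unassigned.
    """
    return {
        query: (ref_genospecies[ref] if ani >= threshold else "unassigned")
        for query, (ref, ani) in best.items()
    }


# Toy fastANI-style lines (invented file names and values).
toy_lines = [
    "new1.fna\trefE.fna\t98.2\t1900\t2000",
    "new1.fna\trefC.fna\t94.0\t1700\t2000",
    "new2.fna\trefE.fna\t88.5\t1200\t2100",
    "new2.fna\trefC.fna\t89.1\t1250\t2100",
]
toy_refs = {"refE.fna": "E", "refC.fna": "C"}

calls = assign(best_hits(toy_lines), toy_refs)
print(calls)
```

A genome left unassigned here is not necessarily outside the Rlc. Deciding that needs a comparison against the whole complex rather than a single cutoff.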

There is also a new R. laguerreae, but it is just another version of the type strain under a different name. There is a strain, R. sp. WYCCWR11317, that is a new member of gsS. There is a corrected UPM1135. If anybody knows of other new accessions within the Rlc, or is aware of important new genomes that are just about to be made public, please let me know.

I hope I still have some readers out there to answer these questions, because this is a project that is important for the whole community of researchers who study R. leguminosarum and its relatives, and I would like to create a publication that will have wide support. I look forward to being overwhelmed by all your comments!