Getting lucky with plasmid sequencing

Small plasmids still comprise a sequencing “no-man’s land” – too small to justify the ridiculous coverage depth provided even from a MiSeq run, and too large to justify (in most cases) the time, money, and frustration spent trying to primer walk around them. A single PacBio SMRT cell can theoretically provide ample coverage of hundreds of small plasmids simultaneously, but probably requires several different library preps.

So when Komal was trying to diagnose why some Shewanella strains engineered to produce ethanol from glycerol evolved under anaerobic conditions, we thought the best place to start was to sequence the ~14-kb expression vector. Several former graduate students and postdocs tried, but none were successful. Time for a new approach.

The key was not only to determine the actual plasmid sequence, but also to understand how it changed under different growth conditions. We considered doing PacBio with barcoded samples, but assembly was still an open question given the sequence similarity between the plasmids. A MiSeq run looked more promising, but was still going to cost ~$500 per plasmid.

Enter Chi: some Geobacter mutants and a few unrelated plasmids also needed resequencing, and so now we were down to $130 per sample, including library prep. Here’s what we did:

1. Estimate the target genome resequencing coverage from the throughput of a single MiSeq 2 x 250 lane and the reference genome size, then use that to decide on the number of barcodes. With 6 barcodes we calculated ~150x coverage of the Geobacter genome (3.8 Mbp).

2. Assign one gDNA template and two (or more) sufficiently unrelated plasmids, preferably of different size, to each barcode. Ideally the gDNA should also be unrelated to the plasmids.

3. Pool the DNA for each barcode in a 1:1:1 molar ratio (at the recommendation of UMGC for Nextera library prep).
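
For anyone who wants to redo the arithmetic behind steps 1 and 3, here is a rough sketch. The lane yield and the size of the second plasmid are placeholders I made up for illustration (only the 3.8-Mbp genome and the ~14-kb vector come from the text above), and it assumes reads split evenly across barcodes and that an equimolar pool gives every replicon about the same depth, which is an idealization.

```python
# Assumed values for illustration only; substitute your own run's numbers.
ASSUMED_LANE_YIELD_BP = 3.5e9   # hypothetical usable yield after QC, not a figure from this post
GENOME_BP = 3_800_000           # Geobacter genome size quoted in step 1
PLASMIDS_BP = [14_000, 6_000]   # ~14-kb vector from the post; the 6-kb plasmid is made up


def expected_depth(lane_yield_bp, n_barcodes, replicon_lengths_bp):
    """Mean depth per replicon, assuming an even split across barcodes
    and equimolar pooling of the replicons within each barcode."""
    per_barcode_bp = lane_yield_bp / n_barcodes
    # Equal copy numbers mean equal depth for every replicon:
    # total bases divided by the summed length of one copy of each.
    return per_barcode_bp / sum(replicon_lengths_bp)


def mass_fractions(replicon_lengths_bp):
    """Share of the pooled DNA mass contributed by each replicon in an
    equimolar pool (mass scales with length)."""
    total = sum(replicon_lengths_bp)
    return [length / total for length in replicon_lengths_bp]


lengths = [GENOME_BP] + PLASMIDS_BP
print(f"depth per replicon: {expected_depth(ASSUMED_LANE_YIELD_BP, 6, lengths):.0f}x")
print("mass fractions:", [f"{f:.4f}" for f in mass_fractions(lengths)])
```

The takeaway is that a couple of small plasmids barely dent the genome coverage, while the plasmids themselves make up a tiny fraction of the pooled mass, so careful quantification matters when pooling.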

Bioinformatically, we had a few options. We could do the resequencing first, filter the resulting BAM file for unmapped reads, and assemble that subset. I thought SPAdes would be a good option, but I ran into problems when it attempted to assemble a microbe-sized genome out of a total of ~20 kb worth of plasmid.
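
To make the "filter for unmapped reads" step concrete: in practice this is usually done with `samtools view -f 4` (or `-f 12` to require that both mates are unmapped). The sketch below does the same flag test in plain Python on SAM-formatted text lines, just to show what is being selected. The function name and the toy records are mine, not part of our actual pipeline.

```python
UNMAPPED = 0x4        # SAM flag bit: this read is unmapped
MATE_UNMAPPED = 0x8   # SAM flag bit: the mate of this read is unmapped


def unmapped_read_names(sam_lines, require_both_mates=True):
    """Return the names of reads whose SAM flag marks them as unmapped.

    With require_both_mates=True, only pairs where neither mate mapped
    are kept, which is the subset most likely to be pure plasmid.
    """
    wanted = (UNMAPPED | MATE_UNMAPPED) if require_both_mates else UNMAPPED
    names = set()
    for line in sam_lines:
        # Skip blank lines and the @-prefixed header section.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2 of a SAM record is the bitwise flag
        if (flag & wanted) == wanted:
            names.add(fields[0])
    return names


# Toy records (hypothetical): r1 is a fully unmapped pair, r2 is mapped,
# r3 is unmapped but its mate mapped.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r1\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t99\tchr\t100\t60\t4M\t=\t300\t204\tACGT\tIIII",
    "r3\t133\tchr\t100\t0\t*\t=\t100\t0\tACGT\tIIII",
]
print(sorted(unmapped_read_names(toy_sam)))
```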

A much better option: after read QC and adapter trimming, I dumped the entire read set into SPAdes, using the complete Geobacter reference genome sequence as a trusted contig. This approach worked splendidly, as the assembly came back in only 23 pieces, with only a handful as potential plasmids:

[Figure: spades_results]
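
For reference, the SPAdes call looks roughly like the one assembled below. All file names are placeholders (I am not reproducing our exact paths), and I have left out every optional setting; the only point is where the reference genome goes. The command is built as an argument list so it can be inspected before being handed to `subprocess.run(cmd, check=True)`.

```python
import shlex


def spades_trusted_contigs_cmd(reads_1, reads_2, reference_fasta, out_dir):
    """Build a SPAdes command line that passes a finished reference genome
    as trusted contigs alongside the trimmed paired-end reads."""
    return [
        "spades.py",
        "-1", reads_1,                         # forward reads (after QC and trimming)
        "-2", reads_2,                         # reverse reads
        "--trusted-contigs", reference_fasta,  # complete reference genome
        "-o", out_dir,                         # output directory
    ]


# Placeholder file names, for illustration only.
cmd = spades_trusted_contigs_cmd(
    "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz", "geobacter_ref.fasta", "spades_out"
)
print(shlex.join(cmd))
```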

But which of the smaller scaffolds were actually our plasmids? I ran nucmer with the SPAdes scaffolds as the query, and the Geobacter genome as the reference, and simply looked for scaffolds (i.e., nodes) that did not share any nucleotide identity. This approach only works if the plasmid is completely unrelated to the reference organism, which was the case for our samples. In this example, NODE_9 and NODE_11 came back as our plasmids.
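
The "no shared identity" check is easy to script. The sketch below assumes the alignments were exported with `show-coords -T -H` after running nucmer, so that each line is tab-delimited with the reference and query IDs in the last two columns; if you add other show-coords options, check that this still holds. The scaffold names and coordinates in the toy input are invented.

```python
def fasta_ids(fasta_lines):
    """Sequence IDs (first word of each header line) from FASTA text."""
    return [line[1:].split()[0] for line in fasta_lines if line.startswith(">")]


def scaffolds_without_hits(scaffold_ids, coords_lines):
    """Scaffolds that never appear as a query in the nucmer alignments.

    Assumes tab-delimited, header-free coords lines whose final column
    is the query (scaffold) ID.
    """
    hit = set()
    for line in coords_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:   # not an alignment row (blank or malformed line)
            continue
        hit.add(fields[-1])
    return [s for s in scaffold_ids if s not in hit]


# Invented example: two scaffolds align to the reference, one does not.
toy_fasta = [">NODE_1_example", "ACGT", ">NODE_2_example", "ACGT", ">NODE_3_example", "ACGT"]
toy_coords = [
    "1\t5000\t1\t5000\t5000\t5000\t99.98\tref_chromosome\tNODE_1_example",
    "5001\t9000\t4000\t1\t4000\t4000\t100.00\tref_chromosome\tNODE_2_example",
]
print(scaffolds_without_hits(fasta_ids(toy_fasta), toy_coords))
```

Anything this returns is only a plasmid candidate; short contaminant or low-coverage scaffolds will show up here too, so it is worth checking length and coverage before getting excited.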

Even better, they were circular when checked with a dot plot:

[Figure: plasmid_dotplot]

Breseq confirmed that the plasmid assemblies were correct, and after some manual curation (circularization and reorientation), prokka made quick work of the annotation.
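
The manual curation can also be roughed out in a few lines. This is not what I actually did (I worked from the dot plot); it is a simplified stand-in that assumes a circular scaffold shows up as a linear sequence whose end exactly repeats its start, trims that repeat, and then rotates the sequence to begin at a chosen anchor such as the start of a gene. The function names, the `min_len` default, and the toy sequence are all made up for illustration.

```python
def terminal_overlap(seq, min_len=20):
    """Length of the longest suffix of seq that exactly matches its prefix,
    or 0 if no such overlap of at least min_len exists."""
    for k in range(len(seq) // 2, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def circularize(seq, min_len=20):
    """Drop the redundant copy of a terminal overlap, if there is one."""
    k = terminal_overlap(seq, min_len)
    return seq[:-k] if k else seq


def rotate_to(seq, anchor):
    """Rotate a circular sequence so that it starts at the anchor.

    Searching the doubled sequence lets the anchor span the old junction.
    """
    i = (seq + seq).find(anchor)
    if i < 0:
        raise ValueError("anchor not found in sequence")
    i %= len(seq)
    return seq[i:] + seq[:i]


# Toy example: a 24-bp "plasmid" whose first 10 bases are repeated at the end.
unit = "ATGCGTACGTTAGCCTAGGATCCA"
linear = unit + unit[:10]
circular = circularize(linear, min_len=5)
print(circular)
print(rotate_to(circular, "TAGCC"))
```

Exact matching will miss overlaps that contain a sequencing error, so a dot plot or an alignment of the scaffold against itself is still the safer check.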

So, bottom line: if you have some plasmids you’ve been meaning to sequence, and you have a quality reference genome for resequencing some engineered or evolved strains of interest, a MiSeq lane and the --trusted-contigs option in SPAdes might just be your new best friends!