Getting microbial genome (re)sequencing right

Whether you’re knocking out genes or hunting for SNPs, it’s pretty hard to do bacterial physiology without reliable reference genomes. These days, a few hundred bucks gets you a decent-quality draft genome, and if you’re OK with it being in a few dozen pieces, you might think you’re all set.

Two reasons why settling for a draft is a bad idea:

  • Draft genomes suck.
  • Complete, finished-quality genomes are now cost-competitive with drafts, and can be had for well under $1,000.

When I joined the Bond Lab, Daniel was frustrated that “next-gen” sequencing fell short when it came to assembling genomes from new isolates or finding elusive transposon junctions in Geobacter. And so I dove into this nebulous technology called PacBio, relying on Twitter as a primary source for help from a small but determined band of believers.

To be fair, though I can’t say enough good things about PacBio, the learning curve can still be steep if you’re new to the long-read game. So let me try to save you some time by summarizing my “best practices” for de novo microbial genome assembly and resequencing. Required disclaimer: these suggestions are my own, and you should consult with PacBio and/or Illumina about any specific experimental considerations.

  1. Get high-quality DNA from cells in stationary phase. Bastien will happily explain to you why this is important.
  2. Perform size selection on the PacBio library to increase your average read length. Lex Nederbragt has some good experience with this.
  3. Sequence to 100x depth or higher with PacBio reads, and to 50-100x with Illumina reads (they need not be paired-end). More on the Illumina reads later. The current P6 PacBio chemistry (with a 240-minute movie) routinely provides 150-200x coverage of a 4-Mbp bacterial genome from a single SMRT cell.
  4. Talk to whoever runs your compute cluster about installing SMRT Analysis, or fire up an Amazon EC2 instance.
  5. Run RS_Subreads to determine the read length cutoff that will provide 100x coverage to HGAP.
  6. Run HGAP_3, the latest (and fastest) version of the hierarchical genome assembly process, at 100x coverage using this length cutoff.
  7. Check for that pesky PacBio control plasmid and remove it from the assembly.
  8. Check chromosome(s) and plasmid(s) individually for circularity with a dotplot. Trim ends to circularize, reorient contigs (if desired), and import to SMRT Analysis as a new reference genome.
  9. Run base modification and motif detection, which runs a second Quiver polish by default.
  10. Iteratively run RS_Resequencing until PacBio consensus accuracy is >99.999% (QV 50). In reality, you may still have quite a few indels that Quiver left behind. If you’re OK with that, the PacBio consensus is already very high quality and you’re good to go. If every base matters and you need something more polished, read on.
  11. Map short reads to the PacBio consensus (I use bowtie2) and generate a BAM file for Pilon.
  12. Run Pilon to polish the remaining indels.
  13. Submit your shiny new genome to NCBI for annotation via the Prokaryotic Genome Annotation Pipeline.
  14. Upload raw PacBio data (one .bas.h5 file, three .bax.h5 files, and one .metadata.xml file), as well as base modification data, to the Sequence Read Archive. NCBI does not currently support linking metadata.xml files, but they assure me they are working on it.
  15. (For resequencing only) Compare your polished assembly to a published reference with nucmer or a similar tool to identify SNPs and structural variants.
  16. (For resequencing only) Use RATT to transfer an existing annotation onto a resequenced strain with new genomic coordinates.
  17. Write up a Genome Announcement and flood the literature with more finished microbial genomes!
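
To make step 5 a bit more concrete, here is a minimal Python sketch of the arithmetic behind picking a read length cutoff. It is not SMRT Analysis code: the function name and the toy numbers are made up for illustration, and it assumes you have already pulled a list of subread lengths out of your RS_Subreads results. The idea is simply to keep the longest reads and stop once they add up to the coverage you want.

```python
def length_cutoff_for_coverage(read_lengths, genome_size, target_coverage=100):
    """Return the largest length cutoff whose surviving reads still reach
    the target coverage, or None if the data set is too small.

    Hypothetical helper for illustration; not part of SMRT Analysis.
    """
    needed_bases = genome_size * target_coverage
    total = 0
    # Walk from the longest read down, accumulating bases as we go.
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= needed_bases:
            # Every read at least this long is now counted.
            return length
    return None  # not enough sequence to reach the target


# Toy example: a 10-kbp "genome" and a 3x target (30,000 bases needed).
reads = [9000, 8000, 7000, 6000, 5000, 4000, 3000]
cutoff = length_cutoff_for_coverage(reads, genome_size=10_000, target_coverage=3)
print(cutoff)
```

For a real run you would pass your estimated genome size (say 4,000,000) and leave the target at 100. If the function returns None, you don't have 100x of data to begin with and should sequence more or lower the target.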

We followed this approach with Geoalkalibacter subterraneus – read about it here.
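
A footnote on step 10: the QV 50 figure is just the Phred scale, QV = -10 × log10(error rate). The short Python sketch below (function names are my own, purely illustrative) converts between accuracy and QV and shows why a QV 50 consensus can still hide a few dozen errors in a typical bacterial genome.

```python
import math


def accuracy_to_qv(accuracy):
    """Convert a consensus accuracy (e.g. 0.99999) to a Phred-scaled QV."""
    error_rate = 1.0 - accuracy
    if error_rate <= 0:
        raise ValueError("accuracy must be below 1.0 for a finite QV")
    return -10 * math.log10(error_rate)


def expected_errors(qv, genome_size):
    """Expected number of consensus errors for a genome at a given QV."""
    return genome_size * 10 ** (-qv / 10)


qv = accuracy_to_qv(0.99999)              # approximately 50
errors = expected_errors(50, 4_000_000)   # approximately 40 for a 4-Mbp genome
print(round(qv, 2), round(errors, 1))
```

So for a 4-Mbp genome, hitting QV 50 still leaves roughly 40 expected errors, which is why the short-read polish in steps 11 and 12 is worth doing when every base matters.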