Troubleshooting common errors in assemblies of long-read metagenomes
ORCID: https://orcid.org/0000-0002-4933-2896, Sachdeva, Rohan, Banfield, Jillian F and Eren, A Murat
ORCID: https://orcid.org/0000-0001-9013-4827
;
Assessing the accuracy of long-read assemblies, especially from complex environmental metagenomes that include underrepresented organisms, is challenging. Here we benchmark four state-of-the-art long-read assembly software programs, HiCanu, hifiasm-meta, metaFlye and metaMDBG, on 21 PacBio HiFi metagenomes spanning mock communities, gut microbiomes and ocean samples. By quantifying read clipping events, in which long reads are systematically split during mapping to maximize the agreement with assembled contigs, we identify where assemblies diverge from their source reads. Our analyses reveal that long-read metagenome assemblies can include >40 errors per 100 million base pairs of assembled contigs, including multi-domain chimeras, prematurely circularized sequences, haplotyping errors, excessive repeats and phantom sequences. We provide an open-source tool and a reproducible workflow for rigorous evaluation of assembly errors, charting a path toward more reliable genome recovery from long-read metagenomes.
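The clipping signal described above can be illustrated with a minimal sketch. This is not the authors' tool; it only assumes that reads have been mapped back to the contigs and that each alignment is available as a (contig, 1-based position, CIGAR string) record, as in the SAM format. The function names, the `min_clip` threshold of 100 bases and the breakpoint convention are hypothetical choices made for illustration.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the contig


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, op) tuples."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def clipped_ends(cigar, min_clip=100):
    """Return (left, right) clipped lengths; clips below min_clip count as 0.

    Soft (S) and hard (H) clips are both counted. min_clip is an assumed
    threshold for ignoring short clips such as adapter remnants.
    """
    ops = parse_cigar(cigar)

    def leading_clip(items):
        total = 0
        for n, op in items:
            if op not in "SH":
                break
            total += n
        return total

    left = leading_clip(ops)
    right = leading_clip(reversed(ops))
    return (left if left >= min_clip else 0,
            right if right >= min_clip else 0)


def clip_positions(records, min_clip=100):
    """Count clipping events per (contig, position).

    A left clip is assigned to the first aligned base, a right clip to the
    last aligned base (an assumed convention). Positions where many reads
    clip are candidate sites where the contig diverges from its reads.
    """
    counts = Counter()
    for contig, pos, cigar in records:
        left, right = clipped_ends(cigar, min_clip)
        ref_len = sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)
        if left:
            counts[(contig, pos)] += 1
        if right:
            counts[(contig, pos + ref_len - 1)] += 1
    return counts


if __name__ == "__main__":
    demo = [
        ("contig_1", 1000, "500S1000M"),  # read clipped on the left at 1000
        ("contig_1", 1, "1000M300S"),     # read clipped on the right at 1000
        ("contig_2", 50, "20S100M"),      # clip too short to count
    ]
    for (contig, pos), n in clip_positions(demo).items():
        print(contig, pos, n)
```

In this toy input, two reads stop agreeing with `contig_1` at the same coordinate, which is the kind of recurring clipping site that would merit closer inspection; a single clipped read on its own would be weak evidence.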
