Speeding Up Genome Assembly, from Months to Minutes

Human chromosomes. Courtesy of Jane Ades, NHGRI

Genomes are like the biological owner’s manual for all living things. Cells read DNA instantaneously, getting instructions necessary for an organism to grow, function and reproduce. But for humans, deciphering this “book of life” is significantly more difficult.

Nowadays, researchers typically rely on next-generation sequencers to translate the unique sequences of DNA bases (there are only four) into letters: A, G, C and T. While DNA strands can be billions of bases long, these machines produce very short reads, about 50 to 300 characters at a time. To extract meaning from these letters, scientists need to reconstruct portions of the genome — a process akin to rebuilding the sentences and paragraphs of a book from snippets of text.

But this process can quickly become complicated and time-consuming, especially because some genomes are enormous. For example, while the human genome contains about 3 billion bases, the wheat genome contains nearly 17 billion bases and the pine genome contains about 23 billion bases. Sequencers also sometimes introduce errors into the dataset, and these need to be filtered out. And most of the time, the genomes need to be assembled de novo, or from scratch. Think of it like putting together a 10-billion-piece jigsaw puzzle without a complete picture to reference.

By applying novel algorithms, computational techniques and the innovative programming language Unified Parallel C (UPC) to the cutting-edge de novo genome assembly tool Meraculous, a team of scientists from Lawrence Berkeley National Laboratory’s (Berkeley Lab’s) Computational Research Division (CRD), the Joint Genome Institute (JGI) and UC Berkeley simplified and sped up genome assembly, reducing a months-long process to mere minutes. This was achieved primarily by “parallelizing” the code to harness the processing power of supercomputers, such as the National Energy Research Scientific Computing Center’s (NERSC’s) Edison system. Put simply, parallelizing code means taking tasks that were once executed one by one and modifying or rewriting the code so that they run at the same time across the many nodes (processor clusters) of a supercomputer.
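As a rough illustration of what splitting up tasks looks like in practice, the Python sketch below counts short substrings (often called k-mers) across a list of reads, first one read at a time and then with the reads divided among several workers whose partial results are merged. This is a toy example and not the UPC code used in Meraculous: the function names, the choice of k-mer counting as the task and the use of threads are all assumptions made for illustration. Python threads show the split-and-merge structure only; a real speedup needs separate processes or, as in the article, many supercomputer nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_kmers(reads, k):
    """Count every length-k substring in a list of reads, one read at a time."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def count_kmers_parallel(reads, k, workers=4):
    """Split the reads into chunks, count each chunk in its own worker, then merge."""
    # Deal the reads out round-robin so every worker gets a similar share.
    chunks = [reads[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_kmers, chunks, [k] * workers))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts together
    return total
```

Both functions return the same counts for the same input; only the way the work is divided differs.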

“Using the parallelized version of Meraculous, we can now assemble the entire human genome in about eight minutes using 15,360 computer processor cores. With this tool, we estimate that the output from the world’s biomedical sequencing capacity could be assembled using just a portion of NERSC’s Edison supercomputer,” says Evangelos Georganas, a UC Berkeley graduate student who led the effort to parallelize Meraculous. He is also the lead author of a paper published and presented at the SC Conference in November 2014.

“This work has dramatically improved the speed of genome assembly,” says Leonid Oliker, a computer scientist in CRD. “The new parallel algorithms enable assembly calculations to be performed rapidly, with near linear scaling over thousands of cores. Now genomics researchers can assemble large genomes like wheat and pine in minutes instead of months using several hundred nodes on NERSC’s Edison.”

Supercomputers: A Game Changer for Assembly

High-throughput and relatively low-cost next-generation DNA sequencers have allowed researchers to look for biological solutions to everything from generating clean energy and cleaning up the environment to identifying connections between genetic mutations and cancer. For the most part, these machines are very accurate at recording the sequence of DNA bases. But sometimes errors such as substitutions, repetitions, transpositions and omissions do occur, akin to “typos” in a book. These errors complicate analysis by making it harder to assemble genomes and identify genetic mutations. They can also lead researchers to misinterpret the function of a gene.

One technique that researchers often use to identify errors is called shotgun sequencing. This involves taking numerous copies of a DNA strand, breaking them up randomly into many smaller pieces and then sequencing each piece separately. This produces a number of overlapping short reads that allow scientists to eventually reassemble the whole DNA strand. Sequencing numerous copies of the same DNA strand also helps identify errors. But for a particularly complex genome, this process also generates a tremendous amount of data, sometimes several terabytes.
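The reason that sequencing many copies helps identify errors can be sketched in a few lines of Python. In the toy example below, overlapping reads are stacked up at their positions along the strand and the most common base at each position wins; any base that disagrees with that majority is flagged as a likely error. The reads are assumed to be already placed at known offsets, which a real assembler has to work out for itself, and the function names and sample data are hypothetical.

```python
from collections import Counter


def consensus(aligned_reads, length):
    """Majority-vote base at each position from (offset, read) pairs."""
    columns = [Counter() for _ in range(length)]
    for offset, read in aligned_reads:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # Positions that no read covers are reported as "N" (unknown).
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


def likely_errors(aligned_reads, consensus_seq):
    """Return (read_index, position, base) for bases that disagree with the consensus."""
    errors = []
    for idx, (offset, read) in enumerate(aligned_reads):
        for i, base in enumerate(read):
            if base != consensus_seq[offset + i]:
                errors.append((idx, offset + i, base))
    return errors


# Hypothetical overlapping reads from copies of the strand "ACGTTGCA".
# The third read carries one substituted base (A instead of T at position 4).
reads = [(0, "ACGTT"), (2, "GTTGC"), (3, "TAGCA"), (3, "TTGCA")]
consensus_seq = consensus(reads, 8)
errors = likely_errors(reads, consensus_seq)
```

Because three of the four reads covering position 4 agree, the lone dissenting base is outvoted. With only one copy of the strand there would be nothing to compare it against.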

To identify errors in this data quickly and effectively, the Berkeley Lab and UC Berkeley team relied on “Bloom filters” and massively parallel supercomputers. Conceived by Burton H. Bloom in 1970, Bloom filters are very efficient at recognizing whether or not an element is a member of a set. Thus, researchers can rely on this tool to tell them if a base is out of place and is likely a mistake. Because a Bloom filter’s underlying structure is a bit array, it also requires relatively little memory, making it ideal for querying massive datasets.
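A minimal single-machine Bloom filter can be sketched in Python as follows. It stores nothing but a bit array: adding an item sets a few bits chosen by hash functions, and a membership test checks whether those bits are all set. The test can occasionally answer yes for an item that was never added (a false positive), but it never answers no for an item that was. The second function shows one common way such a filter is used in assembly pipelines, which is an assumption here and not a detail from the article: short substrings (k-mers) that appear only once in the reads are treated as likely sequencing errors and are never stored in full. The class name, sizes and hashing scheme are illustrative, and this is not the team’s distributed-memory implementation.

```python
import hashlib


class BloomFilter:
    """A set-membership sketch backed by a fixed-size bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def repeated_kmers(reads, k, bloom):
    """Collect k-mers seen at least twice; singletons stay only as bits in the filter."""
    repeated = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in bloom:
                repeated.add(kmer)   # seen before (or, rarely, a false positive)
            else:
                bloom.add(kmer)      # first sighting: remember it cheaply
    return repeated
```

The memory cost is fixed by `num_bits` no matter how long the stored items are, which is why the structure suits very large datasets. A larger bit array lowers the false-positive rate.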

“Applying Bloom filters to this part of the genome assembly problem is not new; it has been done before. What we have done differently is to get Bloom filters to work with distributed memory systems,” says Aydin Buluç, a research scientist in CRD. “This task was not trivial; it required some computing expertise to accomplish.”

The team also developed solutions for parallelizing data input and output (I/O). “When you have several terabytes of data, just getting the computer to read your data and output results can be a huge bottleneck,” says Steven Hofmeyr, a research scientist in CRD who developed these solutions. “By allowing the computer to download the data in multiple threads, we were able to speed up the I/O process from hours to minutes.”
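The general idea of reading a large file in several pieces at once can be sketched in Python as below. Each worker opens the file, jumps to its own byte range and reads only that range, and the pieces are then joined in order. This is a simple thread-based illustration under assumed names, not the I/O solution the team developed, and any real gain depends on the storage system being able to serve several readers at the same time.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_range(path, start, size):
    """Read up to `size` bytes beginning at byte offset `start`."""
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(size)


def parallel_read(path, workers=4):
    """Read a file as byte ranges fetched concurrently; return the bytes in order."""
    total = os.path.getsize(path)
    chunk = max(1, -(-total // workers))  # ceiling division, at least 1 byte
    starts = range(0, total, chunk)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as `starts`.
        parts = pool.map(lambda start: read_range(path, start, chunk), starts)
        return b"".join(parts)
```

A sequence file split this way would still need care at the chunk boundaries, since a read may straddle two ranges. The sketch leaves that step out.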
