Now Reading
Speeding Up Genome Assembly, from Months to Minutes

Speeding Up Genome Assembly, from Months to Minutes

Human chromosomes. Courtesy of Jane Ades, NHGRI
Human chromosomes. Courtesy of Jane Ades, NHGRI

Genomes are like the biological owner’s manual for all living things. Cells read DNA instantaneously, getting instructions necessary for an organism to grow, function and reproduce. But for humans, deciphering this “book of life” is significantly more difficult.

Nowadays, researchers typically rely on next-generation sequencers to translate the unique sequences of DNA bases (there are only four) into letters: A, G, C and T. While DNA strands can be billions of bases long, these machines produce very short reads, about 50 to 300 characters at a time. To extract meaning from these letters, scientists need to reconstruct portions of the genome — a process akin to rebuilding the sentences and paragraphs of a book from snippets of text.

But this process can quickly become complicated and time-consuming, especially because some genomes are enormous. For example, while the human genome contains about 3 billion bases, the wheat genome contains nearly 17 billion bases and the pine genome contains about 23 billion bases. Sometimes the sequencers will also introduce errors into the dataset, which need to be filtered out. And most of the time, the genomes need to be assembled de novo, or from scratch. Think of it like putting together a ten billion-piece jigsaw puzzle without a complete picture to reference.

By applying some novel algorithms, computational techniques and the innovative programming language Unified Parallel C (UPC) to the cutting-edge de novo genome assembly tool Meraculous, a team of scientists from the Lawrence Berkeley National Laboratory (Berkeley Lab)’s Computational Research Division (CRD), Joint Genome Institute (JGI) and UC Berkeley, simplified and sped up genome assembly, reducing a months-long process to mere minutes. This was primarily achieved by “parallelizing” the code to harness the processing power of supercomputers, such as the National Energy Research Scientific Computing Center’s (NERSC’s) Edison system. Put simply, parallelizing code means splitting up tasks once executed one-by-one and modifying or rewriting the code to run on the many nodes (processor clusters) of a supercomputer all at once.

“Using the parallelized version of Meraculous, we can now assemble the entire human genome in about eight minutes using 15,360 computer processor cores. With this tool, we estimate that the output from the world’s biomedical sequencing capacity could be assembled using just a portion of NERSC’s Edison supercomputer,” says Evangelos Georganas, a UC Berkeley graduate student who led the effort to parallelize Meraculous. He is also the lead author of a paper published and presented at the SC Conference in November 2014.

“This work has dramatically improved the speed of genome assembly,” says Leonid Oliker computer scientist in CRD. “The new parallel algorithms enable assembly calculations to be performed rapidly, with near linear scaling over thousands of cores. Now genomics researchers can assemble large genomes like wheat and pine in minutes instead of months using several hundred nodes on NERSC’s Edison.”

Supercomputers: A Game Changer for Assembly

High throughput and relatively low cost next-generation DNA sequencers have allowed researchers to look for biological solutions to everything from generating clean energy and environmental cleanup to identifying connections between genetic mutations and cancer. For the most part, these machines are very accurate at recording the sequence of DNA bases. But sometimes errors such as substitutions, repetitions, transpositions and omissions do occur — akin to “typos” in a book. These errors complicate analysis by making it harder to assemble genomes and identify genetic mutations. They can also lead researchers to misinterpret the function of a gene.

One technique that researchers often use to identify errors is called shotgun sequencing. This involves taking numerous copies of a DNA strand, breaking it up randomly into numerous smaller pieces and then sequencing each piece separately. This produces a number of overlapping short reads that allow scientists to eventually reassemble the whole DNA strand. Sequencing numerous copies of the same DNA strand also helps identify errors. But for a particularly complex genome, this process also generates a tremendous amount of data, sometimes several terabytes.

To identify errors in this data quickly and effectively, the Berkeley Lab and UC Berkeley team relied onBloom filters” and massively parallel supercomputers. Conceived by Burton H. Bloom in 1970, Bloom filters are very efficient at recognizing whether or not an element is a member of the set. Thus, researchers can rely on this tool to tell them if a base is out of place and is likely a mistake.  Because bit arrays comprise a Bloom filter’s underlying structure, they also require relatively little memory, making them ideal for querying massive datasets.

“Applying Bloom filters to this part of the genome assembly problem is not new, it has been done before. What we have done differently is to get Bloom filters to work with distributed memory systems,” says Aydin Buluç, a research scientist in CRD.  “This task was not trivial, it required some computing expertise to accomplish.”

The team also developed solutions for parallelizing data input and output (I/O). “When you have several terabytes of data, just getting the computer to read your data and output results can be a huge bottleneck,” says Steven Hofmeyr, a research scientist in CRD who developed these solutions. “By allowing the computer to download the data in multiple threads, we were able to speed up the I/O process from hours to minutes.”

See Also

Read more: Speeding Up Genome Assembly, from Months to Minutes

 

The Latest on: Genome Assembly

[google_news title=”” keyword=”Genome Assembly” num_posts=”10″ blurb_length=”0″ show_thumb=”left”]

via Google News

 

The Latest on: Genome Assembly

via  Bing News

 

What's Your Reaction?
Don't Like it!
0
I Like it!
0
Scroll To Top