Genomes are like the biological owner’s manual for all living things. Cells read DNA instantaneously, getting instructions necessary for an organism to grow, function and reproduce. But for humans, deciphering this “book of life” is significantly more difficult.
Nowadays, researchers typically rely on next-generation sequencers to translate the unique sequences of DNA bases (there are only four) into letters: A, G, C and T. While DNA strands can be billions of bases long, these machines produce very short reads, about 50 to 300 characters at a time. To extract meaning from these letters, scientists need to reconstruct portions of the genome — a process akin to rebuilding the sentences and paragraphs of a book from snippets of text.
But this process can quickly become complicated and time-consuming, especially because some genomes are enormous. For example, while the human genome contains about 3 billion bases, the wheat genome contains nearly 17 billion bases and the pine genome contains about 23 billion bases. Sequencers also sometimes introduce errors into the dataset, which need to be filtered out. And most of the time, the genomes need to be assembled de novo, or from scratch. Think of it like putting together a ten-billion-piece jigsaw puzzle without a complete picture to reference.
By applying novel algorithms, computational techniques and the programming language Unified Parallel C (UPC) to the de novo genome assembly tool Meraculous, a team of scientists from the Lawrence Berkeley National Laboratory (Berkeley Lab)’s Computational Research Division (CRD), the Joint Genome Institute (JGI) and UC Berkeley simplified and sped up genome assembly, reducing a months-long process to mere minutes. This was primarily achieved by “parallelizing” the code to harness the processing power of supercomputers, such as the National Energy Research Scientific Computing Center’s (NERSC’s) Edison system. Put simply, parallelizing code means splitting up tasks that were once executed one by one and modifying or rewriting the code so that they run simultaneously on the many nodes (processor clusters) of a supercomputer.
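To make the idea of parallelizing concrete, here is a minimal Python sketch. It is not the team's UPC code: the function names are hypothetical, the task (counting bases) is chosen only because it is easy to check, and Python threads stand in for the nodes of a supercomputer. The pattern is the one described above: split the work into chunks, process the chunks at the same time, then merge the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_bases(reads):
    """Count how often each base letter occurs in one chunk of reads."""
    counts = Counter()
    for read in reads:
        counts.update(read)  # a string is counted character by character
    return counts


def split_into_chunks(items, n_chunks):
    """Deal items round-robin into n_chunks lists, one per worker."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def parallel_count_bases(reads, n_workers=4):
    """Run count_bases on every chunk concurrently, then merge the results."""
    chunks = split_into_chunks(reads, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_bases, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the per-chunk counts together
    return total


reads = ["ACGT", "GGCA", "TTAC", "CAGT", "ACCA"]
serial = count_bases(reads)
parallel = parallel_count_bases(reads, n_workers=3)
print(parallel == serial)  # the parallel result matches the one-by-one result
```

The merge step is what makes this decomposition safe: each worker touches only its own chunk, so no coordination is needed until the partial counts are combined.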
“Using the parallelized version of Meraculous, we can now assemble the entire human genome in about eight minutes using 15,360 computer processor cores. With this tool, we estimate that the output from the world’s biomedical sequencing capacity could be assembled using just a portion of NERSC’s Edison supercomputer,” says Evangelos Georganas, a UC Berkeley graduate student who led the effort to parallelize Meraculous. He is also the lead author of a paper published and presented at the SC Conference in November 2014.
“This work has dramatically improved the speed of genome assembly,” says Leonid Oliker, a computer scientist in CRD. “The new parallel algorithms enable assembly calculations to be performed rapidly, with near linear scaling over thousands of cores. Now genomics researchers can assemble large genomes like wheat and pine in minutes instead of months using several hundred nodes on NERSC’s Edison.”
Supercomputers: A Game Changer for Assembly
High-throughput and relatively low-cost next-generation DNA sequencers have allowed researchers to look for biological solutions to everything from generating clean energy and cleaning up the environment to identifying connections between genetic mutations and cancer. For the most part, these machines are very accurate at recording the sequence of DNA bases. But sometimes errors such as substitutions, repetitions, transpositions and omissions do occur, akin to “typos” in a book. These errors complicate analysis by making it harder to assemble genomes and identify genetic mutations. They can also lead researchers to misinterpret the function of a gene.
One technique that researchers often use to identify errors is called shotgun sequencing. This involves taking numerous copies of a DNA strand, breaking them up randomly into many smaller pieces and then sequencing each piece separately. This produces a large number of overlapping short reads that allow scientists to eventually reassemble the whole DNA strand. Because the same stretch of DNA is sequenced many times, a base that disagrees with the other copies stands out as a likely error. But for a particularly complex genome, this process also generates a tremendous amount of data, sometimes several terabytes.
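How overlapping reads allow a strand to be reassembled can be shown with a toy example. The Python sketch below greedily merges the two fragments that overlap the most until nothing more can be joined. It is a textbook simplification with hypothetical function names, not the algorithm used by Meraculous, and it assumes error-free reads that all come from the same strand orientation.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    fragments = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(fragments) > 1:
        best = (0, None, None)
        for i, left in enumerate(fragments):
            for j, right in enumerate(fragments):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left; the remaining fragments stay separate
        merged = fragments[i] + fragments[j][size:]
        fragments = [f for k, f in enumerate(fragments) if k not in (i, j)]
        fragments.append(merged)
    return fragments


# Three overlapping reads taken from the made-up strand ATGCGTACGTTAGC.
reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGC']
```

Comparing every pair of fragments is far too slow for billions of reads, which is part of why assembly at genome scale needs the specialized algorithms and hardware described in this article.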
To identify errors in this data quickly and effectively, the Berkeley Lab and UC Berkeley team relied on “Bloom filters” and massively parallel supercomputers. Conceived by Burton H. Bloom in 1970, Bloom filters are very efficient at recognizing whether or not an element is a member of a set. Thus, researchers can rely on this tool to tell them if a base is out of place and is likely a mistake. Because a Bloom filter’s underlying structure is a simple bit array, it also requires relatively little memory, making it ideal for querying massive datasets.
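A minimal Bloom filter fits in a few lines of Python. The sketch below is illustrative only: the class, its parameters and the use of SHA-256 as the hash are assumptions, and the article does not describe how the team's distributed-memory version works. The usage shown, keeping only short subsequences (k-mers) that appear at least twice and treating one-off k-mers as probable sequencing errors, is one common way such filters are applied and is likewise an assumption here.

```python
import hashlib


class BloomFilter:
    """Set-membership test backed by a bit array (illustrative sketch)."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)  # 8 bits per byte

    def _positions(self, item):
        # Derive n_hashes bit positions by salting one hash with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(read, k):
    """Yield every length-k substring of a read."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def repeated_kmers(reads, k, bloom):
    """Keep k-mers seen at least twice; first sightings only set filter bits."""
    kept = set()
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in bloom:
                kept.add(kmer)
            else:
                bloom.add(kmer)
    return kept


bloom = BloomFilter()
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]  # the third read differs by one base
kept = repeated_kmers(reads, k=4, bloom=bloom)
print(sorted(kept))
```

A Bloom filter never misses an item that was added, but it can occasionally report an item that was not (a false positive), so a rare one-off k-mer may slip into the kept set. The trade-off is memory: this filter uses 128 bytes no matter how many k-mers pass through it.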
“Applying Bloom filters to this part of the genome assembly problem is not new, it has been done before. What we have done differently is to get Bloom filters to work with distributed memory systems,” says Aydin Buluç, a research scientist in CRD. “This task was not trivial, it required some computing expertise to accomplish.”
The team also developed solutions for parallelizing data input and output (I/O). “When you have several terabytes of data, just getting the computer to read your data and output results can be a huge bottleneck,” says Steven Hofmeyr, a research scientist in CRD who developed these solutions. “By allowing the computer to download the data in multiple threads, we were able to speed up the I/O process from hours to minutes.”
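The general idea behind reading input in multiple threads can be sketched in Python as follows. This is not the team's implementation; the function names are hypothetical, and the sketch splits the file at arbitrary byte offsets, whereas a real tool would need to split on record boundaries so that no read is cut in half.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_range(path, start, length):
    """Open the file independently and read `length` bytes from `start`."""
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(length)


def parallel_read(path, n_workers=4):
    """Split the file into contiguous byte ranges and read them concurrently."""
    size = os.path.getsize(path)
    chunk = max(1, -(-size // n_workers))  # ceiling division
    ranges = [(start, min(chunk, size - start)) for start in range(0, size, chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the order of `ranges`, so joining
        # the parts reproduces the original byte order.
        parts = pool.map(lambda r: read_range(path, *r), ranges)
        return b"".join(parts)


# Demonstrate on a small temporary file standing in for sequencer output.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ACGT" * 1000)
    path = tmp.name
try:
    data = parallel_read(path, n_workers=4)
    print(len(data))  # 4000
finally:
    os.remove(path)
```

Each worker opens its own file handle, so the threads never compete for a shared file position. On a parallel file system, independent readers like these can draw on many disks at once.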