Next Generation Tools Aid Interdisciplinary Genome Research

By Amelia Jaycen, Publications Assistant, Office of Research and Economic Development

In 1953, James D. Watson and Francis Crick discovered the double-helix structure of DNA, a ribbon of genetic information that lives in each cell of a living organism. Later, in 1990, a group of organizations including the National Institutes of Health launched the Human Genome Project, a global collaborative effort to identify all the genes in human DNA. At the time, it was heralded as the largest investigative project in modern science, and it took 13 years and nearly $3 billion to yield a complete human genome.

The Human Genome Project, completed in 2003, was followed by a variety of other DNA research projects conducted by various organizations. The widespread study of DNA ushered in a “genomic revolution” characterized by constant technological advances in the fields of genetics and molecular biology. Nearly a decade later, its momentum is still steady as hundreds of new biological tools amass stores of genomic data.

DNA sequencing technology has enabled a rate of data growth that outpaces Moore’s Law by more than doubling each year.  Now, the next generation of biological researchers must find new ways to process the extraordinary amount of data available.

“Bioinformatics and computational biology is a field which has now come of age,” UNT bioinformatics professor Dr. Rajeev Azad said.  “When you have this large leap of data, you can dig deeper into many biological questions, and investigative computational methods come into play.” 

After obtaining a DNA sequence, a necessary task is to identify the regions within the sequence that code for proteins. Those regions are called genes. Identifying all the genes in a genome is nearly impossible to do with wet lab experiments alone, Azad said, which is why researchers rely on statistical and mathematical models to predict those features within genomes.
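To make the idea of predicting coding regions concrete, here is a minimal sketch of the simplest possible approach: scanning a sequence for open reading frames (a start codon followed, in the same reading frame, by a stop codon). This is only an illustration and not the statistical models Azad describes. The function name `find_orfs`, the `min_codons` threshold, and the toy sequence are assumptions made for this example, and only the forward strand is scanned.

```python
# Naive open-reading-frame scan: an illustrative stand-in for gene prediction.
# Real gene finders use statistical models; this only looks for ATG ... stop.

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(sequence, min_codons=3):
    """Return (start, end, orf) tuples for forward-strand ORFs.

    `start` is the index of the A in ATG, `end` is exclusive and includes
    the stop codon. `min_codons` (a hypothetical threshold) counts the
    start and stop codons as well.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] != START_CODON:
                i += 3
                continue
            # Walk codon by codon until a stop codon or the end of the sequence.
            j = i + 3
            while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                j += 3
            if j + 3 <= len(seq):  # a stop codon was found at position j
                if (j + 3 - i) // 3 >= min_codons:
                    orfs.append((i, j + 3, seq[i:j + 3]))
            i = j + 3  # continue scanning after this candidate
    return sorted(orfs)


# Made-up toy sequence, not real genomic data.
print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 'ATGAAATTTTAG')]
```

A scan like this finds many candidates that are not real genes and misses genes with unusual structure, which is one reason statistical models are needed for genomes of any real size.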

It is especially complex to identify genes in a human genome, but it is necessary because protein-coding genes can vary or mutate in ways that cause disease. So it is important to identify all of them, that is, to completely annotate the human genome.

This task is not yet completed, but as new technologies become available it is a goal within reach.

Bioinformatics at UNT

Bioinformatics is a growing field of research at universities around the globe with applications throughout the sub-disciplines of biological science.  UNT established bioinformatics offerings in 2010 with the addition of Dr. Qunfeng Dong as assistant professor in the Departments of Biological Sciences and Computer Science and Engineering.  He established the Dong Lab, a research group involved in collaborative bioinformatics research projects.

The Dong Lab’s current microbiota research is part of the National Institutes of Health (NIH) Human Microbiome Project (HMP), which has the goal of characterizing microbial communities found at multiple sites in the human body, looking for changes in that microbiome, and attempting to correlate those changes to human health issues. The Dong Lab’s HMP projects include characterizing lung microbiomes and male urethral microbiomes, microscopic communities of bacteria that can be identified using the well-known identifier gene, 16S rRNA.

Daniel Munro, National Merit Scholar, team member

“A lot of bioinformatics is about making computer tools for biologists to use,” Daniel Munro, senior bioinformatics student and Microbiota team member, said. “We collaborate with biologists who sequence data, so we receive raw sequences and perform statistical analysis of their data.”

The Dong Lab hosts a website that offers bioinformatics data analysis services, including project-specific websites where clients and involved researchers can track progress and retrieve results generated by Dong’s team. Bioinformatics coursework for students like Munro, who is a National Merit Scholar at UNT, includes biology, computer science, and statistical mathematics courses, and students receive degrees in biology with an emphasis on bioinformatics.

“We learn all the skills,” Munro said.  “We’re learning programming, and the computer science members are learning the biology.  And there are others, too.  We also deal with graphic representations of the data because results can be so complex they are hard to read without a visual representation.” 

Data analysis performed by the Dong Lab team involves the use of many different programs, several programming languages, and data obtained from various gene databases. In the UNT program, Munro has learned the statistical programming language “R,” used the “Mothur” program for sequence processing, and worked on development of the Multi Genome Synteny Viewer (mGSV), a tool for comparing multiple genomes at once.

Thanks to a global community of researchers who are like-minded about open access, bioinformatics analysts are also able to use online databases like the NCBI databank hosted by the National Center for Biotechnology Information. From the site, they can download whole or partial genomes. NCBI and other online sites offer not only access to genomes, but also opportunities to publish papers on annotations completed, software developed, and reviews of existing software. Scientists and collaborators can upload datasets to built-in programs or download material to run through their own computational models. UNT researchers can share their results with colleagues across the globe and download others’ research in turn.

However, computer files containing whole genomes are incredibly large.  Fortunately, UNT researchers have access to the Talon supercomputer — a processing machine with over 2200 processing cores and 200 terabytes of high-throughput storage space.  Talon is managed by the High Performance Computing Initiative (HPC) Services Team.

“Bioinformatics is a rapidly growing field that is using the power of ‘Big Data’ to investigate DNA base pairs and make correlational relationships,” HPC Services Manager Scott Yokel said. “The nature of Professor Dong's research involves thousands upon thousands of mostly really small computations with a few really large memory computations that utilize massive amounts of earlier generated data.”

Talon is used by researchers across campus, but the Dong Lab consistently tops the charts for storage use. Intensive computations are a staple of bioinformatics research, and this type of data processing is becoming essential to biological sciences of all types.

From the petri dish to the computer


The genomic data for a given bacterium can contain as many as five million characters: a gigantic sentence made up of combinations of four letters, A, T, C, and G, which represent the four basic building blocks of DNA. The central quest of bioinformatics is to extract meaningful information from these giant lengths of letters, a task akin to deciphering the language of DNA to pick out words, or combinations of letters, that tell the story of how organisms function at the molecular level.
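As a small illustration of treating a genome as text, the sketch below counts the four letters in a DNA string, computes the fraction that are G or C, and tallies short "words" of a fixed length (often called k-mers). The function names and the toy sequence are assumptions made for this example and are not taken from the Dong Lab's software.

```python
from collections import Counter


def base_counts(sequence):
    """Count how often each of A, T, C, and G appears in a DNA string."""
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) for base in "ATCG"}


def gc_content(sequence):
    """Fraction of the A/T/C/G letters that are G or C (0.0 if none)."""
    counts = base_counts(sequence)
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else 0.0


def kmer_counts(sequence, k):
    """Count every overlapping 'word' of length k in the sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Made-up toy sequence; a real bacterial genome is millions of letters long.
toy_genome = "ATGCGCGATATTAGCGC"
print(base_counts(toy_genome))
print(round(gc_content(toy_genome), 3))
print(kmer_counts(toy_genome, 3).most_common(2))
```

Simple summaries like these are often the first thing computed on a new sequence, because letter and word frequencies differ between organisms and between regions of the same genome.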

Sequenced genomes received by the Dong Lab team come in pieces, because sequencing machines cannot read a five-million-character sequence in one pass. The first step in bioinformatic data analysis is to assemble the pieces into a whole, like a giant puzzle. After assembly, other software is used for gene prediction, which looks for patterns in the sequences that represent genes. Once genes are located, or predicted, researchers work out each gene’s function, a step called gene annotation.
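The puzzle analogy for assembly can be sketched in a few lines: repeatedly find the two pieces whose ends overlap the most and join them. This greedy approach is a teaching simplification, not how production assemblers work, and the function names, the `min_overlap` parameter, and the toy reads are all assumptions made for this example.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Merge reads pairwise by largest overlap until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to join; return the separate pieces
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Made-up toy reads that overlap by three letters each.
print(greedy_assemble(["ATGCGT", "CGTACC", "ACCGGA"]))  # ['ATGCGTACCGGA']
```

Real data adds sequencing errors, repeated regions, and millions of reads, which is why assembly at genome scale needs specialized software and substantial computing resources.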

Claudia Vilo Munoz, Dong Lab member and Fulbright Scholar from Chile

Claudia Vilo Munoz is a doctoral student on Dong’s team who annotates entire genomes, rather than searching for a single identifying gene among a community of microbiota as Munro does. Where Munro looks for a common thread to identify similar species in a microscopic community, Vilo focuses on a single bacterium and completes its entire genome. The completed genome is then submitted to online databases for announcement, where others can easily access and use her results.

“My work involves annotation of all 5000-6000 genes in a single bacterium so we can identify the differences between that organism and a similar one,” Vilo said.  “You can imagine if we did not have computational tools, we just would not know how the genome works.” 

Researchers can choose from a variety of computational tools. Vilo accesses genome databanks such as NCBI’s GenBank and uses the Perl programming language, the free online software RAST for gene annotation, SOAPdenovo for assembly, and Artemis to create visual representations of data.

Vilo is a Fulbright Scholar from Chile, where she began a master’s degree in biology. Upon learning about different bioinformatics programs in the U.S., she applied for a Fulbright scholarship and won the opportunity to study at UNT with Dong. The UNT program was one of the few willing to teach Vilo the necessary computer science skills to proceed in bioinformatics studies. In the last year, she has worked with Dr. Daniel Kunz to complete annotation of a bacterial strain that is able to metabolize cyanide. Specifically, Vilo was looking for the gene that makes this strain different from the others so that the source of its extraordinary ability to use cyanide can be pinpointed, a continuation of research Kunz had begun. This particular strain of bacteria could potentially be put to use as a biodegradation tool to help clean the environment and prevent water contamination in the proximity of gold mines, which produce cyanide waste.

Vilo and Dong are currently seeking National Science Foundation funding for a collaboration with the Universidad de Antofagasta, where Vilo received her BSc.  Possible research projects include studying genomes of species endemic to Northern Chile. 

“The spirit of my scholarship is to share the knowledge in Chile, so we are hoping to get NSF funding to do collaborative work there, where it is harder to get funding,” Vilo said.  “I would like to go back to my university, teach bioinformatics there, and continue these investigations.” 

Mathematics and Biology: a Symbiotic Relationship

Dr. Rajeev Azad brings expertise in Markov chain mathematical modeling to UNT. He joined the faculties of the Departments of Biological Sciences and Mathematics in January 2011. With a background in biophysics, biomathematics, and gene prediction, and a decade of research in horizontal gene transfer, he has augmented the growing bioinformatics program. Azad was previously a research assistant professor at Case Western Reserve University and the University of Pittsburgh, where he studied bacterial gene transfer and structural variations in human genomes in projects funded by the National Institutes of Health. Azad’s bioinformatics laboratory at UNT consists of nearly a dozen students from various backgrounds who apply mathematics and computer science to solve biological problems.

Dr. Rajeev Azad, Professor of Mathematics and Biological Sciences, member of UNT bioinformatics faculty

In 2007, Azad and his collaborators introduced applications of Markov chain models to genome segmentation. The higher-order Markov models they developed are a family of statistical computational models that are more sensitive, and therefore yield results with more biological significance, than their predecessors.
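For readers unfamiliar with Markov chain models of DNA, the sketch below shows the general idea: estimate the probability of each letter given the few letters before it, then score how well a fragment fits the model. This is a generic textbook illustration under stated assumptions, not the segmentation method Azad and his collaborators published. The function names, the pseudocount smoothing, and the toy training sequences are all hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(sequence, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from one sequence.

    Returns a function prob(context, base). The pseudocount keeps unseen
    transitions from getting probability zero (an illustrative choice).
    """
    counts = defaultdict(lambda: defaultdict(float))
    seq = sequence.upper()
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        seen = counts.get(context, {})
        total = sum(seen.values()) + pseudocount * len(ALPHABET)
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(sequence, prob, order=2):
    """Sum of log-probabilities of each base given its preceding context."""
    seq = sequence.upper()
    return sum(
        math.log(prob(seq[i - order:i], seq[i])) for i in range(order, len(seq))
    )


# Made-up training sequences standing in for two compositionally
# different genome regions.
gc_rich = train_markov("GCGCGCGGCCGCGCGC", order=1)
at_rich = train_markov("ATATTATAATATTA", order=1)

fragment = "GCGGCG"
print(log_likelihood(fragment, gc_rich, order=1) >
      log_likelihood(fragment, at_rich, order=1))  # True
```

Raising `order` lets the model capture longer patterns, at the cost of needing more training data to estimate each probability reliably.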

What do statistical models have to do with biological research?

“First you have to understand the biology of the problem,” Azad said.  “Once you understand the biology of it, then you have to think about how to develop a model to address that problem.  So then you are thinking about mathematics and statistics – how to get a model that can describe that system and help solve that problem.”

Miah Jn Charles is a mathematics graduate student currently working with Azad on cancer genomics.  They are using statistical methods to identify structural variations in cancer genomes such as duplications or deletions of genes or chromosomal regions.  Such variations have  been implicated in many diseases including cancer.

Jn Charles is studying the effectiveness of different algorithms for locating “copy number variations” in chromosomal regions. She applies information-entropy-based techniques and evaluates both the sensitivity and the specificity of their predictions about the location of copy number variants within a DNA sequence. Her work will be validated, or tested against pre-existing methods to determine its accuracy, a signature practice in bioinformatics research.
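Sensitivity and specificity have standard definitions that are easy to show in code. The sketch below assumes, purely for illustration, that a chromosome has been divided into windows and that each window is labeled 1 if it contains a copy number variant and 0 if it does not. The function names and the toy labels are hypothetical and do not come from Jn Charles's project.

```python
def confusion_counts(truth, predicted):
    """Count true/false positives and negatives over paired 0/1 labels."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    return tp, fn, fp, tn


def sensitivity(truth, predicted):
    """Fraction of real variants that the method found: TP / (TP + FN)."""
    tp, fn, _, _ = confusion_counts(truth, predicted)
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(truth, predicted):
    """Fraction of variant-free windows left unflagged: TN / (TN + FP)."""
    _, _, fp, tn = confusion_counts(truth, predicted)
    return tn / (tn + fp) if (tn + fp) else 0.0


# Made-up labels for five windows: 1 = variant present, 0 = absent.
known = [1, 1, 0, 0, 0]
called = [1, 0, 0, 1, 0]
print(sensitivity(known, called))  # 0.5
print(round(specificity(known, called), 3))  # 0.667
```

Reporting both numbers matters because a method can trivially score well on one by sacrificing the other, for example by flagging every window as a variant.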

This is the first bioinformatics project Jn Charles has ever worked on, and with bachelor’s degrees in mathematics and computer programming, she never thought she would end up doing biological research. But now, she is hooked.

Miah Jn Charles, mathematics graduate student from St. Lucia, Caribbean.

“I was doing mostly pure mathematics, and I wanted to get into applied mathematics, and do something with a real-world application,” she said.  “I took an introductory bioinformatics class that really opened my eyes to current issues in biology.  That’s what got me involved.  There are so many opportunities to solve problems in this field, and I’d like to look at one, study it, and see how it can be solved.”

Mapping the History of Genomes for a Healthy Future

Dr. Azad’s most recent work is in the area of bacterial genome evolution and the phenomenon of horizontal gene transfer, in which bacteria can accept genes from other organisms laterally instead of inheriting them from a parent cell. For example, some bacterial strains have been able to change their composition, or evolve, by acquiring alien genes from another species, which gives them special characteristics, such as the ability to survive in hostile environments and resist antibiotic drugs.

Tracking these behaviors requires advanced statistical models that can compare different bacterial genomes, look for variations, and map the extent of gene transfer among bacteria. To perform this comparison, Azad developed probabilistic models to quantify the behaviors. Here again, he said, statistics can be used to develop models for understanding genomes, bacterial genomes in this case.

Changing genetic composition over time, or evolution, is not confined to bacteria or viruses. Roughly 350 million years ago, the mammalian sex chromosomes began to evolve, around the time the lineage leading to mammals split from the one leading to birds. Before that split, there was no X chromosome, and its emergence in the lineage leading to humans enabled the development of mankind as we know it. The chromosome made several evolutionary jumps over the course of time, leaving a record that can be compared across related species such as gorillas, chimpanzees, and humans.

Ravi Shanker Pandey, doctoral biology student working in bioinformatics of human sex chromosomes

Ravi Shanker Pandey, a doctoral student working in Azad’s bioinformatics lab, tracks the history of chromosomal evolution by studying changes in the DNA sequences of chromosomes. In one study he compared mammalian sex chromosomes and documented the strata, or stages, in their evolution. Pandey will build on this work in his next research project by studying X-inactivation, a phenomenon related to the degradation of the male Y chromosome over time. Since fossils of older chromosomes before degradation are in short supply, Pandey will first have to develop a computational model defining the change over time, and then access the GenomeBrowser website to obtain genomic structures, so he can study how the evolution of different species correlates.

“For me, studying this change was like reading a story that unfolded over time,” Pandey said.  “As I learned the biological history of this problem, I found it very interesting and decided I should study in this area.”  Pandey said he is in front of a computer almost 24 hours a day, but he likes not being tied down to a laboratory.  “I can do my work from anywhere, at any time.”

Additional Context:

Other groups at UNT use similar data analysis techniques.  Dr. Armin Mikler, professor of computer science and engineering, has been studying epidemiological data using computational methods and teaches biocomputing together with Dr. Dong.  Epidemiology studies the patterns, causes, and effects of disease in a population. In 2008, Mikler established CERL, the Computational Epidemiology Research Laboratory, where statistical and probabilistic models are applied to the study of how disease spreads in an epidemic outbreak in a particular region while accounting for social behavior, available health services, and emergency response plans.  Team members at CERL develop algorithms and build simulators to forecast human health issues in public emergency situations. 

“Computational life sciences means everything in biology that has to be solved through computational methods,” Dr. Azad said. “So it includes computational epidemiology, it includes bioinformatics, and it includes every subdiscipline for addressing biological problems, including systems biology, ecological modeling, spatiotemporal behavior of plant systems, etc. If you have applied mathematical and computational models to understand a biological problem, then that comes under the umbrella of computational life sciences.”

UNT plant biologists Vladimir Shulaev and Ron Mittler, both members of the Signaling Mechanisms in Plants research cluster, study the genomes of plants.   In 2011, Shulaev and Mittler were the first to publish the genome of strawberries. In June 2012 the UNT Department of Forensic and Investigative Genetics and the Center for Human Identification of the UNT Health Science Center invested in a DNA sequencing system via collaboration with Life Technologies Corporation, allowing those departments to step even further to the forefront of the field of forensic science through DNA analysis.

“A few years from now, with the rapid pace of progress in this field, the mention of biology will mean both experimental wet-lab biology and bioinformatics,” Dr. Azad said. “Traditional wet-lab experimental science is becoming more and more quantitative, and as a result, both wet-lab and dry-lab techniques are being integrated to conclusively address many hypotheses and questions in biology.”

Seven of Azad's bioinformatics team members. From left: Nasim Al Sajjad, biochemistry; Krithika Ganapathy, TAMS; Tanvi Shah, TAMS; Ravi Shanker Pandey, biology-bioinformatics; Mehul Jani, biochemistry and molecular biology; Dr. Rajeev Azad; Miah Jn Charles, mathematics; Abdullahil Baki Bhuiyan, biochemistry and molecular biology.

Computational Life Sciences and Complex Bio-Environmental Systems are defined as “additional strategic areas of research selected by UNT for investment and faculty hiring in order to build capacity in strategically-important research domains.”

Dr. Rajeev Azad is a member of the Signaling Mechanisms in Plants research cluster and the Center for Advanced Scientific Computing and Modeling (CASCaM).

Dr. Qunfeng Dong is a member of the Computational Chemical Biology, Developmental Integrative Biology, and Materials Modeling research clusters and a member of CASCaM.

Dr. Armin Mikler is a member of the UNT Institute of Applied Science.

All photographs by Amelia Jaycen.