Human Genome
(M. Caputi)
Funded by the NSF
In this project, we plan to develop a novel program, running on the proposed supercluster, to
locate short consensus sequences in the human genome. The sequencing of the human genome has opened the door
to a new era in biomedical research. The challenge we now face is the proper processing and interpretation
of the enormous amount of data that has been collected. The human genome is divided into basic units
called genes; genes contain the information needed by cells to synthesize proteins. The human genome is
composed of 3 × 10^9 nucleotides that code for roughly 30,000 genes. Protein synthesis is the work of the
translation machinery, which translates a messenger ribonucleic acid (mRNA) and assembles a functional
protein. The messenger RNA is produced by another cellular machinery, the transcription machinery, which
transcribes a gene contained in the genome into RNA.
Although the human genome contains fewer than 30,000 genes, there are roughly 70,000 different
messenger RNAs and thus proteins in a cell. Another interesting fact is that the average length of a gene is
27,000 nucleotides while the average length of a messenger RNA, which will be translated into a protein,
is 3,000 nucleotides. Genes are composed of sequences, named exons, which code for proteins, and
intervening sequences, named introns, that do not code for proteins and are removed by a gene splicing
mechanism. The splicing mechanism removes all the non-coding introns from a primary messenger RNA
to yield a final messenger RNA where only coding exons are present. Intron removal and exon retention is
a highly regulated process. This process can generate a number of different messenger RNAs from a
single precursor RNA, through alternative inclusion of different exons, in a process named alternative splicing. Alternative splicing is
regulated in response to different physiological and developmental stimuli.
The precise mechanism underlying this regulation remains largely unknown. In mammals, intronic sequences
are extremely large when compared to exonic coding sequences. The average exon length is 300
nucleotides, whereas the average intron length is 2,700 nucleotides. In several genes, introns can exceed
100,000 nucleotides. The mechanism that allows the splicing machinery to precisely excise large introns
(up to 5 × 10^5 nucleotides) is unknown. The proper understanding of such a mechanism is of great importance
since aberrant splicing of these long introns is often the cause of genetic diseases and can be
linked to carcinogenesis. Cellular proteins often recognize short RNA sequences to properly direct and
position the splicing machinery. These proteins are the key factors in the regulation of the splicing process.
Several of the proteins that recognize short exonic sequences have been characterized in past years;
nevertheless, almost nothing is known about proteins that bind large introns.
We plan to develop a novel program able to scan long intronic sequences (30,000 nucleotides or more),
looking for short conserved motifs (6-12 nucleotides) that could be involved in the excision of these long introns
from the primary RNA transcript. This will involve a two-fold process, the first part of which is generating a data
structure using the Depth-First Iterative-Deepening (DFID) method. This method first performs a depth-first
search to depth one. Then, discarding the nodes generated in the first search, it starts over and performs a depth-first
search to depth two, then three, and so on, until the goal state is reached. This is a computationally intensive problem since
we have to generate all possible patterns in the sequence before we can run a search for common motifs.
These patterns are generated and stored in a tree data structure. The second part is to
traverse the tree to locate matches using the same DFID method. The number of potential sequences that
could be generated is enormous, on the order of n(n − 1)/r. This is based on the assumption that we
consider motifs of all lengths. However, due to the overwhelming storage and processing required for such a
task, we limit ourselves to motifs of lengths 6-12 nucleotides. This is still too computationally intensive for
the processing power of a standalone desktop; a cluster-based high-performance supercomputer
is therefore well suited to the task.
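The two-part process described above can be illustrated with a minimal Python sketch. It is not the program proposed here, only one possible reading of the description, and it rests on several assumptions: the tree is taken to be a trie of all substrings up to the maximum motif length, a "match" is taken to be a motif that occurs at least `min_count` times, and the names `TrieNode`, `build_trie`, `depth_limited_search`, `repeated_motifs` and `min_count` are hypothetical. For simplicity the trie is built once and only the traversal is repeated with an increasing depth limit; the pruning of rare prefixes is an addition of the sketch, not something stated in the text.

```python
class TrieNode:
    """One node of the motif tree; count = occurrences of the path leading here."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


def build_trie(seq, max_len):
    """Part 1: store every substring of length <= max_len in a tree."""
    root = TrieNode()
    for start in range(len(seq)):
        node = root
        for ch in seq[start:start + max_len]:
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1
    return root


def depth_limited_search(node, prefix, limit, min_count, found):
    """Depth-first search that stops at depth `limit` and records motifs there."""
    if len(prefix) == limit:
        found[prefix] = node.count
        return
    for ch, child in sorted(node.children.items()):
        # A longer motif cannot occur more often than its prefix,
        # so branches below the threshold are skipped (sketch assumption).
        if child.count >= min_count:
            depth_limited_search(child, prefix + ch, limit, min_count, found)


def repeated_motifs(seq, min_len=6, max_len=12, min_count=2):
    """Part 2: iterative deepening over the tree, one pass per motif length."""
    root = build_trie(seq, max_len)
    found = {}
    for limit in range(min_len, max_len + 1):
        depth_limited_search(root, "", limit, min_count, found)
    return found


if __name__ == "__main__":
    # Toy sequence, not real intron data.
    motifs = repeated_motifs("ACGTACGTAC", min_len=3, max_len=4, min_count=2)
    for motif, count in sorted(motifs.items()):
        print(motif, count)
```

Restricting motif lengths keeps the tree small: a 30,000-nucleotide intron contributes 209,944 windows of lengths 6-12 in total, so the trie built here has at most that many nodes.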
Our goal is to make this program available through a web-based interface. Therefore, any user with
web access will be able to use it to scan a putative sequence for consensus motifs. Although the focus of our
research is on long intronic sequences, this kind of program will allow the scanning of any part of the genome
for short consensus sequences. We foresee the use of such a program in a variety of applications and
research projects, making it a useful tool for the analysis of the human genome.
As a case study we will utilize the bcl-2 and bcl-x genes. These are members of the bcl family of
apoptotic genes. These genes are involved in several types of cancer, and they are both characterized by an
extremely large intron (50,000 nucleotides in bcl-x, 200,000 nucleotides in bcl-2). We will then extend our
search to all the large introns (30,000 nucleotides or more) present in the human, mouse and rat databases. The
final goal of our research will be to test the short consensus sequences we characterize in experimental
models, using short minigenes expressed in human cells cultured in vitro.
In summary, the proposed project involves identifying one or more frequently occurring substrings
in a long string. Finding out how long introns are spliced out of the messenger RNA can probably help us
understand the causes of some genetic diseases and some types of cancer. The task is not only computationally
intensive, but also communication intensive if the long string is not partitioned and replicated in a careful
way. The shared-memory approach used in an SGI Altix certainly helps in reducing the complexity of the
latter. This will be confirmed through extensive simulation once we acquire the equipment.
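To make the partitioning concern concrete, the following Python sketch shows one possible way to split a long string so that each worker can count motifs without exchanging data during the count. It assumes a distributed setting rather than the shared-memory machine discussed above, and the function names (`partition_with_overlap`, `count_owned_kmers`, `count_kmers_partitioned`) are hypothetical. Each chunk is extended by `max_len - 1` characters, so every window that starts inside a chunk is fully contained in it, and each window is counted by exactly one chunk.

```python
from collections import Counter


def partition_with_overlap(seq, n_parts, max_len):
    """Split seq into chunks of equal 'owned' size plus max_len - 1 replicated characters."""
    size = max(1, -(-len(seq) // n_parts))  # ceiling division
    overlap = max_len - 1
    return [(start, size, seq[start:start + size + overlap])
            for start in range(0, len(seq), size)]


def count_owned_kmers(chunk, owned, k):
    """Count k-mers that start within the first `owned` positions of the chunk."""
    counts = Counter()
    last_start = min(owned, len(chunk) - k + 1)
    for i in range(last_start):
        counts[chunk[i:i + k]] += 1
    return counts


def count_kmers_partitioned(seq, n_parts, k, max_len=12):
    """Merge per-chunk counts; in a cluster each chunk would go to one worker."""
    if k > max_len:
        raise ValueError("k must not exceed max_len, or windows would be cut off")
    total = Counter()
    for _start, owned, chunk in partition_with_overlap(seq, n_parts, max_len):
        total += count_owned_kmers(chunk, owned, k)
    return total


if __name__ == "__main__":
    # Toy sequence, not real intron data.
    seq = "ACGTACGTACGGTACGT"
    direct = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    print(count_kmers_partitioned(seq, n_parts=3, k=4) == direct)
```

The replicated overlap is the price paid for avoiding communication. On a shared-memory system every processor could instead read the single copy of the string directly.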