Human Genome
(M. Caputi)
Funded by the NSF



In this project, we plan to develop a novel program, running on the proposed supercluster, to locate short consensus sequences in the human genome. The sequencing of the human genome has opened the door to a new era in biomedical research. The challenge we now face is the proper processing and interpretation of the enormous amount of data that has been collected. The human genome is divided into basic units called genes; genes contain the information needed by cells to synthesize proteins. The human genome is composed of 3 × 10^9 nucleotides that code for roughly 30,000 genes. Protein synthesis is the work of the translation machinery, which translates a messenger ribonucleic acid (mRNA) and assembles a functional protein. The messenger RNA is produced by another cellular machinery, the transcription machinery, which transcribes a gene contained in the genome into RNA.

Although the human genome contains fewer than 30,000 genes, there are roughly 70,000 different messenger RNAs, and thus proteins, in a cell. Another interesting fact is that the average length of a gene is 27,000 nucleotides, while the average length of a messenger RNA, which will be translated into a protein, is 3,000 nucleotides. Genes are composed of sequences named exons, which code for proteins, and intervening sequences named introns, which do not code for proteins and are removed by an RNA splicing mechanism. The splicing mechanism removes all the non-coding introns from a primary messenger RNA to yield a final messenger RNA in which only coding exons are present. Intron removal and exon retention is a highly regulated process. It can generate a number of different messenger RNAs from a single precursor RNA through the alternative inclusion of different exons, a process named alternative splicing. Alternative splicing is regulated in response to different physiological and developmental stimuli, but the precise mechanism underlying this regulation remains largely unknown. In mammals, intronic sequences are extremely large compared to exonic coding sequences. The average exon length is 300 nucleotides, whereas the average intron length is 2,700 nucleotides. In several genes, introns can exceed 100,000 nucleotides. The mechanism that allows the splicing machinery to precisely excise large introns (up to 5 × 10^5 nucleotides) is unknown. Understanding this mechanism is of great importance, since aberrant splicing of these long introns is often the cause of genetic diseases and can be linked to carcinogenesis. Cellular proteins often recognize short RNA sequences to properly direct and position the splicing machinery. These proteins are the key factors in the regulation of the splicing process. Several of the proteins that recognize short exonic sequences have been characterized in past years; nevertheless, almost nothing is known about proteins that bind large introns.

We plan to develop a novel program able to scan long intronic sequences (30,000 nucleotides or more), looking for short conserved motifs (6-12 nucleotides) that could be involved in the excision of these long introns from the primary transcript RNA. This will be a two-step process. The first step generates a data structure using the Depth-First Iterative-Deepening (DFID) method. In this method, a depth-first search is first performed to depth one. Then, discarding the nodes generated in the first search, the search starts over and proceeds depth-first to level two, then three, and so on until the goal state is reached. This is a computationally intensive problem, since all possible patterns in the sequence must be generated before a search for common motifs can be run. These patterns are generated and stored in a tree structure. The second step traverses the tree to locate matches, again using DFID. The number of potential sequences that could be generated is enormous, on the order of n(n − 1)/r, assuming motifs of all lengths are considered. Because of the overwhelming storage and processing this would require, we limit ourselves to motifs of 6-12 nucleotides. Even so, the task remains too computationally intensive for a standalone desktop, so a cluster-based high-performance supercomputer is well suited to it.
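The two steps described above can be illustrated with a minimal Python sketch. It is not the planned program: the names `TrieNode`, `build_trie`, `depth_limited`, and `dfid_motifs`, the `min_count` threshold, and the choice of a trie as the tree structure are all assumptions made for illustration. The sketch stores every window of up to `max_len` nucleotides in a trie, then repeats a depth-limited, depth-first traversal for each motif length from `min_len` to `max_len`, which is the iterative-deepening pattern.

```python
class TrieNode:
    """One node per distinct substring; count = number of occurrences."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


def build_trie(sequence, max_len):
    """Step 1: insert every window of up to max_len nucleotides into a trie."""
    root = TrieNode()
    for start in range(len(sequence)):
        node = root
        for base in sequence[start:start + max_len]:
            node = node.children.setdefault(base, TrieNode())
            node.count += 1
    return root


def depth_limited(node, prefix, limit, min_count, found):
    """Depth-first search that stops `limit` levels below `node`."""
    if limit == 0:
        if node.count >= min_count:
            found[prefix] = node.count
        return
    for base, child in node.children.items():
        # A motif cannot be frequent if one of its prefixes is not (assumed pruning rule).
        if child.count >= min_count:
            depth_limited(child, prefix + base, limit - 1, min_count, found)


def dfid_motifs(sequence, min_len=6, max_len=12, min_count=2):
    """Step 2: restart the depth-first search once per depth limit."""
    root = build_trie(sequence, max_len)
    found = {}
    for limit in range(min_len, max_len + 1):
        depth_limited(root, "", limit, min_count, found)
    return found


if __name__ == "__main__":
    # Toy sequence and shortened motif lengths, chosen only to keep the output small.
    for motif, count in sorted(dfid_motifs("ACGTACGTAC", 3, 4).items()):
        print(motif, count)
```

A trie of this kind grows quickly with sequence length, which is consistent with the storage concern raised above; restricting `max_len` to 12 bounds its depth.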

Our goal is to make this program available through a web-based interface, so that any user with web access will be able to use it to scan a putative sequence for consensus motifs. Although the focus of our research is on long intronic sequences, this kind of program will allow the scanning of any part of the genome for short consensus sequences. We foresee its use in a variety of applications and research projects, making it a useful tool for the analysis of the human genome.

As a case study we will use the bcl-2 and bcl-x genes, members of the bcl family of apoptotic genes. These genes are involved in several types of cancer, and both are characterized by an extremely large intron (50,000 nucleotides in bcl-x, 200,000 nucleotides in bcl-2). We will then extend our search to all the large introns (30,000 nucleotides or more) present in the human, mouse and rat databases. The final goal of our research will be to test the short consensus sequences we characterize in experimental models, using short minigenes expressed in human cells cultured in vitro.

In summary, the proposed project involves identifying one or more frequently occurring substrings in a long string. Finding out how long introns are spliced out of the messenger RNA can probably help us understand the causes of some genetic diseases and some types of cancer. The task is not only computationally intensive but also communication intensive if the long string is not partitioned and replicated carefully. The shared-memory approach used in an SGI Altix should help reduce the complexity of the latter; we will confirm this through extensive simulation once we acquire the equipment.
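To make the partitioning concern concrete, the following Python sketch shows one possible way to split a long string so that per-chunk counts can be combined without exchanging boundary data. It is an illustration under stated assumptions, not the project's design: the function names and the `chunk_size` parameter are hypothetical, and the chunks are processed sequentially here, whereas on a cluster each chunk would go to a separate worker. Each chunk replicates the first k − 1 nucleotides of the next one, so every window of length k is counted in exactly one chunk.

```python
from collections import Counter


def overlapping_chunks(sequence, chunk_size, k):
    """Yield chunks that overlap by k - 1 characters (hypothetical partitioning scheme)."""
    for start in range(0, len(sequence), chunk_size):
        chunk = sequence[start:start + chunk_size + k - 1]
        if len(chunk) >= k:  # shorter tails contain no complete window
            yield chunk


def count_kmers(chunk, k):
    """Count every window of length k in one chunk."""
    return Counter(chunk[i:i + k] for i in range(len(chunk) - k + 1))


def partitioned_count(sequence, chunk_size, k):
    """Sum the per-chunk counts; equals counting over the unpartitioned string."""
    total = Counter()
    for chunk in overlapping_chunks(sequence, chunk_size, k):
        total.update(count_kmers(chunk, k))
    return total


if __name__ == "__main__":
    seq = "ACGTACGTACGGT"
    # Partitioned and unpartitioned counts should agree.
    print(partitioned_count(seq, 4, 3) == count_kmers(seq, 3))
```

The replicated overlap is small relative to the chunk when k is at most 12, so the extra storage is modest; on a shared-memory machine the chunks could instead be index ranges into a single copy of the string.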