Gene Prediction
Gene finding typically refers to the area of computational biology that is concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes, but may also include other functional elements such as RNA genes and regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
In its earliest days, "gene finding" was based on painstaking
experimentation on living cells and organisms. Statistical analysis of
the rates of homologous recombination of several different genes could determine their order on a certain chromosome, and information from many such experiments could be combined to create a genetic map
specifying the rough location of known genes relative to each other.
Today, with comprehensive genome sequence and powerful computational
resources at the disposal of the research community, gene finding has
been redefined as a largely computational problem.
Determining that a sequence is functional should be distinguished from determining the function of the gene or its product. The latter still demands in vivo experimentation through gene knockout and other assays, although frontiers of bioinformatics research are making it increasingly possible to predict the function of a gene based on its sequence alone.
Extrinsic Approaches
In extrinsic (or evidence-based) gene finding systems, the target
genome is searched for sequences that are similar to extrinsic evidence
in the form of the known sequence of a messenger RNA (mRNA) or protein product. Given an mRNA sequence, it is trivial to derive the unique DNA sequence from which it must have been transcribed, although in eukaryotes that sequence may be interrupted by introns in the genome. Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the genetic code.
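Reverse translation yields a family of sequences, rather than a single one, because most amino acids are encoded by more than one codon. The following is a minimal sketch in Python, assuming the standard genetic code; the function names `count_coding_sequences` and `reverse_translate` are illustrative and not taken from any particular tool.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order
# (third base varies fastest); "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: amino acid -> list of codons that encode it.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def count_coding_sequences(protein):
    """Number of distinct DNA sequences that encode the given protein."""
    total = 1
    for aa in protein:
        total *= len(CODONS_FOR[aa])
    return total


def reverse_translate(protein):
    """Yield every DNA sequence encoding the protein (feasible only for short peptides)."""
    for combo in product(*(CODONS_FOR[aa] for aa in protein)):
        yield "".join(combo)
```

A peptide of methionine followed by tryptophan has exactly one coding sequence, since each of these amino acids has a single codon, whereas leucine alone has six. The count grows multiplicatively with protein length, which is why practical tools compare sequences at the protein level instead of enumerating candidates.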
Once candidate DNA sequences have been determined, it is a relatively
straightforward algorithmic problem to efficiently search a target
genome for matches, complete or partial, and exact or inexact. BLAST is a widely used system designed for this purpose.
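To make the idea of exact and inexact matching concrete, the sketch below scans a genome string for windows that differ from a query by at most a given number of substitutions. It is a deliberately naive illustration: it is not how BLAST works (BLAST relies on seed-and-extend heuristics and also handles gaps), and the function name `find_matches` is hypothetical.

```python
def find_matches(genome, query, max_mismatches=0):
    """Return (position, mismatches) for each window of the genome that
    differs from the query by at most max_mismatches substitutions."""
    hits = []
    m = len(query)
    for start in range(len(genome) - m + 1):
        mismatches = 0
        for a, b in zip(genome[start:start + m], query):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences; abandon this window
        else:
            # Loop finished without exceeding the mismatch budget.
            hits.append((start, mismatches))
    return hits
```

With `max_mismatches=0` this reduces to exact matching. The running time is proportional to the genome length times the query length, which is acceptable for a toy example but far too slow for whole genomes.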
A high degree of similarity to a known messenger RNA or protein
product is strong evidence that a region of a target genome is a
protein-coding gene. However, to apply this approach systematically
requires extensive sequencing of mRNA and protein products. Not only is
this expensive, but in complex organisms, only a subset of all genes in
the organism's genome are expressed at any given time, meaning that
extrinsic evidence for many genes is not readily accessible in any
single cell culture. Thus, in order to collect extrinsic evidence for
most or all of the genes in a complex organism, many hundreds or
thousands of different cell types must be studied, which itself
presents further difficulties. For example, some human genes may be
expressed only during development as an embryo or fetus, which might be
difficult to study for ethical reasons.
Despite these difficulties, extensive transcript and protein
sequence databases have been generated for human as well as other
important model organisms in biology, such as mice and yeast. For
example, the RefSeq database contains transcript and protein sequence from many different species, and the Ensembl
system comprehensively maps this evidence to human and several other
genomes. It is, however, likely that these databases are incomplete
and contain small but significant amounts of erroneous data.
Ab Initio Approaches
Because of the inherent expense and difficulty in obtaining
extrinsic evidence for many genes, it is also necessary to resort to ab initio gene finding, in which genomic DNA sequence
alone is systematically searched for certain tell-tale signs of
protein-coding genes. These signs can be broadly categorized as either signals, specific sequences that indicate the presence of a gene nearby, or content, statistical properties of protein-coding sequence itself. Ab initio gene finding might be more accurately characterized as gene prediction, since extrinsic evidence is generally required to conclusively establish that a putative gene is functional.
In the genomes of prokaryotes, genes have specific and relatively well-understood promoter sequences (signals), such as the Pribnow box and transcription factor binding sites, which are easy to systematically identify. Also, the sequence coding for a protein occurs as one contiguous open reading frame (ORF), which is typically many hundreds or thousands of base pairs long. The statistics of stop codons
are such that even finding an open reading frame of this length is a
fairly informative sign. (Since 3 of the 64 possible codons in the
genetic code are stop codons, one would expect a stop codon
approximately every 20-25 codons, or 60-75 base pairs, in a random sequence.) Furthermore, protein-coding DNA has certain periodicities
and other statistical properties that are easy to detect in sequence of
this length. These characteristics make prokaryotic gene finding
relatively straightforward, and well-designed systems are able to
achieve high levels of accuracy.
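The stop-codon argument can be made quantitative. Assuming uniformly random, independent bases, each codon is a stop with probability 3/64, so the mean spacing between stops is 64/3, or about 21 codons, and the chance that n consecutive codons contain no stop is (61/64)^n. The sketch below computes that probability and scans the forward strand for long ORFs. It is a simplified illustration under stated assumptions (ATG as the only start codon, forward strand only, a hypothetical `min_codons` cutoff), not the method of any specific gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def prob_no_stop(n_codons):
    """Probability that n random codons (uniform, independent bases)
    contain no stop codon."""
    return (61 / 64) ** n_codons


def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for each ATG...stop ORF on the forward
    strand. `end` is exclusive and includes the stop codon; the length
    threshold counts codons before the stop."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # remember the first start codon in this stretch
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

Under these assumptions, `prob_no_stop(300)` is below one in a million, which is why an uninterrupted reading frame of several hundred codons is unlikely to arise by chance. A fuller implementation would also scan the reverse complement and allow alternative start codons.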
Ab initio gene finding in eukaryotes,
especially complex organisms like humans, is considerably more
challenging for several reasons. First, the promoter and other
regulatory signals in these genomes are more complex and less
well-understood than in prokaryotes, making them more difficult to
reliably recognize. Two classic examples of signals identified by
eukaryotic gene finders are CpG islands and polyadenylation (poly(A)) signals.
Second, splicing
mechanisms employed by eukaryotic cells mean that a particular
protein-coding sequence in the genome is divided into several parts (exons), separated by non-coding sequences (introns).
(Splice sites are themselves another signal that eukaryotic gene
finders are often designed to identify.) A typical protein-coding gene
in humans might be divided into a dozen exons, each less than two
hundred base pairs in length, and some as short as twenty to thirty. It
is therefore much more difficult to detect periodicities and other
known content properties of protein-coding DNA in eukaryotes.
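As an illustration of one of the eukaryotic signals mentioned above, the sketch below flags candidate CpG islands with a sliding window. It assumes the commonly cited criteria of a window of at least 200 base pairs, GC content above 50% and an observed-to-expected CpG ratio above 0.6; these thresholds and the function names are assumptions for the example, and real annotation pipelines differ in detail.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    c = seq.count("C")
    g = seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def cpg_island_windows(seq, window=200, step=1, min_gc=0.5, min_ratio=0.6):
    """Return start positions of windows that satisfy both thresholds."""
    seq = seq.upper()
    starts = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        if gc_fraction(chunk) > min_gc and cpg_obs_exp(chunk) > min_ratio:
            starts.append(start)
    return starts
```

Overlapping windows that pass the test would normally be merged into a single island; that step is omitted here to keep the example short.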
Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex probabilistic models, such as hidden Markov models (HMMs), to combine information from a variety of different signal and content measurements. The GLIMMER system is a widely used and highly accurate gene finder for prokaryotes. GeneMark is another popular approach. Eukaryotic ab initio gene finders, by comparison, have achieved only limited success; notable examples are the GENSCAN and geneid programs. A few programs, such as CONTRAST, also use machine learning approaches such as support vector machines for gene prediction.
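The way a hidden Markov model labels a sequence can be sketched with a toy two-state model decoded by the Viterbi algorithm. Everything numerical here is invented for illustration: the states ("N" for non-coding, "C" for coding), the transition probabilities and the GC-rich emission probabilities for the coding state are assumptions, and real gene finders use far richer state spaces (codon positions, splice sites, exon and intron length distributions).

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding (toy labels)
START = {"N": 0.5, "C": 0.5}
TRANS = {
    "N": {"N": 0.9, "C": 0.1},
    "C": {"N": 0.1, "C": 0.9},
}
# Invented emission probabilities: the coding state favours G and C.
EMIT = {
    "N": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "C": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for seq as a string of labels."""
    if not seq:
        return ""
    # Log-probabilities avoid numerical underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

With these parameters, a GC-rich stretch embedded in AT-rich flanking sequence is labelled as coding once it is long enough to outweigh the cost of the two state transitions. In a real system the parameters would be estimated from annotated training data instead of being set by hand.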
Other Signals
Among the derived signals used for prediction are sub-sequence statistics such as k-mer frequencies, the Fourier transform of numerically coded DNA, Z-curve parameters and certain run features.[1]
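One way to apply a Fourier transform to DNA is to encode each base as a binary indicator sequence and sum the spectral power of the four sequences at a chosen period. Coding regions tend to show a peak at period 3 because of codon structure. The sketch below evaluates the power at a single period; the indicator encoding is one of several possible numerical codings, and the function name `spectral_power` is illustrative.

```python
import cmath


def spectral_power(seq, period=3.0):
    """Total spectral power of the four base-indicator sequences at the
    given period (in bases)."""
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        # Fourier coefficient of the 0/1 indicator sequence for this base.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * n / period)
            for n, b in enumerate(seq)
            if b == base
        )
        total += abs(coeff) ** 2
    return total
```

For a perfectly periodic toy sequence such as "ATG" repeated many times, the power at period 3 is far larger than at a neighbouring period such as 4. In practice the statistic is computed in sliding windows and compared against a background level.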
It has been suggested that signals other than those directly
detectable in sequences may improve gene prediction. For example, the
role of secondary structure in the identification of regulatory motifs has been reported.[2] In addition, it has been suggested that RNA secondary structure prediction helps splice site prediction.[3][4][5][6]
Comparative Genomics Approaches
As the entire genomes of many different species are sequenced, a promising direction in current research on gene finding is a comparative genomics approach. This is based on the principle that the forces of natural selection
cause genes and other functional elements to undergo mutation at a slower
rate than the rest of the genome, since mutations in functional
elements are more likely to negatively impact the organism than
mutations elsewhere. Genes can thus be detected by comparing the
genomes of related species to detect this evolutionary pressure for
conservation. This approach was first applied to the mouse and human
genomes, using programs such as SLAM, SGP and Twinscan/N-SCAN.
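The conservation signal can be illustrated with a sliding-window identity score over a pairwise alignment of two genomes. This is only a sketch of the underlying principle, assuming the two sequences have already been aligned (equal length, "-" for gaps); the function names, window size and threshold are hypothetical, and the programs named above use considerably more sophisticated models.

```python
def window_identity(aln_a, aln_b, window=30):
    """Fraction of identical, non-gap columns in each sliding window."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [int(a == b and a != "-") for a, b in zip(aln_a, aln_b)]
    return [
        sum(matches[i:i + window]) / window
        for i in range(len(matches) - window + 1)
    ]


def conserved_regions(aln_a, aln_b, window=30, min_identity=0.9):
    """Merge windows at or above the threshold into (start, end) intervals
    of alignment columns; `end` is exclusive."""
    regions = []
    for i, score in enumerate(window_identity(aln_a, aln_b, window)):
        if score >= min_identity:
            if regions and i <= regions[-1][1]:
                regions[-1][1] = i + window  # extend the current region
            else:
                regions.append([i, i + window])
    return [tuple(r) for r in regions]
```

Regions reported this way are candidates for functional elements, not confirmed genes: conserved non-coding sequence also scores highly, which is one reason dedicated comparative gene finders combine conservation with signal and content models.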
Comparative gene finding can also be used to project high quality
annotations from one genome to another. Notable examples include
Projector, GeneWise and GeneMapper. Such techniques now play a central
role in the annotation of all genomes.
References
[1] Saeys Y, Rouzé P, Van de Peer Y (2007). "In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists". Bioinformatics 23 (4): 414–420. doi:10.1093/bioinformatics/btl639. PMID 17204465.
[2] Hiller M, Pudimat R, Busch A, Backofen R (2006). "Using RNA secondary structures to guide sequence motif finding towards single-stranded regions". Nucleic Acids Res 34 (17): e117. doi:10.1093/nar/gkl544. PMID 16987907.
[3] Patterson DJ, Yasuhara K, Ruzzo WL (2002). "Pre-mRNA secondary structure prediction aids splice site prediction". Pac Symp Biocomput: 223–234. PMID 11928478.
[4] Marashi SA, Goodarzi H, Sadeghi M, Eslahchi C, Pezeshk H (2006). "Importance of RNA secondary structure information for yeast donor and acceptor splice site predictions by neural networks". Comput Biol Chem 30 (1): 50–57. doi:10.1016/j.compbiolchem.2005.10.009. PMID 16386465.
[5] Marashi SA, Eslahchi C, Pezeshk H, Sadeghi M (2006). "Impact of RNA structure on the prediction of donor and acceptor splice sites". BMC Bioinformatics 7: 297. doi:10.1186/1471-2105-7-297. PMID 16772025.
[6] Rogic S (2006). "The role of pre-mRNA secondary structure in gene splicing in Saccharomyces cerevisiae". PhD Dissertation, University of British Columbia.
This article is licensed under the GNU Free Documentation License. It uses material from Wikipedia Encyclopedia article "Gene Finding"