Sequence Alignment of DNA, RNA, and Protein
See also:
In bioinformatics, a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix.
Gaps are inserted between the residues so that residues with identical
or similar characters are aligned in successive columns.
If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels
(that is, insertion or deletion mutations) introduced in one or both
lineages in the time since they diverged from one another. In protein
sequence alignment, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif
is among lineages. The absence of substitutions, or the presence of
only very conservative substitutions (that is, the substitution of
amino acids whose side chains
have similar biochemical properties) in a particular region of the
sequence, suggest that this region has structural or functional
importance. Although DNA and RNA nucleotide bases are more similar to each other than to amino acids, the conservation of base pairing
can indicate a similar functional or structural role. Sequence
alignment can be used for non-biological sequences, such as identifying
similarities in a series of letters and words present in human language.
In business, more specifically in marketing, sequences of purchases are
also increasingly being analyzed by the same methods as in
bioinformatics.[1]
Very short or very similar sequences can be aligned by hand;
however, most interesting problems require the alignment of lengthy,
highly variable or extremely numerous sequences that cannot be aligned
solely by human effort. Instead, human knowledge is primarily applied
in constructing algorithms to produce high-quality sequence alignments,
and occasionally in adjusting the final results to reflect patterns
that are difficult to represent algorithmically (especially in the case
of nucleotide sequences). Computational approaches to sequence
alignment generally fall into two categories: global alignments and local alignments. Calculating a global alignment is a form of global optimization
that "forces" the alignment to span the entire length of all query
sequences. By contrast, local alignments identify regions of similarity
within long sequences that are often widely divergent overall. Local
alignments are often preferable, but can be more difficult to calculate
because of the additional challenge of identifying the regions of
similarity. A variety of computational algorithms have been applied to
the sequence alignment problem, including slow but formally optimizing
methods like dynamic programming and efficient heuristic or probabilistic methods designed for large-scale database search.
Representations
Alignments are commonly represented both graphically and in text
format. In almost all sequence alignment representations, sequences are
written in rows arranged so that aligned residues appear in successive
columns. In text formats, aligned columns containing identical or
similar characters are indicated with a system of conservation symbols.
As in the image above, an asterisk or pipe symbol is used to show
identity between two columns; other less common symbols include a colon
for conservative substitutions and a period for semiconservative
substitutions. Many sequence visualization programs also use color to
display information about the properties of the individual sequence
elements; in DNA and RNA sequences, this equates to assigning each
nucleotide its own color. In protein alignments, such as the one in the
image above, color is often used to indicate amino acid properties to
aid in judging the conservativeness of a given amino acid substitution. For multiple sequences the last row in each column is often the consensus sequence determined by the alignment.
Sequence alignments can be stored in a wide variety of text-based
file formats, many of which were originally developed in conjunction
with a specific alignment program or implementation. Most web-based
tools allow a number of input and output formats, such as FASTA format and GenBank
format; however, the use of specific tools authored by individual
research laboratories can be complicated by limited file format
compatibility. A general conversion program is available at READSEQ.
Global and local alignments
Illustration of global and local alignments
demonstrating the 'gappy' quality of global alignments that can occur
if sequences are insufficiently similar
Global alignments, which attempt to align every residue in every
sequence, are most useful when the sequences in the query set are
similar and of roughly equal size. (This does not mean global
alignments cannot end in gaps.) A general global alignment technique is
called the Needleman-Wunsch algorithm
and is based on dynamic programming. Local alignments are more useful
for dissimilar sequences that are suspected to contain regions of
similarity or similar sequence motifs within their larger sequence
context. The Smith-Waterman algorithm
is a general local alignment method also based on dynamic programming.
With sufficiently similar sequences, there is no difference between
local and global alignments.
Hybrid methods, known as semiglobal or "glocal" methods, attempt to
find the best possible alignment that includes the start and end of one
or the other sequence. This can be especially useful when the
downstream part of one sequence overlaps with the upstream part of the
other sequence. In this case, neither global nor local alignment is
entirely appropriate: a global alignment would attempt to force the
alignment to extend beyond the region of overlap, while a local
alignment might not fully cover the region of overlap.[2]
Pairwise alignment
Pairwise sequence alignment methods are used to find the
best-matching piecewise (local) or global alignments of two query
sequences. Pairwise alignments can only be used between two sequences
at a time, but they are efficient to calculate and are often used for
methods that do not require extreme precision (such as searching a
database for sequences with high homology
to a query). The three primary methods of producing pairwise alignments
are dot-matrix methods, dynamic programming, and word methods[3];
however, most multiple sequence alignment techniques can align only two
sequences. Although each method has its individual strengths and
weaknesses, all three methods have difficulty with highly repetitive
sequences of low information content - especially where the number of
repetitions differ in the two sequences to be aligned.
Dot-matrix methods
The dot-matrix approach, which implicitly produces a family of
alignments for individual sequence regions, is qualitative and simple,
though time-consuming to analyze on a large scale. It is very easy to
visually identify certain sequence features—such as insertions,
deletions, repeats, or inverted repeats—from
a dot-matrix plot. To construct a dot-matrix plot, the two sequences
are written along the top row and leftmost column of a two-dimensional matrix and a dot is placed at any point where the characters in the appropriate columns match—this is a typical recurrence plot.
Some implementations vary the size or intensity of the dot depending on
the degree of similarity of the two characters, to accommodate
conservative substitutions. The dot plots of very closely related
sequences will appear as a single line along the matrix's main diagonal.
Dot plots can also be used to assess repetitiveness in a single
sequence. A sequence can be plotted against itself and regions that
share significant similarities will appear as lines off the main
diagonal. This effect can occur when a protein consists of multiple
similar structural domains.
Dynamic programming
The technique of dynamic programming can be applied to produce global alignments via the Needleman-Wunsch algorithm, and local alignments via the Smith-Waterman algorithm. In typical usage, protein alignments use a substitution matrix to assign scores to amino-acid matches or mismatches, and a gap penalty
for matching an amino acid in one sequence to a gap in the other. DNA
and RNA alignments may use a scoring matrix, but in practice often
simply assign a positive match score, a negative mismatch score, and a
negative gap penalty. (In standard dynamic programming, the score of
each amino acid position is independent of the identity of its
neighbors, and therefore base stacking effects are not taken into account. However, it is possible to account for such effects by modifying the algorithm.)
Dynamic programming can be useful in aligning nucleotide to protein
sequences, a task complicated by the need to take into account frameshift
mutations (usually insertions or deletions). The framesearch method
produces a series of global or local pairwise alignments between a
query nucleotide sequence and a search set of protein sequences, or
vice versa. Although the method is very slow, its ability to evaluate
frameshifts offset by an arbitrary number of nucleotides makes the
method useful for sequences containing large numbers of indels, which
can be very difficult to align with more efficient heuristic methods.
In practice, the method requires large amounts of computing power or a
system whose architecture is specialized for dynamic programming. The BLAST and EMBOSS
suites provide basic tools for creating translated alignments (though
some of these approaches take advantage of side-effects of sequence
searching capabilities of the tools). More general methods are
available from both commercial sources, such as FrameSearch, distributed as part of the Accelrys GCG package, and Open Source software such as Genewise.
The dynamic programming method is guaranteed to find an optimal
alignment given a particular scoring function; however, identifying a
good scoring function is often an empirical rather than a theoretical
matter. Although dynamic programming is extensible to more than two
sequences, it is prohibitively slow for large numbers of or extremely
long sequences.
Word methods
Word methods, also known as k-tuple methods, are heuristic
methods that are not guaranteed to find an optimal alignment solution,
but are significantly more efficient than dynamic programming. These
methods are especially useful in large-scale database searches where it
is understood that a large proportion of the candidate sequences will
have essentially no significant match with the query sequence. Word
methods are best known for their implementation in the database search
tools FASTA and the BLAST family.[3]
Word methods identify a series of short, nonoverlapping subsequences
("words") in the query sequence that are then matched to candidate
database sequences. The relative positions of the word in the two
sequences being compared are subtracted to obtain an offset; this will
indicate a region of alignment if multiple distinct words produce the
same offset. Only if this region is detected do these methods apply
more sensitive alignment criteria; thus, many unnecessary comparisons
with sequences of no appreciable similarity are eliminated.
In the FASTA method, the user defines a value k to use as the word length with which to search the database. The method is slower but more sensitive at lower values of k,
which are also preferred for searches involving a very short query
sequence. The BLAST family of search methods provides a number of
algorithms optimized for particular types of queries, such as searching
for distantly related sequence matches. BLAST was developed to provide
a faster alternative to FASTA without sacrificing much accuracy; like
FASTA, BLAST uses a word search of length k, but evaluates only
the most significant word matches, rather than every word match as does
FASTA. Most BLAST implementations use a fixed default word length that
is optimized for the query and database type, and that is changed only
under special circumstances, such as when searching with repetitive or
very short query sequences. Implementations can be found via a number
of web portals, such as EMBL FASTA and NCBI BLAST.
The IDF method identifies weighted n-gram sequence fragments in
large genomic databases whose indexing characteristics permit the
construction of indexed, sequence retrieval programs where query
processing time is determined mainly by the size of the query and
number of sequences retrieved rather than the number of sequences
scanned. The weighting scheme is based on the inverse document
frequency (IDF) method, a weighting formula that calculates the relative importance of indexing terms based on term distribution. GPL open-source application
Multiple sequence alignment
-
Multiple sequence alignment
(MSA) is an extension of pairwise alignment to incorporate more than
two sequences at a time. Multiple alignment methods try to align all of
the sequences in a given query set. Multiple alignments are often used
in identifying conserved
sequence regions across a group of sequences hypothesized to be
evolutionarily related. Such conserved sequence motifs can be used in
conjunction with structural and mechanistic information to locate the catalytic active sites of enzymes. Alignments are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees. MSAs are computationally difficult to produce and most formulations of the problem lead to NP-complete combinatorial optimization problems.[4]
Nevertheless, the utility of these alignments in bioinformatics has led
to the development of a variety of methods suitable for aligning three
or more sequences.
Dynamic programming
The technique of dynamic programming is theoretically applicable to
any number of sequences; however, because it is computationally
expensive in both time and memory, it is rarely used for more than
three or four sequences in its most basic form. This method requires
constructing the n-dimensional equivalent of the sequence matrix formed from two sequences, where n
is the number of sequences in the query. Standard dynamic programming
is first used on all pairs of query sequences and then the "alignment
space" is filled in by considering possible matches or gaps at
intermediate positions, eventually constructing an alignment
essentially between each two-sequence alignment. Although this
technique is computationally expensive, its guarantee of a global
optimum solution is useful in cases where only a few sequences need to
be aligned accurately. One method for reducing the computational
demands of dynamic programming, which relies on the "sum of pairs" objective function, has been implemented in the MSA software package.[5]
Progressive methods
Progressive, hierarchical, or tree methods generate an MSA by first
aligning the most similar sequences and then adding successively less
related sequences or groups to the alignment until the entire query set
has been incorporated into the solution. The initial tree describing
the sequence relatedness is based on pairwise comparisons that may
include heuristic pairwise alignment methods similar to FASTA.
Progressive alignment results are dependent on the choice of "most
related" sequences and thus can be sensitive to inaccuracies in the
initial pairwise alignments. Most progressive MSA methods additionally
weight the sequences in the query set according to their relatedness,
which reduces the likelihood of making a poor choice of initial
sequences and thus improves alignment accuracy.
Many variations of the Clustal progressive implementation[6][7] are used for multiple sequence alignment, phylogenetic tree construction, and as input for protein structure prediction. A slower but more accurate variant of the progressive method is known as T-Coffee[8]; implementations can be found at ClustalW and T-Coffee.
Iterative methods
Iterative methods attempt to improve on the weak point of the
progressive methods, the heavy dependence on the accuracy of the
initial pairwise alignments. Iterative methods optimize an objective function
based on a selected alignment scoring method by assigning an initial
global alignment and then realigning sequence subsets. The realigned
subsets are then themselves aligned to produce the next iteration's
MSA. Various ways of selecting the sequence subgroups and objective
function are reviewed in [9].
Motif finding
Motif finding, also known as profile analysis, constructs global MSAs that attempt to align short conserved sequence motifs
among the sequences in the query set. This is usually done by first
constructing a general global MSA, after which the highly conserved
regions are isolated and used to construct a set of profile matrices.
The profile matrix for each conserved region is arranged like a scoring
matrix but its frequency counts for each amino acid or nucleotide at
each position are derived from the conserved region's character
distribution rather than from a more general empirical distribution.
The profile matrices are then used to search other sequences for
occurrences of the motif they characterize. In cases where the original
data set contained a small number of sequences, or only highly related
sequences, pseudocounts are added to normalize the character distributions represented in the motif.
Techniques inspired by computer science
A variety of general optimization algorithms commonly used in computer science have also been applied to the multiple sequence alignment problem. Most recently hidden Markov models have been used to produce probability scores for a family of possible MSAs for a given query set. Genetic algorithms and simulated annealing
have also been used in optimizing MSA scores as judged by a scoring
function like the sum-of-pairs method. More complete details and
software packages can be found in the main article multiple sequence alignment.
Structural alignment
-
Structural alignments, which are usually specific to protein and sometimes RNA sequences, use information about the secondary and tertiary structure
of the protein or RNA molecule to aid in aligning the sequences. These
methods can be used for two or more sequences and typically produce
local alignments; however, because they depend on the availability of
structural information, they can only be used for sequences whose
corresponding structures are known (usually through X-ray crystallography or NMR spectroscopy). Because both protein and RNA structure is more evolutionarily conserved than sequence[10],
structural alignments can be more reliable between sequences that are
very distantly related and that have diverged so extensively that
sequence comparison cannot reliably detect their similarity.
Structural alignments are used as the "gold standard" in evaluating alignments for homology-based protein structure prediction[11]
because they explicitly align regions of the protein sequence that are
structurally similar rather than relying exclusively on sequence
information. However, clearly structural alignments cannot be used in
structure prediction because at least one sequence in the query set is
the target to be modeled, for which the structure is not known. It has
been shown that, given the structural alignment between a target and a
template sequence, highly accurate models of the target protein
sequence can be produced; a major stumbling block in homology-based
structure prediction is the production of structurally accurate
alignments given only sequence information.[11]
DALI
The DALI method, or distance matrix alignment, is a fragment-based
method for constructing structural alignments based on contact
similarity patterns between successive hexapeptides in the query
sequences.[12] It can generate pairwise or multiple alignments and identify a query sequence's structural neighbors in the Protein Data Bank (PDB). It has been used to construct the FSSP
structural alignment database (Fold classification based on
Structure-Structure alignment of Proteins, or Families of Structurally
Similar Proteins). A DALI webserver can be accessed at EBI DALI and the FSSP is located at The Dali Database.
SSAP
SSAP (sequential structure alignment program) is a dynamic
programming-based method of structural alignment that uses atom-to-atom
vectors in structure space as comparison points. It has been extended
since its original description to include multiple as well as pairwise
alignments[13], and has been used in the construction of the CATH (Class, Architecture, Topology, Homology) hierarchical database classification of protein folds.[14] The CATH database can be accessed at CATH Protein Structure Classification.
Combinatorial extension
The combinatorial extension (CE) method of structural alignment
generates a pairwise structural alignment by using local geometry to
align short fragments of the two proteins being analyzed and then
assembles these fragments into a larger alignment[15]. Based on measures such as rigid-body root mean square distance, residue distances, local secondary structure, and surrounding environmental features such as residue neighbor hydrophobicity,
local alignments called "aligned fragment pairs" (AFPs) are generated
and used to build a similarity matrix representing all possible
structural alignments within predefined cutoff criteria. A path from
one protein structure state to the other is then traced through the
matrix by extending the growing alignment one fragment at a time. The
optimal such path defines the CE alignment. A web-based server
implementing the method and providing a database of pairwise alignments
of structures in the PDB is located at the Combinatorial Extension website.
Phylogenetic analysis
-
Phylogenetics and sequence alignment are closely related fields due
to the shared necessity of evaluating sequence relatedness. The field
of phylogenetics makes extensive use of sequence alignments in the construction and interpretation of phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes
of divergent species. The degree to which sequences in a query set
differ is qualitatively related to the sequences' evolutionary distance
from one another. Roughly speaking, high sequence homology suggests
that the sequences in question have a comparatively young most recent common ancestor, while low homology suggests that the divergence is more ancient. This approximation, which reflects the "molecular clock"
hypothesis that a roughly constant rate of evolutionary change can be
used to extrapolate the elapsed time since two genes first diverged
(that is, the coalescence time), assumes that the effects of mutation and selection
are constant across sequence lineages. Therefore it does not account
for possible difference among organisms or species in the rates of DNA repair
or the possible functional conservation of specific regions in a
sequence. (In the case of nucleotide sequences, the molecular clock
hypothesis in its most basic form also discounts the difference in
acceptance rates between silent mutations that do not alter the meaning of a given codon and other mutations that result in a different amino acid
being incorporated into the protein.) More statistically accurate
methods allow the evolutionary rate on each branch of the phylogenetic
tree to vary, thus producing better estimates of coalescence times for
genes.
Progressive multiple alignment techniques produce a phylogenetic
tree by necessity because they incorporate sequences into the growing
alignment in order of relatedness. Other techniques that assemble MSAs
and phylogenetic trees score and sort trees first and calculate an MSA
from the highest-scoring tree. Commonly used methods of phylogenetic
tree construction are mainly heuristic because the problem of selecting the optimal tree, like the problem of selecting the optimal MSA, is NP-hard.[16]
Assessment of significance
Sequence alignments are useful in bioinformatics for identifying
sequence similarity, producing phylogenetic trees, and developing
homology models of protein structures. However, the biological
relevance of sequence alignments is not always clear. Alignments are
often assumed to reflect a degree of evolutionary change between
sequences descended from a common ancestor; however, it is formally
possible that convergent evolution
can occur to produce apparent similarity between proteins that are
evolutionarily unrelated but perform similar functions and have similar
structures.
In database searches such as BLAST, statistical methods can
determine the likelihood of a particular alignment between sequences or
sequence regions arising by chance given the size and composition of
the database being searched. These values can vary significantly
depending on the search space. In particular, the likelihood of finding
a given alignment by chance increases if the database consists only of
sequences from the same organism as the query sequence. Repetitive
sequences in the database or query can also distort both the search
results and the assessment of statistical significance; BLAST
automatically filters such repetitive sequences in the query to avoid
apparent hits that are statistical artifacts.
Scoring functions
The choice of a scoring function that reflects biological or
statistical observations about known sequences is important to
producing good alignments. Protein sequences are frequently aligned
using substitution matrices that reflect the probabilities of given character-to-character substitutions. A series of matrices called PAM matrices (Percent Accepted Mutation matrices, originally defined by Margaret Dayhoff
and sometimes referred to as "Dayhoff matrices") explicitly encode
evolutionary approximations regarding the rates and probabilities of
particular amino acid mutations. Another common series of scoring
matrices, known as BLOSUM
(Blocks Substitution Matrix), encodes empirically derived substitution
probabilities. Variants of both types of matrices are used to detect
sequences with differing levels of divergence, thus allowing users of
BLAST or FASTA to restrict searches to more closely related matches or
expand to detect more divergent sequences. Gap penalties
account for the introduction of a gap - on the evolutionary model, an
insertion or deletion mutation - in both nucleotide and protein
sequences, and therefore the penalty values should be proportional to
the expected rate of such mutations. The quality of the alignments
produced therefore depends on the quality of the scoring function.
It can be very useful and instructive to try the same alignment
several times with different choices for scoring matrix and/or gap
penalty values and compare the results. By noting which regions look
similar no matter what settings are used, and which look different, one
can get an excellent sense of where the algorithm had difficulty
finding a robust solution.
Software
-
Common software tools used for general sequence alignment tasks include ClustalW and T-coffee for alignment, and BLAST for database searching. A more complete list of available software categorized by algorithm and alignment type is available at sequence alignment software.
Alignment algorithms and software can be directly compared to one another using a standardized set of benchmark reference multiple sequence alignments known as BAliBASE.[17]
The dataset consists of structural alignments, which can be considered
a standard against which purely sequence-based methods are compared.
The relative performance of many common alignment methods on frequently
encountered alignment problems has been tabulated and selected results
published online at BAliBASE.[18]
See also
References
- ^ Prinzie A., D. Van den Poel (2006), Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM. Decision Support Systems 42 (2): 508-526.
- ^
Brudno M, Malde S, Poliakov A, Do CB, Couronne O, Dubchak I, Batzoglou
S. (2003). Glocal alignment: finding rearrangements during alignment. Bioinformatics 19 Suppl 1:i54–62.
- ^ a b Mount DM. (2004). Bioinformatics: Sequence and Genome Analysis 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.
- ^ Wang L, Jiang T. (1994) On the complexity of multiple sequence alignment. J Comput Biol 1:337-348.
- ^ Lipman DJ, Altschul SF, Kececioglu JD. (1989). A tool for multiple sequence alignment. Proc Natl Acad Sci USA 86:4412-4415.
- ^ Higgins DG, Sharp PM. (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73(1):237-44.
- ^
Thompson JD, Higgins DG, Gibson TJ. (1994). CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence
weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-4680.
- ^ Notredame C, Higgins DG, Heringa J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205-17.
- ^
Hirosawa M, Totoki Y, Hoshida M, Ishikawa M. (1995). Comprehensive
study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci 11:13-18.
- ^ Chothia C, Lesk AM. (1986). The relation between the divergence of sequence and structure in proteins. EMBO J 5(4):823-6.
- ^ a b Zhang Y, Skolnick J. (2005). The protein structure prediction problem could be solved using the current PDB library. Proc Natl Acad Sci USA 102(4):1029-34.
- ^ Holm L, Sander C (1996). Mapping the protein universe. Science 273: 595-603.
- ^ Taylor WR, Flores TP, Orengo CA. (1994). Multiple protein structure alignment. Protein Sci 3(10):1858-70.
- ^
Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM.
(1997) CATH: A hierarchical classification of protein domain
structures. Structure. 5(8): 1093-1108.
- ^ Shindyalov IN, Bourne PE (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 11(9) 739-747.
- ^ Felsenstein J. (2004). Inferring Phylogenies Sinauer Associates: Sunderland, MA.
- ^
Thompson JD, Plewniak F, Poch O. (1999). BAliBASE: a benchmark
alignment database for the evaluation of multiple alignment programs. Bioinformatics 15: 87-88.
- ^ Thompson JD, Plewniak F, Poch O. (1999). A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res 7(13):2682-90.
This article is licensed under the GNU Free Documentation License. It uses material from Wikipedia Encyclopedia article "Sequence Alignment"
|