Multiple Sequence Alignment
See also:
Alignment of 27 avian influenza hemagglutinin protein sequences colored by residue conservation (top) and residue properties (bottom)
Multiple sequence alignment
is an extension of pairwise alignment to incorporate more than two
sequences at a time. Multiple alignment methods try to align all of the
sequences in a given query set. Multiple alignments are often used in
identifying conserved
sequence regions across a group of sequences hypothesized to be
evolutionarily related. Such conserved sequence motifs can be used in
conjunction with structural and mechanistic information to locate the catalytic active sites of enzymes. Alignments are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees. Multiple sequence alignments are computationally difficult to produce and most formulations of the problem lead to NP-complete combinatorial optimization problems.[4]
Nevertheless, the utility of these alignments in bioinformatics has led
to the development of a variety of methods suitable for aligning three
or more sequences.
Dynamic programming
The technique of dynamic programming is theoretically applicable to
any number of sequences; however, because it is computationally
expensive in both time and memory, it is rarely used for more than three or four sequences in its most basic form. This method requires constructing the n-dimensional equivalent of the sequence matrix formed from two sequences, where n
is the number of sequences in the query. Standard dynamic programming
is first used on all pairs of query sequences and then the "alignment
space" is filled in by considering possible matches or gaps at
intermediate positions, eventually constructing an alignment
essentially between each two-sequence alignment. Although this
technique is computationally expensive, its guarantee of a global
optimum solution is useful in cases where only a few sequences need to
be aligned accurately. One method for reducing the computational
demands of dynamic programming, which relies on the "sum of pairs" objective function, has been implemented in the MSA software package.[5]
Progressive methods
Progressive, hierarchical, or tree methods generate a multiple
sequence alignment by first aligning the most similar sequences and
then adding successively less related sequences or groups to the
alignment until the entire query set has been incorporated into the
solution. The initial tree describing the sequence relatedness is based
on pairwise comparisons that may include heuristic pairwise alignment
methods similar to FASTA.
Progressive alignment results are dependent on the choice of "most
related" sequences and thus can be sensitive to inaccuracies in the
initial pairwise alignments. Most progressive multiple sequence
alignment methods additionally weight the sequences in the query set
according to their relatedness, which reduces the likelihood of making
a poor choice of initial sequences and thus improves alignment accuracy.
Many variations of the Clustal progressive implementation[6][7][8] are used for multiple sequence alignment, phylogenetic tree construction, and as input for protein structure prediction. A slower but more accurate variant of the progressive method is known as T-Coffee;[9] implementations can be found at ClustalW and T-Coffee.
Iterative methods
Iterative methods attempt to improve on the weak point of the
progressive methods, the heavy dependence on the accuracy of the
initial pairwise alignments. Iterative methods optimize an objective function
based on a selected alignment scoring method by assigning an initial
global alignment and then realigning sequence subsets. The realigned
subsets are then themselves aligned to produce the next iteration's
multiple sequence alignment. Various ways of selecting the sequence
subgroups and objective function are reviewed in.[10]
Motif finding
Motif finding, also known as profile analysis, constructs global
multiple sequence alignments that attempt to align short conserved sequence motifs
among the sequences in the query set. This is usually done by first
constructing a general global multiple sequence alignment, after which
the highly conserved
regions are isolated and used to construct a set of profile matrices.
The profile matrix for each conserved region is arranged like a scoring
matrix but its frequency counts for each amino acid or nucleotide at
each position are derived from the conserved region's character
distribution rather than from a more general empirical distribution.
The profile matrices are then used to search other sequences for
occurrences of the motif they characterize. In cases where the original
data set contained a small number of sequences, or only highly related sequences, pseudocounts are added to normalize the character distributions represented in the motif.
Techniques inspired by computer science
A variety of general optimization algorithms commonly used in computer science have also been applied to the multiple sequence alignment problem. Hidden Markov models
have been used to produce probability scores for a family of possible
multiple sequence alignments for a given query set; although early
HMM-based methods produced underwhelming performance, later
applications have found them especially effective in detecting remotely
related sequences because they are less susceptible to noise created by
conservative or semiconservative substitutions.[11] Genetic algorithms and simulated annealing
have also been used in optimizing multiple sequence alignment scores as
judged by a scoring function like the sum-of-pairs method. More
complete details and software packages can be found in the main article
multiple sequence alignment.
This article is licensed under the GNU Free Documentation License. It uses material from Wikipedia Encyclopedia article "Multiple Sequence Alignment"
|