Pairwise Alignment
See also:
Pairwise sequence alignment methods are used to find the
best-matching piecewise (local) or global alignments of two query
sequences. Pairwise alignments can only be used between two sequences
at a time, but they are efficient to calculate and are often used for
methods that do not require extreme precision (such as searching a
database for sequences with high homology
to a query). The three primary methods of producing pairwise alignments
are dot-matrix methods, dynamic programming, and word methods;[3]
however, multiple sequence alignment techniques can also align pairs of
sequences. Although each method has its individual strengths and
weaknesses, all three pairwise methods have difficulty with highly
repetitive sequences of low information content
- especially where the number of repetitions differ in the two
sequences to be aligned. One way of quantifying the utility of a given
pairwise alignment is the 'maximum unique match', or the longest
subsequence that occurs in both query sequence. Longer MUM sequences
typically reflect closer relatedness.
Dot-matrix methods
A DNA dot plot of a human zinc finger transcription factor (GenBank ID NM_002383), showing regional self-similarity.
The main diagonal represents the sequence's alignment with itself;
lines off the main diagonal represent similar or repetitive patterns
within the sequence. This is a typical example of a recurrence plot.
The dot-matrix approach, which implicitly produces a family of
alignments for individual sequence regions, is qualitative and simple,
though time-consuming to analyze on a large scale. It is very easy to
visually identify certain sequence features—such as insertions,
deletions, repeats, or inverted repeats—from
a dot-matrix plot. To construct a dot-matrix plot, the two sequences
are written along the top row and leftmost column of a two-dimensional matrix and a dot is placed at any point where the characters in the appropriate columns match—this is a typical recurrence plot.
Some implementations vary the size or intensity of the dot depending on
the degree of similarity of the two characters, to accommodate
conservative substitutions. The dot plots of very closely related
sequences will appear as a single line along the matrix's main diagonal.
Dot plots can also be used to assess repetitiveness in a single
sequence. A sequence can be plotted against itself and regions that
share significant similarities will appear as lines off the main
diagonal. This effect can occur when a protein consists of multiple
similar structural domains.
Dynamic programming
The technique of dynamic programming can be applied to produce global alignments via the Needleman-Wunsch algorithm, and local alignments via the Smith-Waterman algorithm. In typical usage, protein alignments use a substitution matrix to assign scores to amino-acid matches or mismatches, and a gap penalty
for matching an amino acid in one sequence to a gap in the other. DNA
and RNA alignments may use a scoring matrix, but in practice often
simply assign a positive match score, a negative mismatch score, and a
negative gap penalty. (In standard dynamic programming, the score of
each amino acid position is independent of the identity of its
neighbors, and therefore base stacking effects are not taken into account. However, it is possible to account for such effects by modifying the algorithm.)
Dynamic programming can be useful in aligning nucleotide to protein
sequences, a task complicated by the need to take into account frameshift
mutations (usually insertions or deletions). The framesearch method
produces a series of global or local pairwise alignments between a
query nucleotide sequence and a search set of protein sequences, or
vice versa. Although the method is very slow, its ability to evaluate
frameshifts offset by an arbitrary number of nucleotides makes the
method useful for sequences containing large numbers of indels, which
can be very difficult to align with more efficient heuristic methods.
In practice, the method requires large amounts of computing power or a
system whose architecture is specialized for dynamic programming. The BLAST and EMBOSS
suites provide basic tools for creating translated alignments (though
some of these approaches take advantage of side-effects of sequence
searching capabilities of the tools). More general methods are
available from both commercial sources, such as FrameSearch, distributed as part of the Accelrys GCG package, and Open Source software such as Genewise.
The dynamic programming method is guaranteed to find an optimal
alignment given a particular scoring function; however, identifying a
good scoring function is often an empirical rather than a theoretical
matter. Although dynamic programming is extensible to more than two
sequences, it is prohibitively slow for large numbers of or extremely
long sequences.
Word methods
Word methods, also known as k-tuple methods, are heuristic
methods that are not guaranteed to find an optimal alignment solution,
but are significantly more efficient than dynamic programming. These
methods are especially useful in large-scale database searches where it
is understood that a large proportion of the candidate sequences will
have essentially no significant match with the query sequence. Word
methods are best known for their implementation in the database search
tools FASTA and the BLAST family.[3]
Word methods identify a series of short, nonoverlapping subsequences
("words") in the query sequence that are then matched to candidate
database sequences. The relative positions of the word in the two
sequences being compared are subtracted to obtain an offset; this will
indicate a region of alignment if multiple distinct words produce the
same offset. Only if this region is detected do these methods apply
more sensitive alignment criteria; thus, many unnecessary comparisons
with sequences of no appreciable similarity are eliminated.
In the FASTA method, the user defines a value k to use as the word length with which to search the database. The method is slower but more sensitive at lower values of k,
which are also preferred for searches involving a very short query
sequence. The BLAST family of search methods provides a number of
algorithms optimized for particular types of queries, such as searching
for distantly related sequence matches. BLAST was developed to provide
a faster alternative to FASTA without sacrificing much accuracy; like
FASTA, BLAST uses a word search of length k, but evaluates only
the most significant word matches, rather than every word match as does
FASTA. Most BLAST implementations use a fixed default word length that
is optimized for the query and database type, and that is changed only
under special circumstances, such as when searching with repetitive or
very short query sequences. Implementations can be found via a number
of web portals, such as EMBL FASTA and NCBI BLAST.
This article is licensed under the GNU Free Documentation License. It uses material from Wikipedia Encyclopedia article "Sequence Alignment"
|
|