Viterbi Algorithm & Viterbi Decoder
The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states – called the Viterbi path – that results in a sequence of observed events, especially in the context of hidden Markov models. The forward algorithm
is a closely related algorithm for computing the probability of a
sequence of observed events. These algorithms form a subset of information theory.
A viterbi decoder (see below) uses the Viterbi algorithm for decoding a bitstream that has been encoded using Forward error correction based on a Convolutional code.
The algorithm makes a number of assumptions. First, both the
observed events and hidden events must be in a sequence. This sequence
often corresponds to time. Second, these two sequences need to be
aligned, and an observed event needs to correspond to exactly one
hidden event. Third, computing the most likely hidden sequence up to a
certain point t must depend only on the observed event at point t, and the most likely sequence at point t − 1. These assumptions are all satisfied in a first-order hidden Markov model.
The terms "Viterbi path" and "Viterbi algorithm" are also applied to
related dynamic programming algorithms that discover the single most
likely explanation for an observation. For example, in stochastic parsing
a dynamic programming algorithm can be used to discover the single most
likely context-free derivation (parse) of a string, which is sometimes
called the "Viterbi parse".
The Viterbi algorithm was conceived by Andrew Viterbi as an error-correction scheme for noisy digital communication links, finding universal application in decoding the convolutional codes used in both CDMA and GSM digital cellular, dial-up modems, satellite, deep-space communications, and 802.11 wireless LANs. It is now also commonly used in speech recognition, keyword spotting, computational linguistics, and bioinformatics. For example, in speech-to-text
(speech recognition), the acoustic signal is treated as the observed
sequence of events, and a string of text is considered to be the
"hidden cause" of the acoustic signal. The Viterbi algorithm finds the
most likely string of text given the acoustic signal.
Overview
First, we elaborate on the assumptions listed above. The Viterbi
algorithm operates on a state machine assumption. That is, at any
system we are modeling is in some state. There are a finite number of
states, however large, that can be list "survivor path". This is a
fundamental assumption of the algorithm because the algorithm will
examine all possible paths leading to a state and only keep the one
most likely. This way the algorithm does not have to keep track of all
possible paths, only one per state.
A second key assumption is that a transition from a previous state
to a new state is marked by an incremental metric, usually a number.
This transition is computed from the event. The third key assumption is
that the events are cumulative over a path in some sense, usually
additive. So the crux of the algorithm is to keep a number for each
state. When an event occurs, the algorithm examines moving forward to a
new set of states by combining the metric of a possible previous state
with the incremental metric of the transition due to the event and
chooses the best. The incremental metric associated with an event
depends on the transition possibility from the old state to the new
state. For example in data communications, it may be possible to only
transmit half the symbols from an odd numbered state and the other half
from an even numbered state. Additionally, in many cases the state
transition graph is not fully connected. A simple example is a car that
has 3 states — forward, stop and reverse — and a transition from
forward to reverse is not allowed. It must first enter the stop state.
After computing the combinations of incremental metric and state
metric, only the best survives and all other paths are discarded. There
are modifications to the basic algorithm which allow for a forward
search in addition to the backwards one described here.
Path history must be stored. In some cases, the search history is
finite because the state machine at the encoder starts in a known state
and there is sufficient memory to keep all the paths. In other cases, a
programmatic solution must be found for limited resources: one example
is convolutional encoding, where the decoder must truncate the history
at a depth large enough to keep performance to an acceptable level.
Although the Viterbi algorithm is very efficient and there are
modifications that reduce the computational load, the memory
requirements tend to remain constant.
A concrete example
Assume you have a friend who lives far away and to whom you talk
daily over the telephone about what he did that day. Your friend is
only interested in three activities: walking in the park, shopping, and
cleaning his apartment. The choice of what to do is determined
exclusively by the weather on a given day. You have no definite
information about the weather where your friend lives, but you know
general trends. Based on what he tells you he did each day, you try to
guess what the weather must have been like.
You believe that the weather operates as a discrete Markov chain. There are two states, "Rainy" and "Sunny", but you cannot observe them directly, that is, they are hidden
from you. On each day, there is a certain chance that your friend will
perform one of the following activities, depending on the weather:
"walk", "shop", or "clean". Since your friend tells you about his
activities, those are the observations. The entire system is that of a hidden Markov model (HMM).
You know the general weather trends in the area, and what your
friend likes to do on average. In other words, the parameters of the
HMM are known. You can write them down in the Python programming language:
states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
'Rainy' : {'Rainy': 0.7, 'Sunny': 0.3},
'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6},
}
emission_probability = {
'Rainy' : {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
'Sunny' : {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}
In this piece of code, start_probability represents
your belief about which state the HMM is in when your friend first
calls you (all you know is that it tends to be rainy on average). The
particular probability distribution used here is not the equilibrium
one, which is (given the transition probabilities) actually
approximately {'Rainy': 0.571, 'Sunny': 0.429}. The transition_probability
represents the change of the weather in the underlying Markov chain. In
this example, there is only a 30% chance that tomorrow will be sunny if
today is rainy. The emission_probability represents how
likely your friend is to perform a certain activity on each day. If it
is rainy, there is a 50% chance that he is cleaning his apartment; if
it is sunny, there is a 60% chance that he is outside for a walk.
You talk to your friend three days in a row and discover that on the
first day he went for a walk, on the second day he went shopping, and
on the third day he cleaned his apartment. You have two questions: What
is the overall probability of this sequence of observations? And what
is the most likely sequence of rainy/sunny days that would explain
these observations? The first question is answered by the forward
algorithm; the second question is answered by the Viterbi algorithm.
These two algorithms are structurally so similar (in fact, they are
both instances of the same abstract algorithm) that they can be
implemented in a single function:
def forward_viterbi(y, X, sp, tp, ep):
T = {}
for state in X:
## prob. V. path V. prob.
T[state] = (sp[state], [state], sp[state])
for output in y:
U = {}
for next_state in X:
total = 0
argmax = None
valmax = 0
for source_state in X:
(prob, v_path, v_prob) = T[source_state]
p = ep[source_state][output] * tp[source_state][next_state]
prob *= p
v_prob *= p
total += prob
if v_prob > valmax:
argmax = v_path + [next_state]
valmax = v_prob
U[next_state] = (total, argmax, valmax)
T = U
## apply sum/max to the final states:
total = 0
argmax = None
valmax = 0
for state in X:
(prob, v_path, v_prob) = T[state]
total += prob
if v_prob > valmax:
argmax = v_path
valmax = v_prob
return (total, argmax, valmax)
The function forward_viterbi takes the following arguments: y is the sequence of observations, e.g. ['walk', 'shop', 'clean']; X is the set of hidden states; sp is the start probability; tp are the transition probabilities; and ep are the emission probabilities.
The algorithm works on the mappings T and U. Each is a mapping from a state to a triple (prob, v_path, v_prob), where prob is the total probability of all paths from the start to the current state, v_path is the Viterbi path up to the current state, and v_prob is the probability of the Viterbi path up to the current state. The mapping T holds this information for a given point t in time, and the main loop constructs U, which holds similar information for time t+1. Because of the Markov property, information about any point in time prior to t is not needed.
The algorithm begins by initializing T to the start
probabilities: the total probability for a state is just the start
probability of that state; and the Viterbi path to a start state is the
singleton path consisting only of that state; the probability of the
Viterbi path is the same as the start probability.
The main loop considers the observations from y in sequence. It is loop invariant because T
contains the correct information up to but excluding the point in time
of the current observation. The algorithm then computes the triple (prob, v_path, v_prob) for each possible next state. The total probability of a given next state, total
is obtained by adding up the probabilities of all paths reaching that
state. More precisely, the algorithm iterates over all possible source
states. For each source state, T holds the total
probability of all paths to that state. This probability is then
multiplied by the emission probability of the current observation and
the transition probability from the source state to the next state. The
resulting probability prob is then added to total.
The probability of the Viterbi path is computed in a similar fashion,
but instead of adding across all paths one performs a discrete
maximization. Initially the maximum value valmax is zero.
For each source state, the probability of the Viterbi path to that
state is known. This too is multiplied with the emission and transition
probabilities and replaces valmax if it is greater than its current value. The Viterbi path itself is computed as the corresponding argmax of that maximization, by extending the Viterbi path that leads to the current state with the next state. The triple (prob, v_path, v_prob) computed in this fashion is stored in U and once U has been computed for all possible next states, it replaces T, thus ensuring that the loop invariant holds at the end of the iteration.
In the end another summation/maximization is performed (this could
also be done inside the main loop by adding a pseudo-observation after
the last real observation).
In the running example, the forward/Viterbi algorithm is used as follows:
def example():
return forward_viterbi(observations,
states,
start_probability,
transition_probability,
emission_probability)
print example()
This reveals that the total probability of ['walk', 'shop', 'clean'] is 0.033612 and that the Viterbi path is ['Sunny', 'Rainy', 'Rainy', 'Rainy'].
The Viterbi path contains four states because the third observation was
generated by the third state and a transition to the fourth state. In
other words, given the observed activities, it was most likely sunny
when your friend went for a walk and then it started to rain the next
day and kept on raining.
When implementing this algorithm, it should be noted that many languages use Floating Point
arithmetic - as p is small, this may lead to underflow in the results.
A common technique to avoid this is to take the logarithm of the
probabilities and use it throughout the computation, the same technique
used in the Logarithmic Number System. Once the algorithm has terminated, an accurate value can be obtained by performing the appropriate exponentiation.
Extensions
With the algorithm called iterative Viterbi decoding
one can find the subsequence of an observation that matches best (on
average) to a given HMM. Iterative Viterbi decoding works by
iteratively invoking a modified Viterbi algorithm, reestimating the
score for a filler until convergence.
An alternate algorithm, the Lazy Viterbi algorithm,
has been proposed recently. This works by not expanding any nodes until
it really needs to, and usually manages to get away with doing a lot
less work (in software) than the ordinary Viterbi algorithm for the
same result - however, it is not so easy to parallelize in hardware.
See also
References
- J Feldman, I Abou-Faycal and M Frigo. A Fast Maximum-Likelihood Decoder for Convolutional Codes.
External links
Viterbi Decoder
A viterbi decoder uses the Viterbi algorithm for decoding a bitstream that has been encoded using Forward error correction based on a Convolutional code.
There are other algorithms for decoding a convolutionally encoded stream (for example, the Fano algorithm). The Viterbi algorithm is the most resource-consuming, but it does the maximum likelihood
decoding. It is most often used for decoding convolutional codes with
constraint lengths k<=10, but values up to k=15 are used in practice.
Viterbi decoding was developed by Andrew J. Viterbi and published in the paper "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm", IEEE Transactions on Information Theory, Volume IT-13, pages 260-269, in April, 1967.
There are both hardware (in modems) and software implementations of a Viterbi decoder.
Hardware implementation
A common way to implement a hardware viterbi decoder
A hardware viterbi decoder for basic (not perforated) code usually consists of the following major blocks:
- Branch metric unit (BMU)
- Path metric unit (PMU)
- Traceback unit (TBU)
Branch metric unit (BMU)
A sample implementation of a branch metric unit
A branch metric unit's function is to calculate branch metrics, which are normed distances between every possible symbol in the code alphabet, and the received symbol.
There are hard decision and soft decision viterbi decoders. The
decoder of the first type receives a simple bitstream on its input. In
this case a Hamming distance is used as a metric. The second-type decoder receives a bitstream containing information about the reliability of each received symbol. Usually this information is encoded as follows (the table is shown for 3-bit input):
| value |
meaning |
| 000 |
strongest 0 |
| 001 |
relatively strong 0 |
| 010 |
relatively weak 0 |
| 011 |
weakest 0 |
| 100 |
weakest 1 |
| 101 |
relatively weak 1 |
| 110 |
relatively strong 1 |
| 111 |
strongest 1 |
Of course, it is not the only way to encode reliablility data.
The squared Euclidean distance is used as a metric for soft decision decoders.
Path metric unit (PMU)
A sample implementation of a path metric unit for a specific K=4 decoder
A path metric unit summarizes branch metrics to get metrics for 2K − 1 paths, one of which can eventually be chosen as optimal. Every clock it makes 2K − 1
decisions, throwing off wittingly nonoptimal paths. The results of
these decisions are written to the memory of a traceback unit.
The core elements of a PMU are ACS (Add-Compare-Select) units. The way in which they are connected between themselves is defined by a specific code's trellis diagram.
Since branch metrics are always ,
there must be an additional circuit preventing metric counters from
overflow (it isn't shown on the image). An alternate method that
eliminates the need to monitor the path metric growth is to allow the
path metrics to "roll over", to use this method it is necessary to make
sure the path metric accumulators contain enough bits to prevent the
"best" and "worst" values from coming within 2(n-1) of each other. The compare circuit is essentially unchanged.
A sample implementation of an ACS unit
It is possible to monitor the noise level on the incoming bit stream
by monitoring the rate of growth of the "best" path metric, a simpler
way to do this is to monitor a single location or "state" and watch it
pass "upward" through say four discrete levels within the range of the
accumulator. As it passes upward through each of these thresholds, a
counter is incremented that reflects the "noise" present on the
incoming signal.
Traceback unit (TBU)
A sample implementation of a traceback unit
Back-trace unit restores an (almost) maximum-likelihood path from
the decisions made by PMU. Since it does it in inverse direction, a
viterbi decoder comprises a FILO (first-in-last-out) buffer to
reconstruct a correct order.
Note that the implementation shown on the image requires double
frequency. There are some tricks that eliminate this requirement.
Implementation issues
Euclidean metric computation
The squared norm distance (l2) distance between
the received and the actual symbols in the code alphabet may be further
simplified into a linear sum/difference form, which makes it less
computationally intensive.
Consider a 1/2 convolutional coder, which generates 2 bits (00, 01, 10 or 11) for every input bit (1 or 0). These Return-to-Zero signals are translated into a Non-Return-to-Zero form shown alongside.
| code alphabet |
vector mapping |
| 00 |
1, 1 |
| 01 |
1, -1 |
| 10 |
-1, 1 |
| 11 |
-1, -1 |
Each received symbol may be represented in vector form as vr = {r0, r1}, where r0 and r1 are soft decision values, whose magnitudes signify the joint reliability of the received vector, vr.
Every symbol in the code alphabet may, likewise, be represented by the vector vi = {±1, ±1}.
The actual computation of the Euclidean distance metric is:
D = ||vr - vi||2
= ||vr||2 - 2 (vr . vi) + ||vi||2
Each square term is a normed distance, depicting the energy of the symbol. For ex., the energy of the symbol vi = {±1, ±1} may be computed as
||vi||2 = (±1)2 + (±1)2 = 2
Thus, the energy term of all symbols in the code alphabet is constant (at (normalized) value 2).
The Add-Compare-Select (ACS) operation compares the metric distance between the received symbol ||vr|| and any 2 sumbols in the code alphabet whose paths merge at a node in the corresponding trellis, ||vi(0)|| and ||vi(1)||. This is equivalent to comparing
D0 = ||vr||2 - 2 (vr . vi(0)) + ||vi(0)||2
and
D1 = ||vr||2 - 2 (vr . vi(1)) + ||vi(1)||2
But, from above we know that the energy of vi is constant (equal to (normalized) value of 2), and the energy of vr is the same in both cases. This reduces the comparison to a minima function between the 2 (middle) dot product terms,
min((- 2 (vr . vi(0)), (- 2 (vr . vi(1)))
= max((vr . vi(0)), (vr . vi(1)))
since a min operation on negative numbers may be interpreted as an equivalent max operation on positive quantities.
Each dot product term may be expanded as
max(({r0, r1} . {±1, ±1}), ({r0, r1} . {±1, ±1}))
= max(± r0 ± r1, ± r0 ± r1)
where, the signs of each term depend on symbols, vi(0) and vi(1), being compared. Thus, the squared Euclidean metric distance calculation to compute the branch metric may be performed with a simple add/subtract operation.
Traceback
The general approach to traceback is to accumulate path metrics for up to five times the constraint length (5 * (K - 1)), find the node with the largest accumulated cost, and begin traceback from this node.
However, computing the node which has accumulated the largest cost
(either the largest or smallest integral path metric) involves finding
the maxima or minima of several (usually 2K-1) numbers, which may be time consuming when implemented on embedded hardware systems.
Most communication systems employing Viterbi decoding involving data packets of fixed sizes, with a fixed bit/byte pattern either at the beginning or/and at the end of the data packet. By using the known bit/byte
pattern as reference, the start node may be set to a fixed value,
thereby obtaining a perfect Maximum Likelihood Path during traceback.
Limitations
Generally speaking, the viterbi decoder doesn't produce an exact maximum-likelihood stream due to these limitations:
- Quantizing of an input signal, branch and path metrics.
- Limited traceback length.
Perforated codes
A hardware viterbi decoder of perforated codes is commonly implemented in such a way:
- A deperforator, which transforms the input stream into the stream
which looks like an original (imperforated) stream with ERASE marks at
the places where bits were erased.
- A basic viterbi decoder understanding these ERASE marks (that is, not using them for branch metric calculation).
Software implementation
See Viterbi algorithm.
One of the most time-consuming operations is an ACS butterfly, which is usually implemented using an assembly language and appropriate instruction set extensions (such as SSE2) to speed up the decoding time.
Applications
The Viterbi decoding algorithm is widely used in the following areas:
External links
- [1] Details on Viterbi decoding, as well as a bibliography.
- [2] r=1/6 k=15 coding for the Cassini mission to Saturn
- [3] GPL Viterbi decoder software
This article is licensed under the GNU Free Documentation License. It uses material from Wikipedia Encyclopedia article "Viterbi Algorithm"
|