High Speed ACSU Architecture for Viterbi Decoder Using T-Algorithm

Atish A. Peshattiwar

1Department of Electronics Engineering, Yeshwantrao Chavan College of Engineering, Nagpur, India
atishp32@gmail.com

Abstract—In this paper, we propose an efficient architecture based on pre-computation for Viterbi decoders incorporating T-algorithm. Through optimization at both algorithm level and architecture level, the new architecture greatly shortens the long critical path introduced by the conventional T-algorithm. The design example provided in this work demonstrates more than twice improvement in clock speed with negligible computation overhead while maintaining decoding performance.

I. INTRODUCTION

A typical functional diagram of the corresponding Viterbi decoder (VD) is shown in Fig. 1. First, branch metrics (BMs) are calculated from the received symbols. Then, BMs are fed into the add-compare-select unit (ACSU) that recursively computes the path metrics (PMs) and outputs decision bits for each possible state transition. After that, the decision bits are stored in and retrieved from the survivor-path memory unit (SMU) in order to decode the source bits along the final survivor path. The PMs of the current iteration are stored in the PM unit (PMU) and read out for use in the next iteration.

The algorithm compares the differences with T; only the states with a difference less than T survive and are used for the calculation in the next cycle. Since the process involves the searching of the optimal PM in the ACS loop, clock speed of the entire design will decrease. To achieve high speed, it is possible to implement a 2(k-1) inputs comparator with fully parallel architecture. However, it will cause significant hardware overhead, which conflicts with the design goal of less computation and low power consumption. Several works [4]-[6] have proposed new schemes for high speed T-algorithm implementation. All these schemes use an estimated (or approximated) optimal PM derived from the optimal BM instead of finding the accurate value in each cycle. Also, in these methods, compensation schemes are introduced to ensure that the estimated value is not drifting too far away from the accurate one.

In this paper, we propose a new architecture for the Viterbi decoder using T-algorithm. Unlike existing works such as [4]-[6], an accurate optimal PM is guaranteed to be found at each cycle and no extra parameters are needed. Since the optimal PM is accurate in the proposed architecture, the new architecture keeps the same BER performance as the conventional T-algorithm, and is well suited for high-rate codes. Meanwhile, pre-computation combined with pipelining greatly shortens the long critical path. It will be shown in Section III that the critical path of the new scheme could reach the theoretical iteration bound.

The remainder of the paper is organized as follows. Section II presents the general idea of the proposed pre-computation scheme. The example of a high-rate code and the details of the design are discussed in Section III. Section IV presents the simulation and synthesis results, followed by conclusion in Section V.

For VD implementation, research efforts have been focused on the ACSU and SMU. ACSU implementation is critical because the feedback loop makes it the bottleneck for high speed applications. Furthermore, as the constraint length K increases, the computation complexity and power consumption increase exponentially. To deal with these drawbacks, it is a common practice to use T-algorithm [2], [3] in VD for power saving purpose. In the T-algorithm, a threshold T is set and the difference between each PM and the optimal one is calculated.
II. THE PRE-COMPUTATION ALGORITHM

The basic idea of pre-computation is as follows. Consider a VD for a convolutional code with the constraint length of k, where each state receives p candidate paths. If the branch metrics are calculated based on the Euclidean distance, the optimal PM becomes the minimum value of all the PMs.

\[ \text{PMopt}(n) = \min \{ \text{PM}(n), \text{PM}(n), \ldots, \text{PM}(n) \} \]

\[ = \min \{ \min [\text{PM}(n-1) + \text{BM}(n)], \text{PM}(n-1) + \text{BM}(n), \ldots, \text{PM}(n-1) + \text{BM}(n) \} \]

\[ = \min \{ \min [\text{PM}(n-1) + \text{BM}(n)], \text{PM}(n-1) + \text{BM}(n), \ldots, \text{PM}(n-1) + \text{BM}(n) \} \]

\[ = \min \{ \min [\text{PM}(n-1) + \text{BM}(n)], \text{PM}(n-1) + \text{BM}(n), \ldots, \text{PM}(n-1) + \text{BM}(n) \} \]

\[ = \min \{ \min [\text{PM}(n-1) + \text{BM}(n)], \text{PM}(n-1) + \text{BM}(n), \ldots, \text{PM}(n-1) + \text{BM}(n) \} \]

\[ = \min \{ \text{PM}(n-1) + \text{BM}(n), \ldots, \text{PM}(n-1) + \text{BM}(n) \} \]

The trellis butterflies for a VD usually have a symmetric structure. In other words, the states can be grouped into m clusters, where all the clusters have the same number of states and all the states in the same cluster will be extended by the same BMs. Thus, Eq. (1) can be rewritten as

\[ \text{min} (\text{PMs} (n-1) \text{ in cluster 1}) + \min (\text{BM}s(n) \text{ for cluster 1}), \]

\[ \text{min} (\text{PMs} (n-1) \text{ in cluster 2}) + \min (\text{BM}s(n) \text{ for cluster 1}), \]

\[ \ldots \]

\[ \text{min} (\text{PMs} (n-1) \text{ in cluster m}) + \min (\text{BM}s(n) \text{ for cluster m}), \]

\[ \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{ } \text{
A crucial issue with implementing T-algorithm is how to quickly find out the optimal PM (the minimum value).

The shortest critical path we can achieve is from the regular ACSU without T-algorithm. That’s the amount of time each state needs to update its state value, as shown below:

\[ T_{full_trellis} = T_{adder} + T_{8-input_comparator} \]

A fully parallel 8-input comparator needs 28 adders and a large look-up table. To achieve a balanced trade-off between hardware area and clock speed, all comparators are designed to process at most 4 data in parallel. Comparators dealing with more inputs are built up with 2-input or 4-input comparators. The critical path now becomes

\[ T_{full_trellis} = T_{adder} + T_{4-in_comp} + T_{2-in-comp} \] (3)

Next, let us consider the VD with T-algorithm. The general functional diagram is shown in Fig. 5, where T-algorithm is implemented in the “PM purge algorithm” unit (PPAU).

In a VD with conventional T-algorithm implementation, the optimal PM is calculated from the 64 newly updated PMs. To find the minimum value of the 64 PMs, we use an architecture consisting of 3-stage 4-input comparators as shown in Fig. 6.

The critical path for the conventional T-algorithm implementation is computed by Eq. (4).

\[ T_{conv_T_algs} = T_{adder} + T_{4-in_comp} + T_{2-in-comp} + 3 T_{4-in_comp} + T_{adder} + T_{2-in-comp} \] (4)

\[ = 2 T_{adder} + 4 T_{4-in_comp} + 2 T_{2-in-comp} \]

The scheme proposed in [6] is called the SPEC-T algorithm, where it is assumed the value of (PMopt + threshold) can be obtained during the period when ACSU updates new PMs. The only extra calculation for SPEC-T algorithm is the comparison between the PMopt and all the PMs. Therefore, the critical path for the SPEC-T scheme is greatly shortened as shown in Eq. (5).

\[ T_{SPEC-T_algs} = T_{adder} + T_{4-in_comp} + T_{2-in-comp} + T_{2-in-comp} \] (5)

\[ = T_{adder} + T_{4-in_comp} + 2 T_{2-in-comp} \]

\[ T_{SPEC-T-algorithm} \] is also the iteration bound we can get for VD when T-algorithm is employed. The functional block of SPEC-T algorithm is slightly different from the one shown in Fig. 3, where the minimum BM is sent to the PPAU from the BMU. Since the estimated optimal PM is calculated each cycle, an accurate optimal PM is also needed every 6 to 7 cycles to compensate for the estimated one. For example, at time slot n, the decoder memorizes PMopt_esti (n) and PMs(n). After 7 cycles, PMopt_esti is added to PMopt_est (n+7). The problem with this compensation scheme is that the error between PMopt_est_i and PMopt_esti accumulates over at least 7-cycles due to the inherent delay of the scheme itself.
A. One-step pre-computation

For the convenience of discussion, we define the left-most register in Fig. 2 as the most-significant-bit (MSB) and the right-most register as the least-significant-bit (LSB). The 64 states and PMs are labeled from 0 to 63.

A careful study reveals that the 64 states can be partitioned in two groups: odd-numbered PMs (when the LSB is ‘1’) and even-numbered PMs (when the LSB is ‘0’). The odd PMs are all extended by odd BMs (when Z0 is ‘1’) and the even PMs are all extended by even BMs (when Z0 is ‘0’). The minimum PM becomes:

\[ PM_{opt}(n) = \min(\min(\text{even PMs}(n-1)) + \min(\text{even BMs}(n)), \min(\text{odd PMs}(n-1)) + \min(\text{odd BMs}(n))) \]

The functional diagram of the 1-step pre-computation scheme is shown in Fig. 7. Notice that, in Fig. 5, the PPAU have to wait for the new PMs from the ACSU to calculate the optimal PM, while in Fig. 7 the optimal PM is calculated directly from PMs in the previous cycles at the same time when the ACSU is calculating the new PMs. The details of the PPAU are shown in Fig. 8.

Figure 7. Functional block of a Viterbi decoder with 1-step pre-computation T-algorithm.

B. Two-step pre-computation

We again need to analyze the trellis transition of the original code. In the 1-step pre-computation architecture, we have pointed out that for the particular code shown in Fig. 2, odd-numbered states are extended by odd BMs, while even-numbered states are extended by even BMs. Furthermore, the even states extend to states with higher indices (the MSB in Fig. 2 is ‘1’) in the trellis transition, while the odd states extend to states with lower indices (the MSB is ‘0’ in Fig. 2). This information allows us to obtain the 2-step pre-computation data path. This process is straightforward, although the mathematical details are tedious. For clarity, we only provide the main conclusion here.

The states are further grouped into 4 clusters as described by Eq. (7). The BMs are categorized in the same way and are described by Eq. (8).

\[
\begin{align*}
\text{cluster}_3 &= \{PM_m \mid 0 \leq m \leq 63, m \mod 4 = 3\} \\
\text{cluster}_2 &= \{PM_m \mid 0 \leq m \leq 63, m \mod 4 = 1\} \\
\text{cluster}_1 &= \{PM_m \mid 0 \leq m \leq 63, m \mod 4 = 2\} \\
\text{cluster}_0 &= \{PM_m \mid 0 \leq m \leq 63, m \mod 4 = 0\}
\end{align*}
\]

\[
\begin{align*}
\text{BMG}_3 &= \{BM_m \mid 0 \leq m \leq 15, m \mod 4 = 3\} \\
\text{BMG}_2 &= \{BM_m \mid 0 \leq m \leq 15, m \mod 4 = 1\} \\
\text{BMG}_1 &= \{BM_m \mid 0 \leq m \leq 15, m \mod 4 = 2\} \\
\text{BMG}_0 &= \{BM_m \mid 0 \leq m \leq 15, m \mod 4 = 0\}
\end{align*}
\]

The critical path of the 1-step pre-computation scheme is

\[ T_{1\text{-step pre}} = 2T_{\text{Adder}} + 2T_{4\text{-in comp}} + 3T_{2\text{-in comp}} \ldots \ldots (6) \]

which is much shorter than that in Eq. (4). Note that compared with Fig. 6, the hardware overhead of the 1-step pre-computation scheme is about 4 adders, which is negligible. Compared with the SEPC-T algorithm, however, the critical path of the 1-step pre-computation scheme is still long. In order to further shorten the critical path, we explore the 2-step pre-computation design next.

Figure 8. Implementation of 1-step pre-computation T-algorithm.
The optimal PM at time $n$ is calculated as

$$
\text{PM}_{\text{opt}}(n) = \min \left\{ \min \{ \min (\text{cluster0}(n-2)) + \min (\text{BM}(n-1)) \}, \\
\min \{ \min (\text{cluster1}(n-2)) + \min (\text{BM}(n-1)) \}, \\
\min \{ \min (\text{cluster2}(n-2)) + \min (\text{BM}(n-1)) \}, \\
\min \{ \min (\text{cluster3}(n-2)) + \min (\text{BM}(n-1)) \} + \min \{ \min (\text{even BMs}(n)) \}, \\
\min \{ \min (\text{cluster0}(n-2)) + \min (\text{BM}(n-1)) \}, \\
\min \{ \min (\text{cluster1}(n-2)) + \min (\text{BM}(n-1)) \}, \\
\min \{ \min (\text{cluster2}(n-2)) + \min (\text{BM}(n-1)) \}, \\
\min \{ \min (\text{cluster3}(n-2)) + \min (\text{BM}(n-1)) \} + \min \{ \min (\text{odd BMs}(n)) \} \right\}.
$$

$(9)$

An intuitive illustration of the process is shown in Fig. 9. The “MIN 16” unit for finding the minimum value in each cluster is constructed with 2 stages of 4-input comparators.

Calculating $\text{PM}_{\text{opt}}(n)$ from $\text{PM}(n-2)$ means that the calculation can be completed within 2 cycles. Thus, the process is pipelined as two stages as indicated by the dashed line in Fig. 9. Again, we need to examine the critical path of each stage. The critical paths of the first stage (left side of the dashed line in Fig. 9) and the second stage (right side of the dashed line) are expressed in Eq. (10) and Eq. (11), respectively.

$$
T(\text{STAGE 1})_{2\text{-step pre-T}} = T_{\text{adder}} + 2 T_{4\text{-in\_comp}} \ldots \ldots \ldots \ldots \ldots (10)
$$

$$
T(\text{STAGE 2})_{2\text{-step pre-T}} = T_{\text{adder}} + T_{4\text{-in\_comp}} + 2 T_{2\text{-in\_comp}}. (11)
$$

Comparing with Eq. (5), the shortest path needed for $T$-algorithm, we find that $T(\text{stage}1)_{2\text{-step pre-T}}$ is shorter since the gate delay of the 2-stage 2-input comparator is slightly longer than that of a fully parallel 4-input comparator. On the other hand, $T(\text{stage}2)_{2\text{-step pre-T}}$ is longer than $T_{\text{SPEC T\_alg}}$ by the delay of an adder. However, by re-arranging the process and introducing one redundant adder, we can further reduce $T(\text{stage}2)_{2\text{-step pre-T}}$ to that in Eq. (5). The details are shown in Fig. 10.

In Fig. 10, the left side of the dashed line remains the same. However, at the right side, the threshold value is added to both the $\min$ (even BMs) and $\min$ (odd BMs) before $\text{PM}_{\text{opt}}$ is found out. Now, the critical path of the PPAU for the 2-step pre-computation scheme is the same as the iteration bound in Eq. (5). Compared with the conventional implementation for $T$-algorithm, the hardware overhead of the architecture in Fig. 10 includes 11 adders, a 4-input comparator and a 2-input comparator, which is about the same size of the ACS circuitry for one state.

More steps of pre-computation can be further achieved with larger hardware overhead; however, it is generally unnecessary in practice.

IV. SIMULATION AND IMPLEMENTATION RESULTS

The proposed pre-computation scheme has the same performance as the original $T$-algorithm shown in Fig. 3 since for both case the accurate optimal PM is found each cycle.

In [6], it is suggested that the SPEC-T algorithm adjusts the estimated value every 7 cycles. However, our simulation shows that if the estimated value is adjusted every 3 or more cycles, there is a high probability ($>3\%$ at a BER $\approx 10^{-3}$) that the decoder will lose all the survival paths during the decoding process due to the purging scheme according to the threshold and the estimated $\text{PM}_{\text{opt}}$. When the adjustment frequency reduces to once every 2 cycles, the probability drops down to 0.1%, which is still not acceptable for practical systems. Therefore, the SPEC-T algorithm must adjust the estimated value each cycle, which is equivalent to the conventional $T$-algorithm.

For a more detailed comparison, we implemented, in FPGA, the ACSU using several different schemes: 1) conventional implementation of $T$-algorithm, 2) the proposed 1-step pre-computation scheme, and 3) 2-step pre-computation schemes. SPEC-T algorithm is not considered here since it degrades to conventional $T$-algorithm. The synthesis results are summarized in Table I.
### TABLE I  SYNTHESIS RESULTS

<table>
<thead>
<tr>
<th></th>
<th>Number of 4 input LUTs</th>
<th>Number of flip flops</th>
<th>Minimum period (ns)</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Regular T-algorithm</strong></td>
<td>23326 (17 %)</td>
<td>576 (0 %)</td>
<td>39.70</td>
</tr>
<tr>
<td><strong>1-step pre-comp</strong></td>
<td>21570 (15 %)</td>
<td>648 (0 %)</td>
<td>61.50</td>
</tr>
<tr>
<td><strong>2-step pre-comp</strong></td>
<td>22258 (16 %)</td>
<td>686 (0 %)</td>
<td>81.97</td>
</tr>
</tbody>
</table>

Table I shows that by applying the 2-step pre-computation architecture, the clock speed is doubled comparing with the conventional implementation of T-algorithm. It is also observed that, the pre-computation architecture requires so small hardware overhead that it is not evident in FPGA synthesis result.

### V. CONCLUSION

In this paper, a pre-computation scheme with associated hardware architecture is proposed for Viterbi decoders employing T-algorithm. Compared with existing schemes that target at efficient implementation of T-algorithm, the proposed approach is more reliable in general. The analysis of the critical path reveals that the pre-computation scheme can achieve the iteration bound for Viterbi decoders employing T-algorithm with negligible hardware overhead. Simulation results show that the proposed scheme maintains the same BER performance as the conventional T-algorithm while other schemes could completely fail decoding. Synthesis results with FPGA have verified the significant speedup of the proposed design.

### REFERENCES


