Distributed Arithmetic (DA) is a widely used technique for implementing sum-of-products computations without the use of multipliers. Designers frequently use DA to build efficient Multiply-Accumulate Circuitry (MAC) for filters and other DSP applications. The main advantage of DA is its computational efficiency: it distributes multiply and accumulate operations across shifters, lookup tables (LUTs), and adders in such a way that conventional multipliers are not required.
In a DA realization of a FIR filter structure, a sequence of input data words of width W is fed through a parallel-to-serial shift register, producing a serialized stream of bits. The serialized data is then fed to a bit-wise shift register. This shift register serves as a delay line, storing the bit-serial data samples. The delay line is tapped (based on the input word size W) to form a W-bit address that indexes into a lookup table (LUT). The LUT stores all possible sums of partial products over the filter coefficient space. The LUT is followed by a shift and adder (scaling accumulator) that adds the values obtained from the LUT sequentially.
A table lookup is performed sequentially for each bit (in order of significance starting from the LSB). On each clock cycle, the LUT result is added to the accumulated and shifted result from the previous cycle. For the last bit (MSB), the table lookup result is subtracted, accounting for the sign of the operand.
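The reason the MSB lookup is subtracted can be seen from the two's-complement expansion of each sample. The following derivation is a sketch under stated assumptions: the symbols N (number of taps), h_k (coefficients), x_k (W-bit two's-complement integer samples), b_{k,j} (bit j of sample k), and F_j are introduced here for illustration and do not appear in the text above.

```latex
\begin{aligned}
x_k &= -b_{k,W-1}\,2^{W-1} + \sum_{j=0}^{W-2} b_{k,j}\,2^{j}, \\
y   &= \sum_{k=0}^{N-1} h_k\,x_k
     = -2^{W-1} F_{W-1} + \sum_{j=0}^{W-2} 2^{j} F_j,
\qquad
F_j = \sum_{k=0}^{N-1} h_k\,b_{k,j}.
\end{aligned}
```

Each F_j is a sum of a subset of the coefficients, selected by the bits b_{0,j}, ..., b_{N-1,j}. Those sums are what the LUT stores, and the factors 2^j correspond to the shifts in the scaling accumulator.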
This basic form of DA is fully serial, operating on one bit at a time. If the input data sequence is W bits wide, then a FIR structure takes W clock cycles to compute the output. Symmetric and antisymmetric FIR structures are an exception, requiring W+1 cycles, because one additional clock cycle is needed to process the carry bit of the preadders.
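The fully serial scheme can be modeled in a few lines of Python. This is a minimal behavioral sketch, not the generated HDL: the function names `build_da_lut` and `da_dot` are hypothetical, coefficients and samples are assumed to be plain integers, and the LUT address is assumed to carry one bit from each tap, as in the textbook formulation. Hardware typically shifts the accumulator instead of weighting the LUT output; the two are arithmetically equivalent.

```python
def build_da_lut(coeffs):
    """Precompute every possible sum of coefficients, one entry per address.

    Bit k of the address selects whether coeffs[k] is included in the sum.
    """
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_dot(coeffs, samples, width):
    """Bit-serial DA inner product of integer coeffs and signed samples.

    Each sample must fit in a `width`-bit two's-complement word.
    Returns (result, cycles); cycles counts LUT lookups, one per input bit.
    """
    lut = build_da_lut(coeffs)
    mask = (1 << width) - 1
    words = [s & mask for s in samples]  # two's-complement bit patterns

    acc = 0
    cycles = 0
    for bit in range(width):  # LSB first
        # Gather bit `bit` of every sample into one LUT address.
        addr = 0
        for k, word in enumerate(words):
            addr |= ((word >> bit) & 1) << k
        term = lut[addr] << bit  # weight by 2**bit (software stand-in for the shift)
        if bit == width - 1:
            acc -= term  # MSB carries negative weight in two's complement
        else:
            acc += term
        cycles += 1
    return acc, cycles


coeffs = [3, -1, 4, 2]
samples = [5, -8, 7, -3]  # all fit in 4-bit signed words
result, cycles = da_dot(coeffs, samples, width=4)
print(result, cycles)  # 45 4
```

The result matches the direct sum of products, and the loop runs once per input bit, which corresponds to the W clock cycles of the fully serial form.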
You can control how DA code is generated by using the DALUTPartition and DARadix implementation parameters. The DALUTPartition and DARadix parameters have certain requirements and restrictions that are specific to different filter types. These requirements are included in the discussions of each parameter.
Reduce LUT Size: DALUTPartition
Improve Performance with Parallelism: DARadix
For information on the theoretical foundations of DA, see Further References.
Generation of DA code is supported only for fixed-point filter designs.
The data path in HDL code generated for the DA architecture is carefully optimized for full precision computations. The filter result is cast to the output data size only at the final stage when it is presented to the output.
Distributed arithmetic merges the product and accumulator operations and performs computations at full precision. Code generation for DA therefore ignores the Product output and Accumulator properties of the Digital Filter block and treats these properties as full precision.
DA ignores taps that have zero-valued coefficients and reduces the size of the DA LUT accordingly.
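To get a feel for how much dropping zero-valued taps can save, the following sketch counts LUT entries before and after. It assumes a single unpartitioned LUT with one address bit per remaining tap; the function name `da_lut_entries` and the example coefficients are hypothetical, and the organization of the generated LUT may differ.

```python
def da_lut_entries(coeffs):
    """Number of entries in a single DA LUT once zero-valued taps are dropped.

    Assumes one address bit per nonzero coefficient and no LUT partitioning.
    """
    active_taps = [c for c in coeffs if c != 0]
    return 1 << len(active_taps)


# Illustrative half-band-style coefficient set with four zero-valued taps.
taps = [1, 0, -3, 0, 10, 16, 10, 0, -3, 0, 1]
print(da_lut_entries(taps), 1 << len(taps))  # 128 2048
```

Under these assumptions, each zero-valued tap that is removed halves the number of LUT entries.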
For symmetric and antisymmetric filters:
A bit-level preadder or presubtractor is required to combine tap data values whose coefficients have equal magnitude, with the same sign (symmetric) or opposite sign (antisymmetric). One extra clock cycle is required to compute the result because of the additional carry bit.
HDL Coder™ takes advantage of filter symmetry where possible. This reduces the DA LUT size substantially, because the effective filter length for these filter types is halved.
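The effect of the preadder can be illustrated with a small word-level model. This is a sketch under stated assumptions: the helper names `fold_symmetric` and `fits_signed` are hypothetical, the data are plain integers, and the generated hardware performs the pre-addition bit-serially rather than on whole words.

```python
def fold_symmetric(coeffs, samples, antisymmetric=False):
    """Fold a symmetric (or antisymmetric) FIR tap set to half length.

    Pairs samples[k] with samples[n-1-k] through a preadder (or
    presubtractor), so only the first half of the coefficients is needed.
    """
    n = len(coeffs)
    half = n // 2
    sign = -1 if antisymmetric else 1
    folded_coeffs = list(coeffs[:half])
    folded_samples = [samples[k] + sign * samples[n - 1 - k] for k in range(half)]
    if n % 2 == 1:
        # Odd length: the center tap has no partner and passes through.
        folded_coeffs.append(coeffs[half])
        folded_samples.append(samples[half])
    return folded_coeffs, folded_samples


def fits_signed(value, bits):
    """True if value fits in a two's-complement word of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


coeffs = [1, -2, 5, -2, 1]    # symmetric, 5 taps
samples = [7, -8, 3, -8, 7]   # 4-bit signed inputs
fc, fs = fold_symmetric(coeffs, samples)

direct = sum(h * x for h, x in zip(coeffs, samples))
folded = sum(h * x for h, x in zip(fc, fs))
print(direct, folded)                      # 61 61
print(fs, all(fits_signed(v, 5) for v in fs))
print(1 << len(coeffs), 1 << len(fc))      # 32 8
```

In this example the folded samples include -16, which no longer fits in 4 bits but does fit in 5. That extra bit is the carry that costs the additional clock cycle, while the number of LUT address bits drops from 5 to 3.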
Detailed discussions of the theoretical foundations of DA appear in the following publications:
Meyer-Baese, U., Digital Signal Processing with Field Programmable Gate Arrays, Second Edition, Springer, pp. 88–94, 128–143
White, S.A., Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review. IEEE ASSP Magazine, Vol. 6, No. 3