Xia et al. / J Zhejiang Univ Sci A 2009 10(7):1067-1074
New method for high performance multiply-accumulator design*
Bing-jie XIA†, Peng LIU†‡, Qing-dong YAO
(Department of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China)
†E-mail: icysummer@zju.edu.cn; liupeng@isee.zju.edu.cn
Received July 27, 2008; Revision accepted Oct. 28, 2008; Crosschecked Apr. 27, 2009
Abstract: This study presents a new 4-pipelined high-performance split multiply-accumulator (MAC) architecture, developed for media processors, that supports multiple precisions. To speed up the design further, a novel partial product compression circuit based on interleaved adders and a modified hybrid partial product reduction tree (PPRT) scheme are proposed. The MAC can perform 1-way 32-bit and 4-way 16-bit signed/unsigned multiply or multiply-accumulate operations, as well as 2-way parallel multiply add (PMADD) operations, at a frequency of 1.25 GHz under worst-case conditions and 1.67 GHz under typical-case conditions. Compared with the MAC in the 32-bit microprocessor without interlocked pipeline stages (MIPS), the proposed design shows a clear advantage in speed. Moreover, an improvement of up to 32% in throughput is achieved. The MAC design has been fabricated in Taiwan Semiconductor Manufacturing Company (TSMC) 90-nm CMOS standard cell technology and has passed a functional test.
Key words: Multiply-accumulator (MAC), Pipeline, Compressor, Partial product reduction tree (PPRT), Split structure
doi:10.1631/jzus.A0820566 Document code: A CLC number: TP332
INTRODUCTION
The multiply-accumulate operation is one of the basic arithmetic operations used extensively in modern digital signal processing (DSP). Most DSP algorithms, such as digital filtering, convolution and the fast Fourier transform (FFT), require high-performance multiply-accumulate operations. The multiply-accumulator (MAC) unit always lies in the critical path that determines the speed of the overall hardware system. Therefore, a high-speed MAC that is capable of supporting multiple precisions and parallel operations is highly desirable.
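To illustrate why the MAC sits at the heart of such algorithms, the following minimal Python sketch computes one output sample of a finite impulse response (FIR) filter as a chain of multiply-accumulate steps. The function names `mac` and `fir_output` are hypothetical and chosen only for this illustration; the sketch models the arithmetic, not the hardware proposed in this paper.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: returns acc + a * b."""
    return acc + a * b


def fir_output(coeffs, samples):
    """Compute one FIR filter output sample.

    Each tap contributes exactly one MAC step, so an N-tap filter
    needs N MAC operations per output sample.
    """
    acc = 0
    for c, x in zip(coeffs, samples):
        acc = mac(acc, c, x)
    return acc


# Three-tap example: 1*4 + 2*5 + 3*6
result = fir_output([1, 2, 3], [4, 5, 6])
```

Because every tap of every output sample costs one MAC step, the latency and throughput of the MAC unit directly bound the achievable filter rate.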
The existing MAC implementation methods in the literature can generally be classified into three categories. The first category is the recursive MAC method (Clark et al., 2001; Liao and Roberts, 2002), which builds wider vector elements out of several narrower ones and then adds the multiple results together. This is done iteratively by passing the data back through the unit over more than one cycle. The method saves hardware resources but requires several clock cycles per operation. The second category is the parallel MAC method (Perri et al., 2005; Parandeh-Afshar et al., 2006; MIPS Technologies Inc., 2006; 2007), implemented by unrolling the iterative loop of the recursive MAC method, which achieves high speed at the cost of hardware resources. This method has been applied in most field programmable gate array (FPGA) chips of the Xilinx Corporation. The last category is the shared-segmentation vector MAC method (Tan et al., 2003; Danysh and Tan, 2005; Wang et al., 2008), which shares the partial product reduction tree (PPRT) and the final carry-propagate adder (CPA) between different precision modes by inserting mode-dependent logic. This method is capable of supporting multiple precision operations. However, the single shared output register structure limits the throughput, and the complex PPRT structure results in lower speed.
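The idea of building a wide product out of narrower ones, which underlies the recursive method, can be sketched in Python as follows. This is only an arithmetic illustration under stated assumptions: operands are unsigned, and the helper names `mul16` and `mul32_from_16` are hypothetical and do not come from the cited designs. A recursive MAC would evaluate the four narrow products on one multiplier over several cycles, whereas a parallel MAC would instantiate four multipliers and evaluate them at once.

```python
MASK16 = 0xFFFF  # selects the low 16 bits of an operand


def mul16(a, b):
    """Stand-in for a narrow 16x16 -> 32-bit unsigned multiplier."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b


def mul32_from_16(a, b):
    """Form an unsigned 32x32 -> 64-bit product from four 16x16 products."""
    a_hi, a_lo = a >> 16, a & MASK16
    b_hi, b_lo = b >> 16, b & MASK16

    # Four narrow partial products (assumption: unsigned operands only).
    ll = mul16(a_lo, b_lo)
    lh = mul16(a_lo, b_hi)
    hl = mul16(a_hi, b_lo)
    hh = mul16(a_hi, b_hi)

    # Align each partial product by its weight and add them together.
    return ll + ((lh + hl) << 16) + (hh << 32)


product = mul32_from_16(0x12345678, 0x9ABCDEF0)
```

Signed operands would need additional sign handling of the high halves, which is omitted here for brevity.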
In this study, we propose a high-speed split MAC
that takes advantage of parallel and shared-segmen-
Journal of Zhejiang University SCIENCE A
ISSN 1673-565X (Print); ISSN 1862-1775 (Online)
www.zju.edu.cn/jzus; www.springerlink.com
E-mail: jzus@zju.edu.cn
‡ Corresponding author
* Project (No. 60873112) supported by the National Natural Science Foundation of China