Loop unrolling is a well-known loop transformation that has been used in optimizing compilers for over three decades. In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include (i) a more detailed cost model that includes register locality, instruction-level parallelism, and instruction-cache considerations; (ii) a new code generation algorithm that generates more compact code than the unroll-and-jam transformation; and (iii) a new algorithm for efficiently enumerating feasible unroll vectors. Our experimental results confirm the wide applicability of our approach by showing a 2.2× speedup on matrix multiply and an average 1.08× speedup on seven of the SPEC95fp benchmarks (with a 1.2× speedup for two benchmarks). Larger performance improvements can be expected on processors that have more registers and greater instruction-level parallelism than the processor used for our measurements (PowerPC 604).
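To make the baseline transformation concrete, the sketch below shows unroll-and-jam applied to matrix multiply in C: the outer i loop is unrolled by 2 and the copies are fused into the inner loop, so each load of B reaches two independent multiply-adds. This is only a minimal illustration of the transformation the paper improves upon, not the paper's code-generation algorithm (which produces more compact code) or its cost model; the function names, the size N, and the even-N assumption are all illustrative.

    #include <stdio.h>

    #define N 4  /* illustrative size; assumed to be a multiple of the unroll factor */

    /* Baseline: straightforward matrix multiply, C = A * B. */
    void matmul(double C[N][N], double A[N][N], double B[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += A[i][k] * B[k][j];
                C[i][j] = s;
            }
    }

    /* Unroll-and-jam: the i loop is unrolled by 2 and the two copies are
     * fused ("jammed") into the inner k loop. Each B[k][j] loaded in the
     * inner loop is now reused for two rows of A, reducing memory traffic
     * and exposing independent multiply-adds for instruction-level
     * parallelism. A remainder loop would be needed for odd N. */
    void matmul_unroll_jam(double C[N][N], double A[N][N], double B[N][N]) {
        for (int i = 0; i < N; i += 2)
            for (int j = 0; j < N; j++) {
                double s0 = 0.0, s1 = 0.0;
                for (int k = 0; k < N; k++) {
                    double b = B[k][j];      /* one load, two uses */
                    s0 += A[i][k]     * b;
                    s1 += A[i + 1][k] * b;
                }
                C[i][j]     = s0;
                C[i + 1][j] = s1;
            }
    }

    int main(void) {
        double A[N][N], B[N][N], C[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = i + j;
                B[i][j] = i - j;
            }
        matmul_unroll_jam(C, A, B);
        printf("C[0][0] = %g\n", C[0][0]);
        return 0;
    }

The choice of unroll factor trades the register pressure of the extra accumulators (s0, s1) against the reuse gained per load, which is exactly the selection problem the paper's cost model addresses.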
International Journal of Parallel Programming – Springer Journals
Published: Oct 16, 2004