Loop unrolling is a well-known loop transformation that has been used in optimizing compilers for over three decades. In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include (i) a more detailed cost model that includes register locality, instruction-level parallelism, and instruction-cache considerations; (ii) a new code generation algorithm that generates more compact code than the unroll-and-jam transformation; and (iii) a new algorithm for efficiently enumerating feasible unroll vectors. Our experimental results confirm the wide applicability of our approach by showing a 2.2× speedup on matrix multiply and an average 1.08× speedup on seven of the SPEC95fp benchmarks (with a 1.2× speedup for two benchmarks). Larger performance improvements can be expected on processors that have more registers and greater instruction-level parallelism than the processor used for our measurements (PowerPC 604).
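To make the baseline transformation concrete, the sketch below shows unroll-and-jam applied to matrix multiply in C: the outer i loop is unrolled by 2 and the copies are fused into the inner loop, so each load of B reaches two independent multiply-adds. This is only a minimal illustration of the transformation the paper improves upon, not the paper's code-generation algorithm (which produces more compact code) or its cost model; the function names, the size N, and the even-N assumption are all illustrative.

    #include <stdio.h>

    #define N 4  /* illustrative size; assumed to be a multiple of the unroll factor */

    /* Baseline: straightforward matrix multiply, C = A * B. */
    void matmul(double C[N][N], double A[N][N], double B[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += A[i][k] * B[k][j];
                C[i][j] = s;
            }
    }

    /* Unroll-and-jam: the i loop is unrolled by 2 and the two copies are
     * fused ("jammed") into the inner k loop. Each B[k][j] loaded in the
     * inner loop is now reused for two rows of A, reducing memory traffic
     * and exposing independent multiply-adds for instruction-level
     * parallelism. A remainder loop would be needed for odd N. */
    void matmul_unroll_jam(double C[N][N], double A[N][N], double B[N][N]) {
        for (int i = 0; i < N; i += 2)
            for (int j = 0; j < N; j++) {
                double s0 = 0.0, s1 = 0.0;
                for (int k = 0; k < N; k++) {
                    double b = B[k][j];      /* one load, two uses */
                    s0 += A[i][k]     * b;
                    s1 += A[i + 1][k] * b;
                }
                C[i][j]     = s0;
                C[i + 1][j] = s1;
            }
    }

    int main(void) {
        double A[N][N], B[N][N], C[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = i + j;
                B[i][j] = i - j;
            }
        matmul_unroll_jam(C, A, B);
        printf("C[0][0] = %g\n", C[0][0]);
        return 0;
    }

The choice of unroll factor trades the register pressure of the extra accumulators (s0, s1) against the reuse gained per load, which is exactly the selection problem the paper's cost model addresses.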
International Journal of Parallel Programming – Springer Journals
Published: Oct 16, 2004