Optimized Unrolling of Nested Loops



Publisher
Springer Journals
Copyright
Copyright © 2001 by Plenum Publishing Corporation
Subject
Computer Science; Processor Architectures; Software Engineering/Programming and Operating Systems; Theory of Computation
ISSN
0885-7458
eISSN
1573-7640
DOI
10.1023/A:1012246031671

Abstract

Loop unrolling is a well-known loop transformation that has been used in optimizing compilers for over three decades. In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include (i) a more detailed cost model that includes register locality, instruction-level parallelism, and instruction-cache considerations; (ii) a new code generation algorithm that generates more compact code than the unroll-and-jam transformation; and (iii) a new algorithm for efficiently enumerating feasible unroll vectors. Our experimental results confirm the wide applicability of our approach by showing a 2.2× speedup on matrix multiply, and an average 1.08× speedup on seven of the SPEC95fp benchmarks (with a 1.2× speedup for two benchmarks). Larger performance improvements can be expected on processors that have more registers and a higher degree of instruction-level parallelism than the processor used for our measurements (PowerPC 604).
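
For readers unfamiliar with the baseline that the abstract compares against, the following is a minimal sketch of the classical unroll-and-jam transformation applied to matrix multiply, written by hand in C. It is not the paper's code generation algorithm or its cost model. The function names `matmul_ref` and `matmul_uj22`, the row-major layout, and the unroll vector (2, 2, 1) are assumptions chosen only for illustration. The sketch shows why unrolling outer loops can improve register locality: each iteration of the innermost loop performs four loads for four multiply-adds, where the original loop needs two loads per multiply-add.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: C += A * B for n x n row-major matrices. */
void matmul_ref(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Hand-written unroll-and-jam with an assumed unroll vector (2, 2, 1):
 * the i and j loops are unrolled by 2 and the copies of the k loop are
 * fused, so a 2 x 2 tile of C is kept in scalar accumulators. */
void matmul_uj22(size_t n, const double *A, const double *B, double *C) {
    assert(A && B && C);
    size_t ne = n - n % 2;  /* largest even bound covered by the unrolled loops */

    for (size_t i = 0; i < ne; i += 2) {
        for (size_t j = 0; j < ne; j += 2) {
            double c00 = C[i * n + j],       c01 = C[i * n + j + 1];
            double c10 = C[(i + 1) * n + j], c11 = C[(i + 1) * n + j + 1];
            for (size_t k = 0; k < n; k++) {
                /* Four loads feed four multiply-adds. */
                double a0 = A[i * n + k], a1 = A[(i + 1) * n + k];
                double b0 = B[k * n + j], b1 = B[k * n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i * n + j] = c00;       C[i * n + j + 1] = c01;
            C[(i + 1) * n + j] = c10; C[(i + 1) * n + j + 1] = c11;
        }
        /* Remainder column (only when n is odd) for rows i and i + 1. */
        for (size_t j = ne; j < n; j++)
            for (size_t r = i; r < i + 2; r++)
                for (size_t k = 0; k < n; k++)
                    C[r * n + j] += A[r * n + k] * B[k * n + j];
    }
    /* Remainder row (only when n is odd). */
    for (size_t i = ne; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

The remainder loops illustrate the code-size cost of this transformation: the loop body appears several times in the output. The abstract's second contribution is an algorithm that generates more compact code than this; the sketch does not attempt to reproduce it.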

Journal

International Journal of Parallel Programming, Springer Journals

Published: Oct 16, 2004
