TY - JOUR
AU - Torrellas, Josep
AB - INTRODUCTION

As transistor sizes continue to scale down, we are about to witness extraordinary levels of chip integration. Sometime early in the next decade, as we reach 7 nm, we will be able to integrate, for example, 1000 sizable cores and substantial memory on a single die. There are many unknowns as to how to build a general-purpose architecture in such an environment. However, we know that the main challenge will be to make it highly energy efficient. Energy and power consumption have emerged as the main obstacles to designing more capable architectures.

Given this energy efficiency challenge, researchers have coined the term ‘Extreme Scale Computer Architecture’ to refer to computer organizations that, loosely speaking, are 100–1000 times more capable than current systems for the same power consumption and physical footprint. For example, these organizations should deliver a datacenter that provides exascale performance (|$10^{18}$| operations per second) for 20 MW, a departmental server that provides petascale performance (|$10^{15}$| operations per second) for 20 kW, and a portable device that provides sustained terascale performance (|$10^{12}$| operations per second) for 20 W. Extreme-scale computing is concerned with technologies that are applicable to all machine sizes—not just high-end systems.

Extreme-scale computer architectures need to be designed for energy efficiency from the ground up. They need to have efficient support for concurrency, since only massive parallelism will deliver this performance. They should also minimize data transfers, since moving data around is a major source of energy consumption. Finally, they need to leverage new technologies that will be developed in the next few years. These technologies include low supply voltage (Vdd) operation, 3D die stacking, resistive memories, and photonic interconnects. In this paper, we outline some of the challenges that appear at different layers of the computing stack, and some of the techniques that can be used to address them.

BACKGROUND

For several decades, the processor industry has seen a steady growth in CPU performance, driven by Moore's Law [1] and classical (or Dennard) scaling [2]. Under classical scaling, the power density remains constant across semiconductor generations. Specifically, consider the dynamic power (|$P_{\text{dyn}}$|) consumed by a certain number of transistors that fit in a chip area A. The dynamic power is proportional to |$C \times V_{\text{dd}}^2 \times f$|, where C is the capacitance of the devices and f is the frequency of operation. Hence, the power density is proportional to |$C \times V_{\text{dd}}^2 \times f/A$|. As one moves to the next fabrication generation, the linear dimension of a device gets multiplied by a factor close to 0.7. The same is the case for Vdd and C, while f gets multiplied by 1/0.7. Moreover, the area of the transistors is now |$0.7^2 \times A$|. If we compute the new power density, we have |$0.7C \times (0.7V_{\text{dd}})^2 \times f/(0.7^3 \times A)$|. Consequently, the power density remains constant.

Unfortunately, as the feature size decreased below 130 nm over a decade ago, classical scaling ceased to apply, for two reasons. First, Vdd could not be decreased as fast as before. In fact, in recent years, it has stagnated around 1 V, mostly because, as Vdd gets smaller and closer to the threshold voltage (Vth) of the transistor, the transistor's switching speed decreases rapidly. The second reason is that static power became significant.
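Writing the substitution out makes the constancy explicit, and also illustrates (as a back-of-the-envelope consequence, not a measurement) what happens once Vdd stops scaling:

\[ \frac{P'_{\text{dyn}}}{A'} = \frac{(0.7C)\,(0.7V_{\text{dd}})^{2}\,(f/0.7)}{0.7^{2}\,A} = \frac{0.7^{3}}{0.7^{3}}\cdot\frac{C\,V_{\text{dd}}^{2}\,f}{A} = \frac{C\,V_{\text{dd}}^{2}\,f}{A}, \]

whereas holding Vdd fixed in the same substitution gives

\[ \frac{P'_{\text{dyn}}}{A'} = \frac{(0.7C)\,V_{\text{dd}}^{2}\,(f/0.7)}{0.7^{2}\,A} \approx 2 \times \frac{C\,V_{\text{dd}}^{2}\,f}{A}, \]

i.e. roughly a doubling of dynamic power density per generation, even before static power is counted.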
The overall result is that, under real scaling, the power density of a set of transistors increases rapidly with each generation—making it progressively harder to feed the needed power and extract the resulting heat. In addition, the large amount of power that needs to be provided causes concerns at both ends of the computing spectrum. At the high end, data centers face large energy bills while, at the low end, hand-held devices are limited by the capacity of their batteries. Overall, all of these trends motivate the emergence of research on extreme-scale computing.

ENERGY-EFFICIENT CHIP SUBSTRATE

To realize extreme-scale computing systems, devices and circuits need to be designed to operate at low Vdd. This is because Vdd reduction is the best lever available to increase the energy efficiency of computing. Vdd reduction induces a quadratic reduction in dynamic energy, and a larger-than-linear reduction in static energy. As a result, an environment with Vdd ≈ 500 mV is much more energy efficient than one with the conventional Vdd ≈ 0.9 V: it potentially consumes 40 times less power [3,4]. This substantial power reduction implies that many more cores can now be placed on a given power-constrained chip.

Unfortunately, there are well-known drawbacks of low Vdd. They include a lower switching speed and a large increase in process variation—the result of Vdd being close to Vth. It is possible that researchers will find ways of delivering low-Vdd devices of acceptable speed. However, the issue of dealing with high process variation is especially challenging.

The effects of process variation

Process variation is the deviation of the values of device parameters (such as a transistor's Vth, channel length, or channel width) from their nominal specification. Such deviation causes variation in the switching speed and the static power consumption of nominally identical devices in a chip. At the architectural level, this effect translates into cores and on-chip memories that are slower and consume more static power than they would otherwise do.

To see why, consider Fig. 1. Figure 1(a) shows a hypothetical distribution of the latencies of dynamic logic paths in a pipeline stage. The X axis shows the latency, while the Y axis shows the number of paths with such latency. Without process variation (taller curve), the pipeline stage can cycle at a frequency |$1/\tau_{\text{NOM}}$|. With variation (shorter curve), some paths become faster, while others become slower. The pipeline stage's frequency is determined by the slowest paths, and is now only |$1/\tau_{\text{VAR}}$|.

Figure 1. Effect of process variation on the speed (a) and static power consumption (b) of architecture structures.

Figure 1(b) shows the effect of process variation on the static power (|$P_{\text{STA}}$|). The X axis of the figure shows the Vth of different transistors, and the Y axis the transistors' |$P_{\text{STA}}$|. The |$P_{\text{STA}}$| of a transistor is related to its Vth exponentially, with |$P_{\text{STA}}\propto e^{-V_{\text{th}}}$|. Because of this exponential relationship, the static power saved by high-Vth transistors is less than the extra static power consumed by low-Vth transistors. Hence, integrating over all of the transistors in the core or memory module, the total |$P_{\text{STA}}$| goes up with variation.
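This asymmetry is simply the convexity of the exponential. In the simplified |$P_{\text{STA}}\propto e^{-V_{\text{th}}}$| model above (proportionality constants and units omitted), a pair of transistors whose Vth deviates symmetrically by ±ΔV from the nominal value leaks

\[ e^{-(V_{\text{th}}-\Delta V)} + e^{-(V_{\text{th}}+\Delta V)} = 2\,e^{-V_{\text{th}}}\cosh(\Delta V) > 2\,e^{-V_{\text{th}}}, \]

so any zero-mean spread in Vth strictly increases the total static power of a structure.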
Process variation has a systematic component that exhibits spatial correlation. This means that nearby transistors will typically have similar speed and power consumption properties. Hence, due to variation within a chip, some regions of the chip will be slower than others, and some will be leakier than others. If we need to set a single Vdd and frequency for the whole chip, we need to set them according to the slowest and leakiest neighborhoods of the chip. This conservative design is too wasteful for extreme-scale computing.

Multiple voltage domains

Low-Vdd chips will be large and heavily affected by process variation. To tolerate process variation within a chip, the most appealing idea is to have multiple Vdd and frequency domains. A domain encloses a region with similar values of the variation parameters. In this environment, we want to set a domain with slow transistors to a higher Vdd, to meet timing. On the other hand, we want to set a domain with fast, leaky transistors to a lower Vdd, to save energy. For this reason, extreme-scale low-Vdd chips are likely to have multiple, possibly many, Vdd and frequency domains.

However, current designs for Vdd domains are energy inefficient [5]. First, the on-chip switching voltage regulators (SVRs) that provide the Vdd for a domain have a high power loss, often in the 10%–15% range. Wasting so much power in an efficiency-first environment is hardly acceptable. In addition, small Vdd domains are more susceptible to variations in the load offered to the power grid, because they lack the load-averaging effect of a whole-chip Vdd domain. These load variations induce Vdd droops that need to be protected against with larger Vdd guardbands [6]—also hardly acceptable in an efficiency-first environment. Finally, conventional SVRs take up a lot of area and, therefore, including several of them on chip is unappealing.

What is needed

To address these limitations, several techniques are needed. First, an extreme-scale chip needs to be designed with devices whose parameters are optimized for low-Vdd operation [7]. Simply reusing conventional device designs can result in slow devices.

Second, voltage regulators need to be designed for high energy efficiency and modest area. One possible approach is to organize them hierarchically [8]. The first level of the hierarchy is composed of one or a handful of SVRs, potentially placed on a stacked die, with devices optimized for the SVR inductances. The second level is composed of many on-chip low-drop-out (LDO) voltage regulators. Each LDO is connected to one of the first-level SVRs and provides the Vdd for a core or a small number of cores. LDOs have high energy efficiency if the ratio of their output voltage (Vo) to their input voltage (Vi) is close to 1. Thanks to systematic process variation, the LDOs in a region of the chip need to provide similar Vo values to the different cores of the region. Since these LDOs take their Vi from the same first-level SVR and their Vo values are similar, their efficiency can be over 90%. In addition, their area is negligible: they reuse the hardware of the power-gating circuit that is likely to be present in the chip already to power gate each core.

Finally, to minimize energy waste, the chip should have extensive power-gating support. This is important at low Vdd because leakage accounts for the largest fraction of the energy consumption. Ideally, power gating should be done at fine granularities, such as groups of cache lines or groups of functional units. Fine granularities lead to high potential savings, but complicate the circuit design.
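To make the efficiency argument concrete, the sketch below uses made-up per-core Vdd targets for one spatially correlated cluster and an assumed 90% efficiency for the shared first-level SVR. It computes the per-core LDO efficiency (approximated as Vo/Vi for a linear regulator) and the overall delivery efficiency; because the targets within the cluster are close, Vo/Vi stays near 1.

    /* Sketch: efficiency of a hierarchical power-delivery scheme.  All numbers
     * are illustrative assumptions, not measurements of any real design. */
    #include <stdio.h>

    int main(void) {
        /* Per-core Vdd targets (V) in one spatially correlated cluster. */
        double vdd_core[] = {0.52, 0.54, 0.53, 0.55};
        int n = sizeof vdd_core / sizeof vdd_core[0];
        double svr_eff = 0.90;   /* assumed efficiency of the first-level SVR */

        /* The shared SVR must supply the highest Vdd needed in the cluster. */
        double vi = 0.0;
        for (int i = 0; i < n; i++)
            if (vdd_core[i] > vi) vi = vdd_core[i];

        double total_eff = 0.0;
        for (int i = 0; i < n; i++) {
            double ldo_eff = vdd_core[i] / vi;   /* linear-regulator drop loss */
            printf("core %d: Vo=%.2f V, Vi=%.2f V, LDO efficiency=%.1f%%\n",
                   i, vdd_core[i], vi, 100.0 * ldo_eff);
            total_eff += svr_eff * ldo_eff;
        }
        printf("average delivery efficiency: %.1f%%\n", 100.0 * total_eff / n);
        return 0;
    }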
A STREAMLINED ARCHITECTURE

Simple organization

For the highest energy efficiency, an extreme-scale architecture should be composed mostly of many simple, throughput-oriented cores, and rely on highly parallel execution. Low-Vdd operation substantially reduces the power consumption, which can then be leveraged by increasing the number of cores that execute in parallel—as long as the application can exploit the parallelism. Such cores should avoid speculation and complex hardware structures as much as possible.

Cores should be organized in clusters. Such an organization is energy efficient because process variation has spatial correlation and, therefore, nearby cores and memories have similar variation parameter values—which can be exploited by the scheduler. To further improve energy efficiency, a cluster typically contains a heterogeneous group of compute engines. For example, it can contain one wide superscalar core (also called a latency core) to run sequential or critical sections fast. The power delivery system should be configured so that this core can run at a high Vdd in a turbo-boost manner. Moreover, some of the cores may have special instructions, for example for synchronization or transcendental operations.

Minimizing energy in on-chip memories

A large low-Vdd chip can easily contain hundreds of Mbytes of on-chip memory. To improve memory reliability and energy efficiency, it is likely that SRAM cells will be redesigned for low Vdd [9]. In addition, to reduce leakage, such memory will likely operate at a higher Vdd than the logic. However, even accounting for this fact, the on-chip memories will incur substantial energy losses due to leakage.

To reduce this waste, the chip may support power gating of sections of the memory hierarchy—e.g. individual on-chip memory modules, individual ways of a memory module, or groups of lines in a memory module. In principle, this approach is appealing because a large fraction of such a large memory is likely to contain unneeded data at any given time. Unfortunately, it may be too coarse grained to make a significant impact on the total power consumed: to power gate a memory module, we need to be sure that none of the data in the module will be used soon. This situation may be rare in the general case. Instead, we need a fine-grained approach where we power on only the individual on-chip memory lines that contain data that will be accessed very soon.

To come close to this ideal scenario, we can use eDRAM rather than SRAM for the last levels of the cache hierarchy—either on- or off-chip. eDRAM has the advantage that it consumes much less leakage power than SRAM, which saves substantial energy. However, eDRAM needs to be refreshed. Fortunately, refresh is done at the fine granularity of a cache line, and we can design intelligent refresh schemes [10,11]. One intelligent refresh technique is to identify the lines that contain data that is likely to be used in the near future by the processors, and only refresh those lines in the eDRAM cache. The other lines are not refreshed and are marked as invalid—after being written back to the next level of the hierarchy if they were dirty. To identify such lines, we can dynamically use the history of line accesses [10] or programmer hints.
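A minimal sketch of such a history-based policy appears below. The per-line metadata, the reuse window, and the decision rule are illustrative assumptions for exposition, not the actual Refrint design [10].

    /* Sketch of a history-based refresh policy: refresh only eDRAM lines whose
     * recent access history suggests imminent reuse; write back dirty lines
     * that are not refreshed and mark them invalid. */
    #include <stdbool.h>
    #include <stdio.h>

    struct edram_line {
        bool valid, dirty;
        unsigned last_access;        /* cycle of the line's last access */
    };

    /* Assumed reuse heuristic: accessed within the last WINDOW cycles. */
    #define WINDOW 100000u

    static void refresh_pass(struct edram_line *lines, unsigned n, unsigned now) {
        for (unsigned i = 0; i < n; i++) {
            if (!lines[i].valid) continue;
            if (now - lines[i].last_access <= WINDOW) {
                printf("line %u: refresh (predicted reuse)\n", i);
            } else {
                if (lines[i].dirty)
                    printf("line %u: write back to next level\n", i);
                lines[i].valid = false;   /* stop refreshing this line */
                printf("line %u: invalidate\n", i);
            }
        }
    }

    int main(void) {
        struct edram_line lines[4] = {
            {true, false, 990000u}, {true, true, 200000u},
            {true, false, 995000u}, {false, false, 0u},
        };
        refresh_pass(lines, 4, 1000000u);
        return 0;
    }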
Another intelligent refresh technique is to refresh different parts of the eDRAM modules at different rates, exploiting the fact that different cells have different retention times. This approach relies on profiling the retention times of different on-chip eDRAM modules or regions. For example, one can exploit the spatial correlation of the retention times of the eDRAM cells [11]. With this technique, we may refresh most of the eDRAM with long refresh periods, and only a few small sections with the conventional, short refresh periods.

Minimizing energy in the on-chip network

The on-chip interconnection network of a large chip is another significant source of energy consumption. Given the importance of communication and the relative abundance of chip area, a good strategy is to have wide links and routers, and to power gate the parts of this hardware that are not in use at any given time. Hence, good techniques to monitor and predict network utilization are important.

On-chip networks are especially vulnerable to process variation because the network connects distant parts of the chip. As a result, it has to work both in the areas of the chip that have the slowest transistors and in those with the leakiest transistors. To address this problem, we can divide the network into multiple Vdd domains—each one including a few routers. Because of the systematic component of process variation, the routers in the same domain are likely to have similar values of the process variation parameters. Then, a controller can gradually reduce the Vdd of each domain dynamically, while monitoring for timing errors in the messages being transmitted. Such errors are already detected and handled by existing mechanisms in the network. When the controller observes an error rate in a domain that is higher than a certain threshold, it increases the Vdd of that domain slightly. In addition, the controller periodically decreases the Vdd of all the domains slightly, to account for changes in workloads and temperatures. Overall, with this approach, the Vdd of each domain converges to the lowest value that is still safe (without changing the frequency). We call this scheme Tangle [12].

REDUCING DATA MOVEMENT

As technology scales, data movement contributes an increasingly large fraction of the energy consumption in the chip [13]. Consequently, we need to devise approaches that minimize the amount of data transferred. In this section, we discuss a few ways to do so.

One approach is to organize the chip as a hierarchy of clusters of cores with memories. Then, the system software can colocate communicating threads and their data in the same cluster. This reduces the total amount of data movement needed.

Another technique consists of using a single address space in the chip and directly managing in software the movement of the application's data through the cache hierarchy. Many of the applications that will run on an extreme-scale 1000-core chip are likely to have relatively simple control and data structures—e.g. performing much of their computation in regular loops with analyzable array accesses. As a result, it is conceivable that a smart compiler performing extensive program analysis [14], possibly with help from the programmer, will be able to manage (and minimize) the movement of data in the on-chip memory hierarchy. In this case, the architecture would support simple instructions to manage the caches, rather than providing programmer-transparent hardware cache coherence. Such instructions can perform cache-entry invalidation and cache-entry writeback to the next level of the hierarchy, as sketched below.
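The fragment below illustrates how compiler- or programmer-managed caching might use two such instructions around a producer/consumer exchange. The names cache_writeback() and cache_invalidate() are hypothetical stand-ins for those instructions (no particular ISA is implied), and their stub bodies plus the single-threaded driver exist only to make the sketch self-contained.

    /* Hypothetical cache-management intrinsics; on a real machine each would be
     * a single instruction, so the stubs below are placeholders only. */
    #include <stdio.h>
    #include <stddef.h>

    static void cache_writeback(const void *addr, size_t bytes)  { (void)addr; (void)bytes; }
    static void cache_invalidate(const void *addr, size_t bytes) { (void)addr; (void)bytes; }

    #define N 1024
    static double buf[N];

    /* Producer core: fill buf, then push the dirty lines to the shared level of
     * the hierarchy so the consumer can see them without hardware coherence. */
    static void producer(void) {
        for (int i = 0; i < N; i++) buf[i] = i * 0.5;
        cache_writeback(buf, sizeof buf);
        /* ...then signal the consumer, e.g. via a flag or a full-empty bit... */
    }

    /* Consumer core: discard any stale private copy before reading. */
    static double consumer(void) {
        cache_invalidate(buf, sizeof buf);
        double sum = 0.0;
        for (int i = 0; i < N; i++) sum += buf[i];
        return sum;
    }

    int main(void) {
        producer();                          /* single-threaded driver */
        printf("sum = %.1f\n", consumer());
        return 0;
    }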
Plain writes do not invalidate other cached copies of the data, and plain reads return the closest valid copy of the data. While such a machine is certainly harder to program, it may eliminate some of the data-movement inefficiencies associated with a hardware cache coherence protocol—such as false sharing, or moving whole lines when only a fraction of the data in a line is used. In addition, by providing a single address space, we eliminate the need to copy data on communication, as in message-passing models.

A third way of reducing the amount of data transferred is to use processing in memory (PIM) [15]. The idea is to add simple processing engines close to, or embedded in, the main memory of the machine, and use them to perform some operations on the nearby data in memory—hence avoiding the round trip from the main processor to the memory. While PIM has been studied for at least 20 years, we may now see it become a reality. Companies are building 3D stacks that contain multiple memory dies on top of a logic die [16]. Currently, the logic die only includes advanced memory-controller functions plus self-test and error detection, correction, and repair. However, it is easy to imagine augmenting the capabilities of the logic die to support Intelligent Memory Operations [17]. These can consist of preprocessing the data as it is read from the DRAM stack into the processor chip. They can also involve performing operations in place on the DRAM data.

Finally, another means of reducing communication is to support efficient communication and synchronization hardware primitives, such as those that avoid spinning over the network. These may include dynamic hierarchical hardware barriers, or efficient point-to-point synchronization between two cores using hardware full-empty bits [18].

PROGRAMMING EXTREME-SCALE MACHINES

The system software in extreme-scale machines has to be aware of the process variation profile of the chip. This includes knowing, for each cluster, the Vdd and frequency it can support and the leakage power it dissipates. With this information, the system software can make scheduling decisions that maximize energy efficiency. Similarly, the system software should monitor different aspects of the hardware components, such as their usage, the energy they consume, and the temperature they reach. With this information, it can decide which components to power gate, or which Vdd and frequency settings to use—possibly with help from application hints.

Application software is likely to be harder to write for extreme-scale architectures than for conventional machines. This is because, to save energy in data transfers, the programmer has to carefully manage locality and minimize communication. Moreover, the use of low Vdd requires more concurrency to attain the same performance.

An important concern is how users will program these extreme-scale architectures. In practice, there are different types of programmers based on their expertise. Some are experts, in which case they will be able to map applications to the best clusters, set the Vdd and frequency of the clusters, and manage the data in the cache hierarchy well. They will obtain good energy efficiency. However, many programmers will likely be relatively inexperienced. Hence, they need a high-level programming model that is simple to program and allows them to express locality. One such model is Hierarchical Tiled Arrays (HTA) [19], which allows the computation to be expressed in recursive blocks or tiles.
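To give a flavor of the tiling idea (this is not the actual HTA library interface), the plain-C sketch below structures a reduction as tiles of tiles: the outer tiles could be mapped to clusters and the inner tiles to the cores within a cluster, keeping each tile's data local. The tile sizes are arbitrary.

    /* Sketch of a hierarchically tiled reduction.  The two tiling levels mirror
     * a cluster/core hierarchy; sizes are illustrative. */
    #include <stdio.h>

    #define OUTER 4      /* outer tiles, e.g. one per cluster */
    #define INNER 8      /* inner tiles, e.g. one per core    */
    #define LEAF  16     /* elements per innermost tile       */
    #define N (OUTER * INNER * LEAF)

    static double sum_leaf(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += a[i];
        return s;
    }

    int main(void) {
        static double a[N];
        for (int i = 0; i < N; i++) a[i] = 1.0;

        double total = 0.0;
        for (int o = 0; o < OUTER; o++) {          /* iterate over outer tiles */
            double cluster_sum = 0.0;
            for (int t = 0; t < INNER; t++)        /* iterate over inner tiles */
                cluster_sum += sum_leaf(&a[(o * INNER + t) * LEAF], LEAF);
            total += cluster_sum;                  /* cluster-level reduction  */
        }
        printf("sum = %.1f (expected %d)\n", total, N);
        return 0;
    }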
Another possible model is Concurrent Collections [20], which expresses the program in a dataflow-like manner. These are high-level models, and the compiler has to translate them into efficient machine code. For this, the compiler may have to rely on program autotuning to find the best code mapping on these complicated machines.

CONCLUSION

Attaining the 100–1000× improvement in energy efficiency required for extreme-scale computing involves rethinking the whole computing stack from the ground up for energy efficiency. In this paper, we have outlined some of the techniques that can be used at different levels of the computing stack. Specifically, we have discussed the need to operate at low voltage, provide multiple voltage domains, and support simple cores organized in clusters. Memories and networks can be optimized by reducing leakage and minimizing logic guardbands. Finally, data movement can be minimized by managing the data in the cache hierarchy in software, processing in memory, and using efficient synchronization. A major issue that remains for these machines is the challenge of programmability.

FUNDING

This work was supported in part by the National Science Foundation under grant CCF-1012759, DARPA under PERFECT Contract Number HR0011-12-2-0019, and DOE ASCR under Award Numbers DE-FC02-10ER2599 and DE-SC0008717.

REFERENCES

1. Moore GE. Electronics 1965; 38: 114–7.
2. Dennard RH, Gaensslen FH, Rideout VL et al. IEEE J Solid-State Circuits 1974; 9: 256–68.
3. Chang L, Montoye RK et al. Proceedings of the IEEE, February 2010.
4. Dreslinski RG, Wieckowski M, Blaauw D et al. Proceedings of the IEEE, 2010.
5. Karpuzcu UR, Sinkar AA, Kim NS et al. EnergySmart: toward energy-efficient manycores for near-threshold computing. In: International Symposium on High Performance Computer Architecture, Shenzhen, China, 2013.
6. James N, Restle P, Friedrich J et al. Comparison of split versus connected-core supplies in the POWER6 microprocessor. In: International Solid-State Circuits Conference, San Francisco, CA, 2007.
7. Wang H, Kim NS. Improving platform energy-chip area trade-off in near-threshold computing environment. In: International Conference on Computer Aided Design, San Jose, CA, 2013.
8. Ghasemi HR, Sinkar A, Schulte M et al. Cost-effective power delivery to support per-core voltage domains for power-constrained processors. In: Design Automation Conference, San Francisco, CA, 2012.
9. Gemmeke T, Sabry MM, Stuijt J et al. Resolving the memory bottleneck for single supply near-threshold computing. In: Conference on Design, Automation and Test in Europe, Dresden, Germany, 2014.
10. Agrawal A, Jain P, Ansari A et al. Refrint: intelligent refresh to minimize power in on-chip multiprocessor cache hierarchies. In: International Symposium on High Performance Computer Architecture, Shenzhen, China, 2013.
11. Agrawal A, Ansari A, Torrellas J. Mosaic: exploiting the spatial locality of process variation to reduce refresh energy in on-chip eDRAM modules. In: International Symposium on High Performance Computer Architecture, Orlando, FL, 2014.
12. Ansari A, Mishra A, Xu J et al. Tangle: route-oriented dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks. In: International Symposium on High Performance Computer Architecture, Orlando, FL, 2014.
13. Kogge P et al. ExaScale computing study: technology challenges in achieving exascale systems. DARPA-IPTO Sponsored Study, Washington, DC, 2008.
14. Feautrier P. Some efficient solutions to the affine scheduling problem, part I: one-dimensional time. 1996.
15. Kogge P. The EXECUBE approach to massively parallel processing. In: International Conference on Parallel Processing, St. Charles, IL, 1994.
16. Micron Technology Inc. Hybrid memory cube. 2011. http://www.micron.com/products/hybrid-memory-cube (15 January 2016, date last accessed).
17. Fraguela B, Feautrier P, Renau J et al. Programming the FlexRAM parallel intelligent memory system. In: International Symposium on Principles and Practice of Parallel Programming, San Diego, CA, 2003.
18. Smith BJ. Architecture and applications of the HEP multiprocessor computer system. In: Real-Time Signal Processing IV, 1982, 241–8.
19. Bikshandi G, Guo J, Hoeflinger D et al. Programming for parallelism and locality with hierarchically tiled arrays. In: International Symposium on Principles and Practice of Parallel Programming, New York, NY, 2006.
20. Budimlic Z, Chandramowlishwaran A, Knobe K et al. Multi-core implementations of the concurrent collections programming model. In: Workshop on Compilers for Parallel Computers, Zurich, Switzerland, 2009.

© The Author(s) 2016. Published by Oxford University Press on behalf of China Science Publishing & Media Ltd. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com.

TI - Extreme-scale computer architecture
JF - National Science Review
DO - 10.1093/nsr/nwv085
DA - 2016-03-01
UR - https://www.deepdyve.com/lp/oxford-university-press/extreme-scale-computer-architecture-fu0q7LA1J6
SP - 19
EP - 23
VL - 3
IS - 1
DP - DeepDyve
ER -