Minos: Architectural support for protecting control dataCrandall, Jedidiah R.; Wu, S. Felix; Chong, Frederic T.
doi: 10.1145/1187976.1187977pmid: N/A
We present Minos, a microarchitecture that implements Biba's low water-mark integrity policy on individual words of data. Minos stops attacks that corrupt control data to hijack program control flow, but is orthogonal to the memory model. Control data is any data that is loaded into the program counter on control-flow transfer, or any data used to calculate such data. The key is that Minos tracks the integrity of all data, but protects control flow by checking this integrity when a program uses the data for control transfer. Existing policies, in contrast, need to differentiate between control and noncontrol data a priori , a task made impossible by coercions between pointers and other data types, such as integers in the C language. Our implementation of Minos for Red Hat Linux 6.2 on a Pentium-based emulator is a stable, usable Linux system on the network on which we are currently running a web server (http://minos.cs.ucdavis.edu). Our emulated Minos systems running Linux and Windows have stopped ten actual attacks. Extensive full-system testing and real-world attacks have given us a unique perspective on the policy tradeoffs that must be made in any system, such as Minos; this paper details and discusses these. We also present a microarchitectural implementation of Minos that achieves negligible impact on cycle time with a small investment in die area, as well as and minor changes to the Linux kernel to handle the tag bits and perform virtual memory swapping.
Analysis of cache-coherence bottlenecks with hybrid hardware/software techniquesMarathe, Jaydeep; Mueller, Frank; de Supinski, Bronis R.
doi: 10.1145/1187976.1187978pmid: N/A
Application performance on high-performance shared-memory systems is often limited by sharing patterns resulting in cache-coherence bottlenecks. Current approaches to identify coherence bottlenecks incur considerable run-time overhead and do not scale. We present two novel hardware-assisted coherence-analysis techniques that reduce trace sizes by two orders of magnitude over full traces. First, hardware performance monitoring is combined with capturing stores in software to provide a lossy-trace mechanism, which is an order of magnitude faster than software-instrumentation-based full-tracing and retains accuracy. Second, selected long-latency loads are instrumented via binary rewriting, which provides even higher accuracy and control over tracing, but requires additional overhead.
Future execution: A prefetching mechanism that uses multiple cores to speed up single threadsGanusov, Ilya; Burtscher, Martin
doi: 10.1145/1187976.1187979pmid: N/A
This paper describes future execution (FE), a simple hardware-only technique to accelerate individual program threads running on multicore microprocessors. Our approach uses available idle cores to prefetch important data for the threads executing on the active cores. FE is based on the observation that many cache misses are caused by loads that execute repeatedly and whose address-generating program slices do not change (much) between consecutive executions. To exploit this property, FE dynamically creates a prefetching thread for each active core by simply sending a copy of all committed, register-writing instructions to an otherwise idle core. The key innovation is that on the way to the second core, a value predictor replaces each predictable instruction in the prefetching thread with a load immediate instruction, where the immediate is the predicted result that the instruction is likely to produce during its n th next dynamic execution. Executing this modified instruction stream (i.e., the prefetching thread) on another core allows to compute the future results of the instructions that are not directly predictable, issue prefetches into the shared memory hierarchy, and thus reduce the primary threads' memory access time. We demonstrate the viability and effectiveness of future execution by performing cycle-accurate simulations of a two-way CMP running the single-threaded SPECcpu2000 benchmark suite. Our mechanism improves program performance by 12%, on average, over a baseline that already includes an optimized hardware stream prefetcher. We further show that FE is complementary to runahead execution and that the combination of these two techniques raises the average speedup to 20% above the performance of the baseline processor with the aggressive stream prefetcher.
Evaluating trace cache energy efficiencyCo, Michele; Weikle, Dee A. B.; Skadron, Kevin
doi: 10.1145/1187976.1187980pmid: N/A
Future fetch engines need to be energy efficient. Much research has focused on improving fetch bandwidth. In particular, previous research shows that storing concatenated basic blocks to form instruction traces can significantly improve fetch performance. This work evaluates whether this concatenating of basic blocks translates to significant energy-efficiency gains. We compare processor performance and energy efficiency in trace caches compared to instruction caches. We find that, although trace caches modestly outperform instruction cache only alternatives, it is branch-prediction accuracy that really determines performance and energy efficiency. When access delay and area restrictions are considered, our results show that sequential trace caches achieve very similar performance and energy efficiency results compared to instruction cache-based fetch engines and show that the trace cache's failure to significantly outperform the instruction cache-based fetch organizations stems from the poorer implicit branch prediction from the next-trace predictor at smaller areas. Because access delay limits the theoretical performance of the evaluated fetch engines, we also propose a novel ahead-pipelined next-trace predictor. Our results show that an STC fetch organization with a three-stage, ahead-pipelined next-trace predictor can achieve 5--17% IPC and 29% ED 2 improvements over conventional, unpipelined organizations.
Effective management of multiple configurable units using dynamic optimizationHu, Shiwen; Valluri, Madhavi; John, Lizy Kurian
doi: 10.1145/1187976.1187981pmid: N/A
As one of the promising efforts to minimize the surging microprocessor power consumption, adaptive computing environments (ACEs), where microarchitectural resources can be dynamically tuned to match a program's run-time requirement and characteristics, are becoming increasingly common. In an ACE, efficient management of the configurable units (CUs) is vital for maximizing the benefit of resource adaptation. ACEs usually have multiple configurable hardware units, necessitating exploration of a large number of combinatorial configurations in order to identify the most energy-efficient configuration. In this paper, we propose an ACE management framework for efficient management of multiple CUs, utilizing dynamic optimization systems' inherent capabilities of detecting and optimizing program hotspots, i.e., dominate code regions. We develop a scheme where hotpot boundaries are used for phase detection and adaptation. The framework achieves good energy reduction on managing multiple CUs with minimal hardware requirements and low implement cost by leveraging the existing infrastructure of a dynamic optimization system. The proposed framework is evaluated by dynamically adapting five CUs with distinct reconfiguration latencies and overheads. Those CUs are issue queue, reorder buffer, level-one data and instruction caches, and level-two cache. Previous research indicates that those five components dominate the energy consumption of a microprocessor. Despite the growing complexity and overhead of adapting five CUs, our technique reduces the energy consumption of those CUs by as much as 45%, while one of the best techniques provided by prior literature achieves less than 15% energy reduction for all CUs.