Wells, Philip M.; Chakraborty, Koushik; Sohi, Gurindar S.
doi: 10.1145/1531793.1531797pmid: N/A
As the computing industry enters the multicore era, exponential growth in the number of transistors on a chip continues to present challenges and opportunities for computer architects and system designers. We examine one emerging issue in particular: that of dynamic heterogeneity, which can arise, even among physically homogeneous cores, from changing reliability, power, or thermal conditions, different cache and TLB contents, or changing resource configurations. This heterogeneity results in a constantly varying pool of hardware resources, which greatly complicates software's traditional task of assigning computation to cores. In part to address dynamic heterogeneity, we argue that hardware should take a more active role in the management of its computation resources. We propose hardware techniques to virtualize the cores of a multicore processor, allowing hardware to flexibly reassign the virtual processors that are exposed, even to a single operating system, to any subset of the physical cores. We show that multicore virtualization operates with minimal overhead, and that it enables several novel resource management applications for improving both performance and reliability.
Nagarajan, Vijay; Gupta, Rajiv
doi: 10.1145/1531793.1531798pmid: N/A
Runtime monitoring support serves as a foundation for the important tasks of providing security, performing debugging, and improving performance of applications. Often runtime monitoring requires the maintenance of information associated with each of the application's original memory location, which is held in corresponding shadow memory locations. Unfortunately, existing robust shadow memory implementations are inefficient. In this paper, we present OASES: OS and Architectural Support for Efficient Shadow memory implementation for multicores that is also robust. A combination of operating system support (in the form of coupled allocation of memory pages used by the application and associated shadow memory pages) and architectural support (in the form of ISA support and exposed cache events) is proposed. Our page allocation policy enables fast translation of original addresses into corresponding shadow memory addresses; thus allowing implicit addressing of shadow memory. By exposing the cache events to the software, we ensure in software that the shadow memory instructions execute atomically with their corresponding original memory instructions. Our experiments show that the overheads of runtime monitoring tasks are significantly reduced in comparison to previous software implementations.
Rafique, M. Mustafa; Rose, Benjamin; Butt, Ali R.; Nikolopoulos, Dimitrios S.
doi: 10.1145/1531793.1531800pmid: N/A
Asymmetric multi-core processors (AMPs) with general-purpose and specialized cores packaged on the same chip, are emerging as a leading paradigm for high-end computing. A large body of existing research explores the use of standalone AMPs in computationally challenging and data-intensive applications. AMPs are rapidly deployed as high-performance accelerators on clusters. In these settings, scheduling, communication and I/O are managed by generalpurpose processors (GPPs), while computation is off-loaded to AMPs. Design space exploration for the configuration and software stack of hybrid clusters of AMPs and GPPs is an open problem. In this paper, we explore this design space in an implementation of the popular MapReduce programming model. Our contributions are: An exploration of various design alternatives for hybrid asymmetric clusters of AMPs and GPPs; the adoption of a streaming approach to supporting MapReduce computations on clusters with asymmetric components; and adaptive schedulers that take into account individual component capabilities in asymmetric clusters. Throughout our design, we remove I/O bottlenecks, using double-buffering and asynchronous I/O. We present an evaluation of the design choices through experiments on a real cluster with MapReduce workloads of varying degrees of computation intensity. We find that in a cluster with resource-constrained and well-provisioned AMP accelerators, a streaming approach achieves 50.5% and 73.1% better performance compared to the non-streaming approach, respectively, and scales almost linearly with increasing number of compute nodes.We also show that our dynamic scheduling mechanisms adapt effectively the parameters of the scheduling policies between applications with different computation density.
Strong, Richard; Mudigonda, Jayaram; Mogul, Jeffrey C.; Binkert, Nathan; Tullsen, Dean
doi: 10.1145/1531793.1531801pmid: N/A
We address the software costs of switching threads between cores in a multicore processor. Fast core switching enables a variety of potential improvements, such as thread migration for thermal management, fine-grained load balancing, and exploiting asymmetric multicores, where performance asymmetry creates opportunities for more efficient resource utilization. Successful exploitation of these opportunities demands low core-switching costs. We describe our implementation of core switching in the Linux kernel, as well as software changes that can decrease switching costs. We use detailed simulations to evaluate several alternative implementations. We also explore how some simple architectural variations can reduce switching costs. We evaluate system efficiency using both real (but symmetric) hardware, and simulated asymmetric hardware, using both microbenchmarks and realistic applications.
Bratanov, Stanislav; Belenov, Roman; Manovich, Nikita
doi: 10.1145/1531793.1531802pmid: N/A
This article addresses a problem of performance monitoring inside virtual machines (VMs). It advocates focused monitoring of particular virtualized programs, explains the need for and the importance of such an approach to performance monitoring in virtualized execution environments, and emphasizes its benefits for virtual machine manufacturers, virtual machine users (mostly, software developers) and hardware (processor) manufacturers. The article defines the problem of in-VM performance monitoring as the ability to employ modern methods and hardware performance monitoring capabilities inside virtual machines to an extent comparable with what is being done in real environments. Unfortunately, there are numerous reasons preventing us from achieving such an ambitious goal, one of those reasons being the lack of support from virtualization engines; that is why a novel method of 'cooperative' performance data collection is disclosed. The method implies collection of performance data at physical hardware and simultaneous tracking of software states inside a virtual machine. Each statistically visible execution point of the virtualized software may then be associated with information on real hardware events. The method effectively enables time-based sampling of virtualized workloads combined with hardware event counting, is applicable to unmodified, commercially available virtual machines, and has competitive precision and overhead. The practical significance and value of the method are further illustrated by studying a parallel workload and uncovering virtualization-specific performance issues of multithreaded programs.
Azimi, Reza; Tam, David K.; Soares, Livio; Stumm, Michael
doi: 10.1145/1531793.1531803pmid: N/A
Multicore processors contain new hardware characteristics that are different from previous generation single-core systems or traditional SMP (symmetric multiprocessing) multiprocessor systems. These new characteristics provide new performance opportunities and challenges. In this paper, we show how hardware performance monitors can be used to provide a fine-grained, closely-coupled feedback loop to dynamic optimizations done by a multicore-aware operating system. These multicore optimizations are possible due to the advanced capabilities of hardware performance monitoring units currently found in commodity processors, such as execution pipeline stall breakdown and data address sampling. We demonstrate three case studies on how a multicore-aware operating system can use these online capabilities for (1) determining cache partition sizes, which helps reduce contention in the shared cache among applications, (2) detecting memory regions with bad cache usage, which helps in isolating these regions to reduce cache pollution, and (3) detecting sharing among threads, which helps in clustering threads to improve locality. Using realistic applications from standard benchmark suites, the following performance improvements were achieved: (1) up to 27% improvement in IPC (instructions-per-cycle) due to cache partition sizing; (2) up to 10% reduction in cache miss rates due to reduced cache pollution, resulting in up to 7% improvement in IPC; and (3) up to 70% reduction in remote cache accesses due to thread clustering, resulting in up to 7% application-level improvement.
Shelepov, Daniel; Saez Alcaide, Juan Carlos; Jeffery, Stacey; Fedorova, Alexandra; Perez, Nestor; Huang, Zhi Feng; Blagodurov, Sergey; Kumar, Viren
doi: 10.1145/1531793.1531804pmid: N/A
Future heterogeneous single-ISA multicore processors will have an edge in potential performance per watt over comparable homogeneous processors. To fully tap into that potential, the OS scheduler needs to be heterogeneity-aware, so it can match jobs to cores according to characteristics of both. We propose a Heterogeneity-Aware Signature-Supported scheduling algorithm that does the matching using per-thread architectural signatures, which are compact summaries of threads' architectural properties collected offline. The resulting algorithm does not rely on dynamic profiling, and is comparatively simple and scalable. We implemented HASS in OpenSolaris, and achieved average workload speedups of up to 13%, matching best static assignment, achievable only by an oracle. We have also implemented a dynamic IPC-driven algorithm proposed earlier that relies on online profiling. We found that the complexity, load imbalance and associated performance degradation resulting from dynamic profiling are significant challenges to using this algorithm successfully. As a result it failed to deliver expected performance gains and to outperform HASS.
Wentzlaff, David; Agarwal, Anant
doi: 10.1145/1531793.1531805pmid: N/A
The next decade will afford us computer chips with 100's to 1,000's of cores on a single piece of silicon. Contemporary operating systems have been designed to operate on a single core or small number of cores and hence are not well suited to manage and provide operating system services at such large scale. If multicore trends continue, the number of cores that an operating system will be managing will continue to double every 18 months. The traditional evolutionary approach of redesigning OS subsystems when there is insufficient parallelism will cease to work because the rate of increasing parallelism will far outpace the rate at which OS designers will be capable of redesigning subsystems. The fundamental design of operating systems and operating system data structures must be rethought to put scalability as the prime design constraint. This work begins by documenting the scalability problems of contemporary operating systems. These studies are used to motivate the design of a factored operating system (fos). fos is a new operating system targeting manycore systems with scalability as the primary design constraint, where space sharing replaces time sharing to increase scalability.We describe fos, which is built in a message passing manner, out of a collection of Internet inspired services. Each operating system service is factored into a set of communicating servers which in aggregate implement a system service. These servers are designed much in the way that distributed Internet services are designed, but instead of providing high level Internet services, these servers provide traditional kernel services and replace traditional kernel data structures in a factored, spatially distributed manner. fos replaces time sharing with space sharing. In other words, fos's servers are bound to distinct processing cores and by doing so do not fight with end user applications for implicit resources such as TLBs and caches. We describe how fos's design is well suited to attack the scalability challenge of future multicores and discuss how traditional application-operating systems interfaces can be redesigned to improve scalability.
Moreto, Miquel; Cazorla, Francisco J.; Ramirez, Alex; Sakellariou, Rizos; Valero, Mateo
doi: 10.1145/1531793.1531806pmid: N/A
Current multicore architectures offer high throughput by increasing hardware resource utilization. As the number of cores in a multicore system increases, providing Quality of Service (QoS) to applications in addition to throughput is becoming an important problem. In this work, we present FlexDCP, a framework that allows the Operating System (OS) to guarantee a QoS for each application running in a chip multiprocessor. FlexDCP directly estimates the performance of applications for different cache configurations instead of using indirect measures of performance like the number of misses. This information allows the OS to convert QoS requirements into resource assignments. Consequently, it offers more flexibility to the OS as it can optimize different QoS metrics like per-application performance or global performance metrics such as fairness, weighted speed up or throughput. Our results show that FlexDCP is able to force applications in a workload to run at a certain percentage of their maximum performance in 94% of the cases considered, being on average 1:48% under the objective for remaining cases. When optimizing a global QoS metric like fairness, FlexDCP consistently outperforms traditional eviction policies like LRU, pseudo LRU and previous dynamic cache partitioning proposals for two-, four- and eightcore configurations. In an eight-core architecture FlexDCP obtains a fairness improvement of 10:1% over Fair, the best policy in the literature optimizing fairness.
Showing 1 to 10 of 18 Articles