1987
Hardware architectures for programming languages and programming languages for hardware architectures. Niklaus Wirth. Pages: 2-8 (doi:10.1145/36206.36178). Programming Languages and Operating Systems introduce abstractions which allow the programmer to ignore details of an implementation. Support of an abstraction must not only concentrate on promoting the efficiency of an implementation, but also on providing ...
VLSI assist for a multiprocessor. Bob Beck, Bob Kasten, Shreekant Thakkar. Pages: 10-20 (doi:10.1145/36206.36179). Multiprocessors have long been of interest to the computer community. They provide the potential for accelerating applications through parallelism and increased throughput for large multi-user systems. Three factors have limited the commercial success of ...
Architectural support for multilanguage parallel programming on heterogeneous systems. Roberto Bisiani, Alessandro Forin. Pages: 21-30 (doi:10.1145/36206.36180). We have designed and implemented a software facility, called Agora, that supports the development of parallel applications written in multiple languages. At the core of Agora there is a mechanism that allows concurrent computations to share data structures ...
Machine-independent virtual memory management for paged uniprocessor and multiprocessor architectures. Richard Rashid, Avadis Tevanian, Michael Young, David Golub, Robert Baron, David Black, William Bolosky, Jonathan Chew. Pages: 31-39 (doi:10.1145/36206.36181). This paper describes the design and implementation of virtual memory management within the CMU Mach Operating System and the experiences gained by the Mach kernel group in porting that system to a variety of architectures. As of this writing, Mach runs ...
An architecture for the direct execution of the Forth programming language. John R. Hayes, Martin E. Fraeman, Robert L.
Williams, Thomas Zaremba. Pages: 42-49 (doi:10.1145/36206.36182). We have developed a simple direct execution architecture for a 32 bit Forth microprocessor. The processor can directly access a linear address space of over 4 gigawords. Two instruction types are defined; a subroutine call, and a user defined microcode ...
Tags and type checking in LISP: hardware and software approaches. Peter Steenkiste, John Hennessy. Pages: 50-59 (doi:10.1145/36206.36183). One of the major factors that distinguishes LISP from many other languages (Pascal, C, Fortran, etc.) is the need for run-time type checking. Run-time type checking is implemented by adding to each data object a tag that encodes type information. Tags ...
The effect of instruction set complexity on program size and memory performance. Jack W. Davidson, Richard A. Vaughan. Pages: 60-64 (doi:10.1145/36206.36184). One potential disadvantage of a machine with a reduced instruction set is that object programs may be substantially larger than those for a machine with a richer, more complex instruction set. The main reason is that a small instruction set will require ...
The Dragon processor. Russell R. Atkinson, Edward M. McCreight. Pages: 65-69 (doi:10.1145/36206.36185). The Xerox PARC Dragon is a VLSI research computer that uses several techniques to achieve dense code and fast procedure calls in a system that can support multiple processors on a central high bandwidth memory bus.
Coherency for multiprocessor virtual address caches. James R. Goodman. Pages: 72-81 (doi:10.1145/36206.36186). A multiprocessor cache memory system is described that supplies data to the processor based on virtual addresses, but maintains consistency in the main memory, both across caches and across virtual address spaces. Pages in the same or different address ...
Cheap hardware support for software debugging and profiling. T. A. Cargill, B. N.
Locanthi. Pages: 82-83 (doi:10.1145/36206.36187). We wish to determine the effectiveness of some simple hardware for debugging and profiling compiled programs on a conventional processor. The hardware cost is small: a counter decremented on each instruction that raises an exception when its value ...
An experimental coprocessor for implementing persistent objects on an IBM 4381. C. J. Georgiou, S. L. Palmer, P. L. Rosenfeld. Pages: 84-87 (doi:10.1145/36206.36188). In this paper we describe an experimental coprocessor for an IBM 4381 that is designed to facilitate the exploration of persistent objects.
Integer multiplication and division on the HP Precision Architecture. Daniel J. Magenheimer, Liz Peters, Karl Pettis, Dan Zuras. Pages: 90-99 (doi:10.1145/36206.36189). In recent years, many architectural design efforts have focused on maximizing performance for frequently executed, simple instructions. Although these efforts have resulted in machines with better average price/performance ratios, certain complex instructions ...
The Mahler experience: using an intermediate language as the machine description. David W. Wall, Michael L. Powell. Pages: 100-104 (doi:10.1145/36206.36190). Division of a compiler into a front end and a back end that communicate via an intermediate language is a well-known technique. We go farther and use the intermediate language as the official description of a family of machines with simple instruction ...
A study of scalar compilation techniques for pipelined supercomputers. Shlomo Weiss, James E. Smith. Pages: 105-109 (doi:10.1145/36206.36191). This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size ...
Compiling Smalltalk-80 to a RISC. William R.
Bush, A. Dain Samples, David Ungar, Paul N. Hilfinger. Pages: 112-116 (doi:10.1145/36206.36192). The Smalltalk On A RISC project at U. C. Berkeley proves that a high-level object-oriented language can attain high performance on a modified reduced instruction set architecture. The single most important optimization is the removal of a layer of interpretation, ...
How many addressing modes are enough? F. Chow, S. Correll, M. Himelstein, E. Killian, L. Weber. Pages: 117-121 (doi:10.1145/36206.36193). Programs naturally require a variety of memory-addressing modes. It isn't necessary to provide them in hardware, however, if a compiler can synthesize them from a few primitive modes. This not only simplifies the hardware, but also permits the compiler ...
Superoptimizer: a look at the smallest program. Henry Massalin. Pages: 122-126 (doi:10.1145/36206.36194). Given an instruction set, the superoptimizer finds the shortest program to compute a function. Startling programs have been generated, many of them engaging in convoluted bit-fiddling bearing little resemblance to the source programs which defined the ...
Performance and architectural evaluation of the PSI machine. Kazuo Taki, Katzuto Nakajima, Hiroshi Nakashima, Morihiro Ikeda. Pages: 128-135 (doi:10.1145/36206.36195). We evaluated a Prolog machine PSI (Personal Sequential Inference machine) for the purpose of improving and redesigning it. In this evaluation, we measured the execution speed and the dynamic characteristics of cache memory, register file, and branching ...
RISCs vs. CISCs for Prolog: a case study. Gaetano Borriello, Andrew R. Cherenson, Peter B. Danzig, Michael N. Nelson. Pages: 136-145 (doi:10.1145/36206.36196). This paper compares the performance of executing compiled Prolog code on two different architectures under development at U. C. Berkeley.
The first is the PLM, a special-purpose CISC architecture intended as a coprocessor for a host machine. The second ...
A RISC architecture for symbolic computation. Richard B. Kieburtz. Pages: 146-155 (doi:10.1145/36206.36197). The G-machine is a language-directed processor architecture designed to support graph reduction as a model of computation. It can carry out lazy evaluation of functional language programs and can evaluate programs in which logical variables are used. ...
Design tradeoffs to support the C programming language in the CRISP microprocessor. David R. Ditzel, Hubert R. McLellan, Alan D. Berenbaum. Pages: 158-163 (doi:10.1145/36206.36198).
Firefly: a multiprocessor workstation. Charles P. Thacker, Lawrence C. Stewart. Pages: 164-172 (doi:10.1145/36206.36199). Firefly is a shared-memory multiprocessor workstation that contains from one to seven MicroVAX 78032 processors, each with a floating point unit and a sixteen kilobyte cache. The caches are coherent, so that all processors see a consistent view of main ...
Pipelining and performance in the VAX 8800 processor. Douglas W. Clark. Pages: 173-177 (doi:10.1145/36206.36200). The VAX 8800 family (models 8800, 8700, 8550), currently the fastest computers in the VAX product line, achieve their speed through a combination of fast cycle time and deep pipelining. Rather than pipeline highly variable VAX instructions as such, the ...
A VLIW architecture for a trace scheduling compiler. Robert P. Colwell, Robert P. Nix, John J. O'Donnell, David B. Papworth, Paul K. Rodman. Pages: 180-192 (doi:10.1145/36206.36201). Very Long Instruction Word (VLIW) architectures were promised to deliver far more than the factor of two or three that current architectures achieve from overlapped execution. Using a new type of compiler which compacts ordinary sequential code into ...
Parallel computers for graphics applications. Adam Levinthal, Pat Hanrahan, Mike Paquette, Jim Lawson. Pages: 193-198 (doi:10.1145/36206.36202). Specialized computer architectures can provide better price/performance for executing image processing and graphics applications than general purpose designs. Two processors are presented that use parallel SIMD data paths to support common graphics data ...
The ZS-1 central processor. J. E. Smith, G. E. Dermer, B. D. Vanderwarn, S. D. Klinger, C. M. Rozewski. Pages: 199-204 (doi:10.1145/36206.36203). The Astronautics ZS-1 is a high speed, 64-bit computer system designed for scientific and engineering applications. The ZS-1 central processor uses a decoupled architecture, which splits instructions into two streams, one for fixed point/memory address ...
1989
Architecture and compiler tradeoffs for a long instruction word processor. Robert Cohn, Thomas Gross, Monica Lam. Pages: 2-14 (doi:10.1145/70082.68183). A very long instruction word (VLIW) processor exploits parallelism by controlling multiple operations in a single instruction word. This paper describes the architecture and compiler tradeoffs in the design of iWarp, a VLIW single-chip microprocessor ...
Tradeoffs in instruction format design for horizontal architectures. Gurindar S. Sohi, Sriram Vajapeyam. Pages: 15-25 (doi:10.1145/70082.68184). With recent improvements in software techniques and the enhanced level of fine grain parallelism made available by such techniques, there has been an increased interest in horizontal architectures and large instruction words that are capable of issuing ...
Overlapped loop support in the Cydra 5. James C. Dehnert, Peter Y.-T. Hsu, Joseph P. Bratt. Pages: 26-38 (doi:10.1145/70082.68185). The Cydra™ 5 architecture adds unique support for overlapping successive iterations of a loop to a very long instruction word (VLIW) base.
This architecture allows highly parallel loop execution for a much larger ...
Architectural support for synchronous task communication. F. J. Burkowski, G. V. Cormack, G. D. P. Dueck. Pages: 40-53 (doi:10.1145/70082.68186). This paper describes the motivation for a set of intertask communication primitives, the hardware support of these primitives, the architecture used in the Sylvan project which studies these issues, and the experience gained from various experiments ...
The fuzzy barrier: a mechanism for high speed synchronization of processors. Rajiv Gupta. Pages: 54-63 (doi:10.1145/70082.68187). Parallel programs are commonly written using barriers to synchronize parallel processes. Upon reaching a barrier, a processor must stall until all participating processors reach the barrier. A software implementation of the barrier mechanism using shared ...
Efficient synchronization primitives for large-scale cache-coherent multiprocessors. James R. Goodman, Mary K. Vernon, Philip J. Woest. Pages: 64-75 (doi:10.1145/70082.68188). This paper proposes a set of efficient primitives for process synchronization in multiprocessors. The only assumptions made in developing the set of primitives are that hardware combining is not implemented in the inter-connect, and (in one case) that ...
A software instruction counter. J. M. Mellor-Crummey, T. J. LeBlanc. Pages: 78-86 (doi:10.1145/70082.68189). Although several recent papers have proposed architectural support for program debugging and profiling, most processors do not yet provide even basic facilities, such as an instruction counter. As a result, system developers have been forced to invent ...
Efficient debugging primitives for multiprocessors. Z. Aral, I. Gertner, G.
Schaffer. Pages: 87-95 (doi:10.1145/70082.68190). Existing kernel-level debugging primitives are inappropriate for instrumenting complex sequential or parallel programs. These functions incur a heavy overhead in their use of system calls and process switches. Context switches are used to alternately ...
Sheaved memory: architectural support for state saving and restoration in paged systems. M. E. Staknis. Pages: 96-102 (doi:10.1145/70082.68191). The concept of read-one/write-many paged memory is introduced and given the name sheaved memory. It is shown that sheaved memory is useful for efficiently maintaining checkpoints in main memory and for providing state saving and state ...
Reference history, page size, and migration daemons in local/remote architectures. M. A. Holliday. Pages: 104-112 (doi:10.1145/70082.68192). We address the problem of paged main memory management in the local/remote architecture subclass of shared memory multiprocessors. We consider the case where the operating system has primary responsibility and uses page migration as its main tool. We ...
Translation lookaside buffer consistency: a software approach. D. L. Black, R. F. Rashid, D. B. Golub, C. R. Hill. Pages: 113-122 (doi:10.1145/70082.68193). We discuss the translation lookaside buffer (TLB) consistency problem for multiprocessors, and introduce the Mach shootdown algorithm for maintaining TLB consistency in software. This algorithm has been implemented on several multiprocessors, and is ...
Failure correction techniques for large disk arrays. G. A. Gibson, L. Hellerstein, R. M. Karp, D. A. Patterson. Pages: 123-132 (doi:10.1145/70082.68194). The ever increasing need for I/O bandwidth will be met with ever larger arrays of disks. These arrays require redundancy to protect against data loss. This paper examines alternative choices for encodings, or codes, that reliably store information ...
A unified vector/scalar floating-point architecture. N. P. Jouppi, J. Bertoni, D. W. Wall. Pages: 134-143 (doi:10.1145/70082.68195). In this paper we present a unified approach to vector and scalar computation, using a single register file for both scalar operands and vector elements. The goal of this architecture is to yield improved scalar performance while broadening the range ...
Data buffering: run-time versus compile-time support. H. Mulder. Pages: 144-151 (doi:10.1145/70082.68196). Data-dependency, branch, and memory-access penalties are main constraints on the performance of high-speed microprocessors. The memory-access penalties concern both penalties imposed by external memory (e.g. cache) or by under utilization of the local ...
An analysis of 8086 instruction set usage in MS DOS programs. T. L. Adams, R. E. Zimmerman. Pages: 152-160 (doi:10.1145/70082.68197).
A real-time support processor for Ada tasking. J. Roos. Pages: 162-171 (doi:10.1145/70082.68198). Task synchronization in Ada causes excessive run-time overhead due to the complex semantics of the rendezvous. To demonstrate that the speed can be increased by two orders of magnitude by using special purpose hardware, a single chip VLSI support processor ...
The runtime environment for Scheme, a Scheme implementation on the 88000. Steven R. Vegdahl, Uwe F. Pleban. Pages: 172-182 (doi:10.1145/70082.68199). We are implementing a Scheme development system for the Motorola 88000. The core of the implementation is an optimizing native code compiler, together with a carefully designed runtime system. This paper describes our experiences with the 88000 as a ...
Program optimization for instruction caches. S. McFarling. Pages: 183-191 (doi:10.1145/70082.68200). This paper presents an optimization algorithm for reducing instruction cache misses.
The algorithm uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and ...
Using registers to optimize cross-domain call performance. Paul A. Karger. Pages: 194-204 (doi:10.1145/70082.68201). This paper describes a new technique to improve the performance of cross-domain calls and returns in a capability-based computer system. Using register optimization information obtained from the compiler, a trusted linker can minimize the number of registers ...
The design of Nectar: a network backplane for heterogeneous multicomputers. Emmanuel Arnould, H. T. Kung, Francois Bitz, Robert D. Sansom, Eric C. Cooper. Pages: 205-216 (doi:10.1145/70082.68202). Nectar is a "network backplane" for use in heterogeneous multicomputers. The initial system consists of a star-shaped fiber-optic network with an aggregate bandwidth of 1.6 gigabits/second and a switching latency of 700 nanoseconds. The system ...
A message driven OR-parallel machine. S. A. Delgado-Rannauro, T. J. Reynolds. Pages: 217-228 (doi:10.1145/70082.68203). A message driven architecture for the execution of OR-parallel logic languages is proposed. The computational model is based on well known compilation techniques for Logic Languages. We present first the multiple binding mechanism for the OR-parallel ...
Evaluating the performance of software cache coherence. S. Owicki, A. Agarwal. Pages: 230-242 (doi:10.1145/70082.68204). In a shared-memory multiprocessor with private caches, cached copies of a data item must be kept consistent. This is called cache coherence. Both hardware and software coherence schemes have been proposed. Software techniques are attractive because they ...
Analysis of cache invalidation patterns in multiprocessors. W. Weber, A.
Gupta. Pages: 243-256 (doi:10.1145/70082.68205). To make shared-memory multiprocessors scalable, researchers are now exploring cache coherence protocols that do not rely on broadcast, but instead send invalidation messages to individual caches that contain stale data. The feasibility of such directory-based ...
The effect of sharing on the cache and bus performance of parallel programs. S. J. Eggers, R. H. Katz. Pages: 257-270 (doi:10.1145/70082.68206). Bus bandwidth ultimately limits the performance, and therefore the scale, of bus-based, shared memory multiprocessors. Previous studies have extrapolated from uniprocessor measurements and simulations to estimate the performance of these machines. ...
Available instruction-level parallelism for superscalar and superpipelined machines. N. P. Jouppi, D. W. Wall. Pages: 272-282 (doi:10.1145/70082.68207). Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to ...
Micro-optimization of floating-point operations. W. J. Dally. Pages: 283-289 (doi:10.1145/70082.68208). This paper describes micro-optimization, a technique for reducing the operation count and time required to perform floating-point calculations. Micro-optimization involves breaking floating-point operations into their constituent micro-operations and ...
Limits on multiple instruction issue. M. D. Smith, M. Johnson, M. A. Horowitz. Pages: 290-302 (doi:10.1145/70082.68209). This paper investigates the limitations on designing a processor which can sustain an execution rate of greater than one instruction per cycle on highly-optimized, non-scientific applications. We have used trace-driven simulations to determine that these ...
1991
A variable instruction stream extension to the VLIW architecture. Andrew Wolfe, John P. Shen. Pages: 2-14 (doi:10.1145/106972.106976).
Reducing the branch penalty by rearranging instructions in a double-width memory. Manolis Katevenis, Nestoras Tzartzanis. Pages: 15-27 (doi:10.1145/106972.106977).
The floating point performance of a superscalar SPARC processor. Roland L. Lee, Alex Y. Kwok, Fayé A. Briggs. Pages: 28-37 (doi:10.1145/106972.106978).
Software prefetching. David Callahan, Ken Kennedy, Allan Porterfield. Pages: 40-52 (doi:10.1145/106972.106979).
High-bandwidth data memory systems for superscalar processors. Gurindar S. Sohi, Manoj Franklin. Pages: 53-62 (doi:10.1145/106972.106980).
The cache performance and optimizations of blocked algorithms. Monica D. Lam, Edward E. Rothberg, Michael E. Wolf. Pages: 63-74 (doi:10.1145/106972.106981).
The effect of context switches on cache performance. Jeffrey C. Mogul, Anita Borg. Pages: 75-84 (doi:10.1145/106972.106982).
A portable interface for on-the-fly instruction space modification. David Keppel. Pages: 86-95 (doi:10.1145/106972.106983).
Virtual memory primitives for user programs. Andrew W. Appel, Kai Li. Pages: 96-107 (doi:10.1145/106972.106984).
The interaction of architecture and operating system design. Thomas E. Anderson, Henry M. Levy, Brian N. Bershad, Edward D. Lazowska. Pages: 108-120 (doi:10.1145/106972.106985).
Integrating register allocation and instruction scheduling for RISCs. David G. Bradlee, Susan J. Eggers, Robert R. Henry. Pages: 122-131 (doi:10.1145/106972.106986).
Code generation for streaming: an access/execute mechanism. Manuel E. Benitez, Jack W.
Davidson. Pages: 132-141 (doi:10.1145/106972.106987).
Efficient implementation of high-level parallel programs. Rajive Bagrodia, Sharad Mathur. Pages: 142-151 (doi:10.1145/106972.376053).
Vector register design for polycyclic vector scheduling. William Mangione-Smith, Santosh G. Abraham, Edward S. Davidson. Pages: 154-163 (doi:10.1145/106972.328664).
Fine-grain parallelism with minimal hardware support: a compiler-controlled threaded abstract machine. David E. Culler, Anurag Sah, Klaus E. Schauser, Thorsten von Eicken, John Wawrzynek. Pages: 164-175 (doi:10.1145/106972.106990).
Limits of instruction-level parallelism. David W. Wall. Pages: 176-188 (doi:10.1145/106972.106991).
Performance consequences of parity placement in disk arrays. Edward K. Lee, Randy H. Katz. Pages: 190-199 (doi:10.1145/106972.106992).
Combining the concepts of compression and caching for a two-level filesystem. Vincent Cate, Thomas Gross. Pages: 200-211 (doi:10.1145/106972.106993).
NUMA policies and their relation to memory architecture. William J. Bolosky, Michael L. Scott, Robert P. Fitzgerald, Robert J. Fowler, Alan L. Cox. Pages: 212-221 (doi:10.1145/106972.106994).
LimitLESS directories: A scalable cache coherence scheme. David Chaiken, John Kubiatowicz, Anant Agarwal. Pages: 224-234 (doi:10.1145/106972.106995).
An efficient cache-based access anomaly detection scheme. Sang L. Min, Jong-Deok Choi. Pages: 235-244 (doi:10.1145/106972.106996).
Performance evaluation of memory consistency models for shared-memory multiprocessors. Kourosh Gharachorloo, Anoop Gupta, John Hennessy. Pages: 245-257 (doi:10.1145/106972.106997).
Process coordination with fetch-and-increment. Eric Freudenthal, Allan Gottlieb. Pages: 260-268 (doi:10.1145/106972.106998).
Synchronization without contention. John M. Mellor-Crummey, Michael L.
Scott. Pages: 269-278 (doi:10.1145/106972.106999).
The case for a read barrier. Douglas Johnson. Pages: 279-287 (doi:10.1145/106972.107000).
An analysis of MIPS and SPARC instruction set utilization on the SPEC benchmarks. Robert F. Cmelik, Shing I. Kong, David R. Ditzel, Edmund J. Kelly. Pages: 290-302 (doi:10.1145/106972.107001).
Performance characteristics of architectural features of the IBM RISC System/6000. C. Brian Hall, Kevin O'Brien. Pages: 303-309 (doi:10.1145/106972.107002).
Performance from architecture: comparing a RISC and a CISC with similar hardware organization. Dileep Bhandarkar, Douglas W. Clark. Pages: 310-319 (doi:10.1145/106972.107003).
1992
On-line data compression in a log-structured file system. Michael Burrows, Charles Jerian, Butler Lampson, Timothy Mann. Pages: 2-9 (doi:10.1145/143365.143376).
Non-volatile memory for fast, reliable file systems. Mary Baker, Satoshi Asami, Etienne Deprit, John Ousterhout, Margo Seltzer. Pages: 10-22 (doi:10.1145/143365.143380).
Parity declustering for continuous operation in redundant disk arrays. Mark Holland, Garth A. Gibson. Pages: 23-35 (doi:10.1145/143365.143383).
Software support for speculative loads. Anne Rogers, Kai Li. Pages: 38-50 (doi:10.1145/143365.143484).
Reducing memory latency via non-blocking and prefetching caches. Tien-Fu Chen, Jean-Loup Baer. Pages: 51-61 (doi:10.1145/143365.143486).
Design and evaluation of a compiler algorithm for prefetching. Todd C. Mowry, Monica S. Lam, Anoop Gupta. Pages: 62-73 (doi:10.1145/143365.143488).
Improving the accuracy of dynamic branch prediction using branch correlation. Shien-Tai Pan, Kimming So, Joseph T. Rahmeh. Pages: 76-84 (doi:10.1145/143365.143490).
Predicting conditional branch directions from previous runs of a program. Joseph A. Fisher, Stefan M.
Freudenberger. Pages: 85-95 (doi:10.1145/143365.143493).
High speed switch scheduling for local area networks. Thomas E. Anderson, Susan S. Owicki, James B. Saxe, Charles P. Thacker. Pages: 98-110 (doi:10.1145/143365.143495).
A tightly-coupled processor-network interface. Dana S. Henry, Christopher F. Joerg. Pages: 111-122 (doi:10.1145/143365.143497).
Consistency management for virtually indexed caches. Bob Wheeler, Brian N. Bershad. Pages: 124-136 (doi:10.1145/143365.143499).
Eliminating the address translation bottleneck for physical address cache. Tzi-cker Chiueh, Randy H. Katz. Pages: 137-148 (doi:10.1145/143365.143501).
A performance evaluation of optimal hybrid cache coherency protocols. Jack E. Veenstra, Robert J. Fowler. Pages: 149-160 (doi:10.1145/143365.143503).
Characterizing the caching and synchronization performance of a multiprocessor operating system. Josep Torrellas, Anoop Gupta, John Hennessy. Pages: 162-174 (doi:10.1145/143365.143506).
Architecture support for single address space operating systems. Eric J. Koldinger, Jeffrey S. Chase, Susan J. Eggers. Pages: 175-186 (doi:10.1145/143365.143508).
Application-controlled physical memory using external page-cache management. Kieran Harty, David R. Cheriton. Pages: 187-197 (doi:10.1145/143365.143511).
Efficient data breakpoints. Robert Wahbe. Pages: 200-212 (doi:10.1145/143365.143518).
Migrating a CISC computer family onto RISC via object code translation. Kristy Andrews, Duane Sand. Pages: 213-222 (doi:10.1145/143365.143520).
Fast mutual exclusion for uniprocessors. Brian N. Bershad, David D. Redell, John R. Ellis. Pages: 223-233 (doi:10.1145/143365.143523). In this paper we describe restartable atomic sequences, an optimistic mechanism for implementing simple atomic operations (such as Test-And-Set) on a uniprocessor.
A thread that is suspended within a restartable atomic ...
Sentinel scheduling for VLIW and superscalar processors. Scott A. Mahlke, William Y. Chen, Wen-mei W. Hwu, B. Ramakrishna Rau, Michael S. Schlansker. Pages: 238-247 (doi:10.1145/143365.143529). Speculative execution is an important source of parallelism for VLIW and superscalar processors. A serious challenge with compiler-controlled speculative execution is to accurately detect and report all program execution errors at the time of occurrence. ...
Efficient superscalar performance through boosting. Michael D. Smith, Mark Horowitz, Monica S. Lam. Pages: 248-259 (doi:10.1145/143365.143534). The foremost goal of superscalar processor design is to increase performance through the exploitation of instruction-level parallelism (ILP). Previous studies have shown that speculative execution is required for high instruction per cycle (IPC) rates ...
Cooperative shared memory: software and hardware for scalable multiprocessor. Mark D. Hill, James R. Larus, Steven K. Reinhardt, David A. Wood. Pages: 262-273 (doi:10.1145/143365.143537). We believe the absence of massively-parallel, shared-memory machines follows from the lack of a shared-memory programming performance model that can inform programmers of the cost of operations (so they can avoid expensive ones) and can tell hardware ...
Closing the window of vulnerability in multiphase memory transactions. John Kubiatowicz, David Chaiken, Anant Agarwal. Pages: 274-284 (doi:10.1145/143365.143540). Multiprocessor architects have begun to explore several mechanisms such as prefetching, context-switching and software-assisted dynamic cache-coherence, which transform single-phase memory transactions in conventional memory systems into multiphase operations. ...
Access normalization: loop restructuring for NUMA compilers. Wei Li, Keshav Pingali. Pages: 285-295 (doi:10.1145/143365.143541). In scalable parallel machines, processors can make local memory accesses much faster than they can make remote memory accesses. In addition, when a number of remote accesses must be made, it is usually more efficient to use block transfers of data rather ...
1994
Separating data and control transfer in distributed operating systems. Chandramohan A. Thekkath, Henry M. Levy, Edward D. Lazowska. Pages: 2-11 (doi:10.1145/195473.195481). Advances in processor architecture and technology have resulted in workstations in the 100+ MIPS range. As well, newer local-area networks such as ATM promise a ten- to hundred-fold increase in throughput, much reduced latency, greater scalability, and ...
Scheduling and page migration for multiprocessor compute servers. Rohit Chandra, Scott Devine, Ben Verghese, Anoop Gupta, Mendel Rosenblum. Pages: 12-24 (doi:10.1145/195473.195485). Several cache-coherent shared-memory multiprocessors have been developed that are scalable and offer a very tight coupling between the processing resources. They are therefore quite attractive for use as compute servers for multiprogramming and parallel ...
Reactive synchronization algorithms for multiprocessors. Beng-Hong Lim, Anant Agarwal. Pages: 25-35 (doi:10.1145/195473.195490). Synchronization algorithms that are efficient across a wide range of applications and operating conditions are hard to design because their performance depends on unpredictable run-time factors. The designer of a synchronization algorithm has a choice ...

Integration of message passing and shared memory in the Stanford FLASH multiprocessor
John Heinlein, Kourosh Gharachorloo, Scott Dresser, Anoop Gupta
Pages: 38 - 50 doi>10.1145/195473.195494
The advantages of using message passing over shared memory for certain types of communication and synchronization have provided an incentive to integrate both models within a single architecture. A key goal of the FLASH (FLexible Architecture for SHared ...

Software overhead in messaging layers: where does the time go?
Vijay Karamcheti, Andrew A. Chien
Pages: 51 - 60 doi>10.1145/195473.195499
Despite improvements in network interfaces and software messaging layers, software communication overhead still dominates the hardware routing cost in most systems. In this study, we identify the sources of this overhead by analyzing software costs of ...

Where is time spent in message-passing and shared-memory programs?
Satish Chandra, James R. Larus, Anne Rogers
Pages: 61 - 73 doi>10.1145/195473.195501
Message passing and shared memory are two techniques parallel programs use for coordination and communication. This paper studies the strengths and weaknesses of these two mechanisms by comparing equivalent, well-written message-passing and shared-memory ...

Performance of a hardware-assisted real-time garbage collector
William J. Schmidt, Kelvin D. Nilsen
Pages: 76 - 85 doi>10.1145/195473.195504
Hardware-assisted real-time garbage collection offers high throughput and small worst-case bounds on the times required to allocate dynamic objects and to access the memory contained within previously allocated objects. Whether the proposed technology ...

eNVy: a non-volatile, main memory storage system
Michael Wu, Willy Zwaenepoel
Pages: 86 - 97 doi>10.1145/195473.195506
This paper describes the architecture of eNVy, a large non-volatile main memory storage system built primarily with Flash memory. eNVy presents its storage space as a linear, memory mapped array rather than as an emulated disk in order to provide an ...

Resource allocation in a high clock rate microprocessor
Michael Upton, Thomas Huff, Trevor Mudge, Richard Brown
Pages: 98 - 109 doi>10.1145/195473.195510
This paper discusses the design of a high clock rate (300MHz) processor. The architecture is described, and the goals for the design are explained. The performance of three processor models is evaluated using trace-driven simulation. A cost model is ...

Hardware and software support for efficient exception handling
Chandramohan A. Thekkath, Henry M. Levy
Pages: 110 - 119 doi>10.1145/195473.195515
Program-synchronous exceptions, for example, breakpoints, watchpoints, illegal opcodes, and memory access violations, provide information about exceptional conditions, interrupting the program and vectoring to an operating system handler. ...

A technique for monitoring run-time dynamics of an operating system and a microprocessor executing user applications
Pramod V. Argade, David K. Charles, Craig Taylor
Pages: 122 - 131 doi>10.1145/195473.195518
In this paper, we present a non-invasive and efficient technique for simulating applications complete with their operating system interaction. The technique involves booting and initiating an application on a hardware development system, capturing the ...

Trap-driven simulation with Tapeworm II
Richard Uhlig, David Nagle, Trevor Mudge, Stuart Sechrest
Pages: 132 - 144 doi>10.1145/195473.195521
Tapeworm II is a software-based simulation tool that evaluates the cache and TLB performance of multiple-task and operating system intensive workloads. Tapeworm resides in an OS kernel and causes a host machine's hardware to drive simulations with kernel ...

Contrasting characteristics and cache performance of technical and multi-user commercial workloads
Ann Marie Grizzaffi Maynard, Colette M. Donnelly, Bret R. Olszewski
Pages: 145 - 156 doi>10.1145/195473.195524
Experience has shown that many widely used benchmarks are poor predictors of the performance of systems running commercial applications. Research into this anomaly has long been hampered by a lack of address traces from representative multi-user commercial ...

Avoiding conflict misses dynamically in large direct-mapped caches
Brian N. Bershad, Dennis Lee, Theodore H. Romer, J. Bradley Chen
Pages: 158 - 170 doi>10.1145/195473.195527
This paper describes a method for improving the performance of a large direct-mapped cache by reducing the number of conflict misses. Our solution consists of two components: an inexpensive hardware device called a Cache Miss Lookaside (CML) buffer that ...

Surpassing the TLB performance of superpages with less operating system support
Madhusudhan Talluri, Mark D. Hill
Pages: 171 - 182 doi>10.1145/195473.195531
Many commercial microprocessor architectures have added translation lookaside buffer (TLB) support for superpages. Superpages differ from segments because their size must be a power of two multiple of the base page size ...

Dynamic memory disambiguation using the memory conflict buffer
David M. Gallagher, William Y. Chen, Scott A. Mahlke, John C. Gyllenhaal, Wen-mei W.
Hwu
Pages: 183 - 193 doi>10.1145/195473.195534
To exploit instruction level parallelism, compilers for VLIW and superscalar processors often employ static code scheduling. However, the available code reordering may be severely restricted due to ambiguous dependences between memory instructions. This ...

AP1000+: architectural support of PUT/GET interface for parallelizing compiler
Kenichi Hayashi, Tsunehisa Doi, Takeshi Horie, Yoichi Koyanagi, Osamu Shiraki, Nobutaka Imamura, Toshiyuki Shimizu, Hiroaki Ishihata, Tatsuya Shindo
Pages: 196 - 207 doi>10.1145/195473.195538
The scalability of distributed-memory parallel computers makes them attractive candidates for solving large-scale problems. New languages, such as HPF, FortranD, and VPP Fortran, have been developed to enable existing software to be easily ported to ...

LCM: memory system support for parallel language implementation
James R. Larus, Brad Richards, Guhan Viswanathan
Pages: 208 - 218 doi>10.1145/195473.195545
Higher-level parallel programming languages can be difficult to implement efficiently on parallel machines. This paper shows how a flexible, compiler-controlled memory system can help achieve good performance for language constructs that previously appeared ...

The performance advantages of integrating block data transfer in cache-coherent multiprocessors
Steven Cameron Woo, Jaswinder Pal Singh, John L. Hennessy
Pages: 219 - 229 doi>10.1145/195473.195547
Integrating support for block data transfer has become an important emphasis in recent cache-coherent shared address space multiprocessors. This paper examines the potential performance benefits of adding this support. A set of ambitious hardware mechanisms ...

Improving the accuracy of static branch prediction using branch correlation
Cliff Young, Michael D.
Smith
Pages: 232 - 241 doi>10.1145/195473.195549
Recent work in history-based branch prediction uses novel hardware structures to capture branch correlation and increase branch prediction accuracy. We present a profile-based code transformation that exploits branch correlation to improve the accuracy ...

Reducing branch costs via branch alignment
Brad Calder, Dirk Grunwald
Pages: 242 - 251 doi>10.1145/195473.195553
Several researchers have proposed algorithms for basic block reordering. We call these branch alignment algorithms. The primary emphasis of these algorithms has been on improving instruction cache locality, and the few studies concerned ...

Compiler optimizations for improving data locality
Steve Carr, Kathryn S. McKinley, Chau-Wen Tseng
Pages: 252 - 262 doi>10.1145/195473.195557
In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, ...

DCG: an efficient, retargetable dynamic code generation system
Dawson R. Engler, Todd A. Proebsting
Pages: 263 - 272 doi>10.1145/195473.195567
Dynamic code generation allows aggressive optimization through the use of runtime information. Previous systems typically relied on ad hoc code generators that were not designed for retargetability, and did not shield the client from machine-specific ...

The performance impact of flexibility in the Stanford FLASH multiprocessor
Mark Heinrich, Jeffrey Kuskin, David Ofelt, John Heinlein, Joel Baxter, Jaswinder Pal Singh, Richard Simoni, Kourosh Gharachorloo, David Nakahira, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, John Hennessy
Pages: 274 - 285 doi>10.1145/195473.195569
A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford ...

Simple compiler algorithms to reduce ownership overhead in cache coherence protocols
Jonas Skeppstedt, Per Stenström
Pages: 286 - 296 doi>10.1145/195473.195572
We study in this paper the design and efficiency of compiler algorithms that remove ownership overhead in shared-memory multiprocessors with write-invalidate protocols. These algorithms detect loads followed by stores to the same address. Such loads ...

Fine-grain access control for distributed shared memory
Ioannis Schoinas, Babak Falsafi, Alvin R. Lebeck, Steven K. Reinhardt, James R. Larus, David A. Wood
Pages: 297 - 306 doi>10.1145/195473.195575
This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper ...

Interleaving: a multithreading technique targeting multiprocessors and workstations
James Laudon, Anoop Gupta, Mark Horowitz
Pages: 308 - 318 doi>10.1145/195473.195576
There is an increasing trend to use commodity microprocessors as the compute engines in large-scale multiprocessors. However, given that the majority of the microprocessors are sold in the workstation market, not in the multiprocessor market, it is only ...

Hardware support for fast capability-based addressing
Nicholas P. Carter, Stephen W. Keckler, William J. Dally
Pages: 319 - 327 doi>10.1145/195473.195579
Traditional methods of providing protection in memory systems do so at the cost of increased context switch time and/or increased storage to record access permissions for processes. With the advent of computers that supported cycle-by-cycle multithreading, ...

The effectiveness of multiple hardware contexts
Radhika Thekkath, Susan J. Eggers
Pages: 328 - 337 doi>10.1145/195473.195583
Multithreaded processors are used to tolerate long memory latencies. By executing threads loaded in multiple hardware contexts, an otherwise idle processor can keep busy, thus increasing its utilization. However, the larger size of a multi-thread working ...

1996

The case for a single-chip multiprocessor
Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, Kunyung Chang
Pages: 2 - 11 doi>10.1145/237090.237140
Advances in IC processing allow for more microprocessor design options. The increasing gate density and cost of wires in advanced integrated circuit technologies require that we look for new ways to use their capabilities effectively. This paper shows ...

An evaluation of memory consistency models for shared-memory systems with ILP processors
Vijay S. Pai, Parthasarathy Ranganathan, Sarita V. Adve, Tracy Harton
Pages: 12 - 23 doi>10.1145/237090.237142
Relaxed consistency models have been shown to significantly outperform sequential consistency for single-issue, statically scheduled processors with blocking reads. However, current microprocessors aggressively exploit instruction-level parallelism (ILP) ...

Synchronization and communication in the T3E multiprocessor
Steven L.
Scott
Pages: 26 - 36 doi>10.1145/237090.237144
This paper describes the synchronization and communication primitives of the Cray T3E multiprocessor, a shared memory system scalable to 2048 processors. We discuss what we have learned from the T3D project (the predecessor to the T3E) and the rationale ...

Evaluation of architectural support for global address-based communication in large-scale parallel machines
Arvind Krishnamurthy, Klaus E. Schauser, Chris J. Scheiman, Randolph Y. Wang, David E. Culler, Katherine Yelick
Pages: 37 - 48 doi>10.1145/237090.237147
Large-scale parallel machines are incorporating increasingly sophisticated architectural support for user-level messaging and global memory access. We provide a systematic evaluation of a broad spectrum of current design alternatives based on our implementations ...

Whole-program optimization for time and space efficient threads
Dirk Grunwald, Richard Neves
Pages: 50 - 59 doi>10.1145/237090.237149
Modern languages and operating systems often encourage programmers to use threads, or independent control streams, to mask the overhead of some operations and simplify program structure. Multitasking operating systems use threads to mask communication ...

Thread scheduling for cache locality
James Philbin, Jan Edler, Otto J. Anshus, Craig C. Douglas, Kai Li
Pages: 60 - 71 doi>10.1145/237090.237151
This paper describes a method to improve the cache locality of sequential programs by scheduling fine-grained threads. The algorithm relies upon hints provided at the time of thread creation to determine a thread execution order likely to reduce cache ...

The Rio file cache: surviving operating system crashes
Peter M.
Chen, Wee Teck Ng, Subhachandra Chandra, Christopher Aycock, Gurushankar Rajamani, David Lowell
Pages: 74 - 83 doi>10.1145/237090.237154
One of the fundamental limits to high-performance, high-reliability file systems is memory's vulnerability to system crashes. Because memory is viewed as unsafe, systems periodically write data back to disk. The extra disk traffic lowers performance, ...

Petal: distributed virtual disks
Edward K. Lee, Chandramohan A. Thekkath
Pages: 84 - 92 doi>10.1145/237090.237157
The ideal storage system is globally accessible, always available, provides unlimited performance and capacity for a large number of clients, and requires no management. This paper describes the design, implementation, and performance of Petal, a system ...

A quantitative analysis of loop nest locality
Kathryn S. McKinley, Olivier Temam
Pages: 94 - 104 doi>10.1145/237090.237161
This paper analyzes and quantifies the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in nests, the vast majority ...

The intrinsic bandwidth requirements of ordinary programs
Andrew S. Huang, John Paul Shen
Pages: 105 - 114 doi>10.1145/237090.237163
While there has been an abundance of recent papers on hardware and software approaches to improving the performance of memory accesses, few papers have addressed the problem from the program's point of view. There is a general notion that certain programs ...

Multiple-block ahead branch predictors
André Seznec, Stéphan Jourdan, Pascal Sainrat, Pierre Michaud
Pages: 116 - 127 doi>10.1145/237090.237169
A basic rule in computer architecture is that a processor cannot execute an application faster than it fetches its instructions.
This paper presents a novel cost-effective mechanism called the two-block ahead branch predictor. Information from the current ...

Analysis of branch prediction via data compression
I-Cheng K. Chen, John T. Coffey, Trevor N. Mudge
Pages: 128 - 137 doi>10.1145/237090.237171
Branch prediction is an important mechanism in modern microprocessor design. The focus of research in this area has been on designing new branch prediction schemes. In contrast, very few studies address the theoretical basis behind these prediction schemes. ...

Value locality and load value prediction
Mikko H. Lipasti, Christopher B. Wilkerson, John Paul Shen
Pages: 138 - 147 doi>10.1145/237090.237173
Since the introduction of virtual memory demand-paging and cache memories, computer systems have been exploiting spatial and temporal locality to reduce the average latency of a memory reference. In this paper, we introduce the notion of value locality, ...

The structure and performance of interpreters
Theodore H. Romer, Dennis Lee, Geoffrey M. Voelker, Alec Wolman, Wayne A. Wong, Jean-Loup Baer, Brian N. Bershad, Henry M. Levy
Pages: 150 - 159 doi>10.1145/237090.237175
Interpreted languages have become increasingly popular due to demands for rapid program development, ease of use, portability, and safety. Beyond the general impression that they are "slow," however, little has been documented about the performance of ...

Adapting to network and client variability via on-demand dynamic distillation
Armando Fox, Steven D. Gribble, Eric A. Brewer, Elan Amir
Pages: 160 - 170 doi>10.1145/237090.237177
The explosive growth of the Internet and the proliferation of smart cellular phones and handheld wireless devices is widening an already large gap between Internet clients. Clients vary in their hardware resources, software sophistication, and quality ...

Shasta: a low overhead, software-only approach for supporting fine-grain shared memory
Daniel J. Scales, Kourosh Gharachorloo, Chandramohan A. Thekkath
Pages: 174 - 185 doi>10.1145/237090.237179
This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that shared ...

An integrated compile-time/run-time software distributed shared memory system
Sandhya Dwarkadas, Alan L. Cox, Willy Zwaenepoel
Pages: 186 - 197 doi>10.1145/237090.237181
On a distributed memory machine, hand-coded message passing leads to the most efficient execution, but it is difficult to use. Parallelizing compilers can approach the performance of hand-coded message passing by translating data-parallel programs into ...

Hiding communication latency and coherence overhead in software DSMs
R. Bianchini, L. I. Kontothanassis, R. Pinto, M. De Maria, M. Abud, C. L. Amorim
Pages: 198 - 209 doi>10.1145/237090.237185
In this paper we propose the use of a PCI-based programmable protocol controller for hiding communication and coherence overheads in software DSMs. Our protocol controller provides three different types of overhead tolerance: a) moving basic communication ...

SoftFLASH: analyzing the performance of clustered distributed virtual shared memory
Andrew Erlichson, Neal Nuckolls, Greg Chesson, John Hennessy
Pages: 210 - 220 doi>10.1145/237090.237187
One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment ...

Compiler-based prefetching for recursive data structures
Chi-Keung Luk, Todd C.
Mowry
Pages: 222 - 233 doi>10.1145/237090.237190
Software-controlled data prefetching offers the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. While prefetching has enjoyed considerable success in array-based numeric codes, ...

Exploiting dual data-memory banks in digital signal processors
Mazen A. R. Saghir, Paul Chow, Corinna G. Lee
Pages: 234 - 243 doi>10.1145/237090.237193
Over the past decade, digital signal processors (DSPs) have emerged as the processors of choice for implementing embedded applications in high-volume consumer products. Through their use of specialized hardware features and small chip areas, DSPs provide ...

Compiler-directed page coloring for multiprocessors
Edouard Bugnion, Jennifer M. Anderson, Todd C. Mowry, Mendel Rosenblum, Monica S. Lam
Pages: 244 - 255 doi>10.1145/237090.237195
This paper presents a new technique, compiler-directed page coloring, that eliminates conflict misses in multiprocessor applications. It enables applications to make better use of the increased aggregate cache size available in a multiprocessor. ...

Reducing network latency using subpages in a global memory environment
Hervé A. Jamrozik, Michael J. Feeley, Geoffrey M. Voelker, James Evans, II, Anna R. Karlin, Henry M. Levy, Mary K. Vernon
Pages: 258 - 267 doi>10.1145/237090.237198
New high-speed networks greatly encourage the use of network memory as a cache for virtual memory and file pages, thereby reducing the need for disk access. Because pages are the fundamental transfer and access units in remote memory systems, page size ...

Improving cache performance with balanced tag and data paths
Jih-Kwon Peir, Windsor W.
Hsu, Honesty Young, Shauchi Ong
Pages: 268 - 278 doi>10.1145/237090.237202
There are two concurrent paths in a typical cache access --- one through the data array and the other through the tag array. The path through the data array drives the selected set out of the array. The path through the tag array determines cache hit/miss ...

Operating system support for improving data locality on CC-NUMA compute servers
Ben Verghese, Scott Devine, Anoop Gupta, Mendel Rosenblum
Pages: 279 - 289 doi>10.1145/237090.237205
The dominant architecture for the next generation of shared-memory multiprocessors is CC-NUMA (cache-coherent non-uniform memory architecture). These machines are attractive as compute servers because they provide transparent access to local and remote ...

1998

Compiler-controlled memory
Keith D. Cooper, Timothy J. Harvey
Pages: 2 - 11 doi>10.1145/291069.291010
Optimizations aimed at reducing the impact of memory operations on execution speed have long concentrated on improving cache performance. These efforts achieve a reasonable level of success. The primary limit on the compiler's ability to improve memory ...

Segregating heap objects by reference behavior and lifetime
Matthew L. Seidl, Benjamin G. Zorn
Pages: 12 - 23 doi>10.1145/291069.291012
Dynamic storage allocation has become increasingly important in many applications, in part due to the use of the object-oriented paradigm. At the same time, processor speeds are increasing faster than memory speeds and programs are increasing in size ...

Schedule-independent storage mapping for loops
Michelle Mills Strout, Larry Carter, Jeanne Ferrante, Beth Simon
Pages: 24 - 33 doi>10.1145/291069.291015
This paper studies the relationship between storage requirements and performance. Storage-related dependences inhibit optimizations for locality and parallelism.
Techniques such as renaming and array expansion can eliminate all storage-related dependences, ...

An empirical analysis of instruction repetition
Avinash Sodani, Gurindar S. Sohi
Pages: 35 - 45 doi>10.1145/291069.291016
We study the phenomenon of instruction repetition, where the inputs and outputs of multiple dynamic instances of a static instruction are repeated. We observe that over 80% of the dynamic instructions executed in several programs are repeated and most ...

Space-time scheduling of instruction-level parallelism on a raw machine
Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek Sarkar, Saman Amarasinghe
Pages: 46 - 57 doi>10.1145/291069.291018
Increasing demand for both greater parallelism and faster clocks dictate that future generation architectures will need to decentralize their resources and eliminate primitives that require single cycle global communication. A Raw microprocessor distributes ...

Data speculation support for a chip multiprocessor
Lance Hammond, Mark Willey, Kunle Olukotun
Pages: 58 - 69 doi>10.1145/291069.291020
Thread-level speculation is a technique that enables parallel execution of sequential applications on a multiprocessor. This paper describes the complete implementation of the support for thread-level speculation on the Hydra chip multiprocessor (CMP). ...

VISA: Netstation's virtual Internet SCSI adapter
Rodney Van Meter, Gregory G. Finn, Steve Hotz
Pages: 71 - 80 doi>10.1145/291069.291023
In this paper we describe the implementation of VISA, our Virtual Internet SCSI Adapter. VISA was built to evaluate the performance impact on the host operating system of using IP to communicate with peripherals, especially storage devices. We have built ...

Active disks: programming model, algorithms and evaluation
Anurag Acharya, Mustafa Uysal, Joel Saltz
Pages: 81 - 91 doi>10.1145/291069.291026
Several application and technology trends indicate that it might be both profitable and feasible to move computation closer to the data that it processes. In this paper, we evaluate Active Disk architectures which integrate significant processing ...

A cost-effective, high-bandwidth storage architecture
Garth A. Gibson, David F. Nagle, Khalil Amiri, Jeff Butler, Fay W. Chang, Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, Jim Zelenka
Pages: 92 - 103 doi>10.1145/291069.291029
This paper describes the Network-Attached Secure Disk (NASD) storage architecture, prototype implementations of NASD drives, array management for our architecture, and three filesystems built on our prototype. NASD provides scalable storage bandwidth ...

Hardware-software trade-offs in a direct Rambus implementation of the RAMpage memory hierarchy
Philip Machanick, Pierre Salverda, Lance Pompe
Pages: 105 - 114 doi>10.1145/291069.291032
The RAMpage memory hierarchy is an alternative to the traditional division between cache and main memory: main memory is moved up a level and DRAM is used as a paging device. The idea behind RAMpage is to reduce hardware complexity, if at the cost of ...

Dependence based prefetching for linked data structures
Amir Roth, Andreas Moshovos, Gurindar S. Sohi
Pages: 115 - 126 doi>10.1145/291069.291034
We introduce a dynamic scheme that captures the access patterns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses ...

Performance counters and state sharing annotations: a unified approach to thread locality
Boris Weissman
Pages: 127 - 138 doi>10.1145/291069.291035
This paper describes a combined approach for improving thread locality that uses the hardware performance monitors of modern processors and program-centric code annotations to guide thread scheduling on SMPs. The approach relies on a shared state cache ...

Cache-conscious data placement
Brad Calder, Chandra Krintz, Simmi John, Todd Austin
Pages: 139 - 149 doi>10.1145/291069.291036
As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction cache performance by mapping code with temporal ...

An out-of-order execution technique for runtime binary translators
Bich C. Le
Pages: 151 - 158 doi>10.1145/291069.291039
A dynamic translator emulates an instruction set architecture by translating source instructions to native code during execution. On statically-scheduled hardware, higher performance can potentially be achieved by reordering the translated instructions; ...

Overlapping execution with transfer using non-strict execution for mobile programs
Chandra Krintz, Brad Calder, Han Bok Lee, Benjamin G. Zorn
Pages: 159 - 169 doi>10.1145/291069.291040
In order to execute a program on a remote computer, it must first be transferred over a network. This transmission incurs the overhead of network latency before execution can begin. This latency can vary greatly depending upon the size of the program, ...

Variable length path branch prediction
Jared Stark, Marius Evers, Yale N. Patt
Pages: 170 - 179 doi>10.1145/291069.291042
Accurate branch prediction is required to achieve high performance in deeply pipelined, wide-issue processors.
Recent studies have shown that conditional and indirect (or computed) branch targets can be accurately predicted by recording the path, which ...

Performance isolation: sharing and isolation in shared-memory multiprocessors
Ben Verghese, Anoop Gupta, Mendel Rosenblum
Pages: 181 - 192 doi>10.1145/291069.291044
Shared-memory multiprocessors (SMPs) are being extensively used as general-purpose servers. The tight coupling of multiple processors, memory, and I/O provides enormous computing power in a single system, and enables the efficient sharing of these resources. The ...

UTLB: a mechanism for address translation on network interfaces
Yuqun Chen, Angelos Bilas, Stefanos N. Damianakis, Cezary Dubnicki, Kai Li
Pages: 193 - 204 doi>10.1145/291069.291046
An important aspect of a high-speed network system is the ability to transfer data directly between the network interface and application buffers. Such a direct data path requires the network interface to "know" the virtual-to-physical address ...

Locality-aware request distribution in cluster-based network servers
Vivek S. Pai, Mohit Aron, Gaurav Banga, Michael Svendsen, Peter Druschel, Willy Zwaenepoel, Erich Nahum
Pages: 205 - 216 doi>10.1145/291069.291048
We consider cluster-based network servers in which a front-end directs incoming requests to one of a number of back-ends. Specifically, we consider content-based request distribution: the front-end uses the content requested, in addition to information ...

Investigating optimal local memory performance
Olivier Temam
Pages: 218 - 227 doi>10.1145/291069.291050
Recent work has demonstrated that cache space is often poorly utilized. However, no previous work has yet demonstrated upper bounds on what a cache or local memory could achieve when exploiting both spatial and temporal locality. Belady's MIN algorithm ...

Precise miss analysis for program transformations with caches of arbitrary associativity
Somnath Ghosh, Margaret Martonosi, Sharad Malik
Pages: 228 - 239 doi>10.1145/291069.291051
Analyzing and optimizing program memory performance is a pressing problem in high-performance computer architectures. Currently, software solutions addressing the processor-memory performance gap include compiler- or programmer-applied optimizations like ...

Capturing dynamic memory reference behavior with adaptive cache topology
Jih-Kwon Peir, Yongjoon Lee, Windsor W. Hsu
Pages: 240 - 250 doi>10.1145/291069.291053
Memory references exhibit locality and are therefore not uniformly distributed across the sets of a cache. This skew reduces the effectiveness of a cache because it results in the caching of a considerable number of less-recently-used lines which are ...

Accelerating multi-media processing by implementing memoing in multiplication and division units
Daniel Citron, Dror Feitelson, Larry Rudolph
Pages: 252 - 261 doi>10.1145/291069.291056
This paper proposes a technique that enables performing multi-cycle (multiplication, division, square-root …) computations in a single cycle. The technique is based on the notion of memoing: saving the input and output of previous calculations ...

Value speculation scheduling for high performance processors
Chao-Ying Fu, Matthew D. Jennings, Sergei Y. Larin, Thomas M. Conte
Pages: 262 - 271 doi>10.1145/291069.291058
Recent research in value prediction shows a surprising amount of predictability for the values produced by register-writing instructions. Several hardware based value predictor designs have been proposed to exploit this predictability by eliminating ...

An empirical study of decentralized ILP execution models
Narayan Ranganathan, Manoj Franklin
Pages: 272 - 281 doi>10.1145/291069.291061
Recent fascination for dynamic scheduling as a means for exploiting instruction-level parallelism has introduced significant interest in the scalability aspects of dynamic scheduling hardware. In order to overcome the scalability problems of centralized ...

Fast out-of-order processor simulation using memoization
Eric Schnarr, James R. Larus
Pages: 283 - 294 doi>10.1145/291069.291063
Our new out-of-order processor simulator, FastSim, uses two innovations to speed up simulation 8-15 times (vs. Wisconsin SimpleScalar) with no loss in simulation accuracy. First, FastSim uses speculative direct-execution to accelerate the functional ...

A look at several memory management units, TLB-refill mechanisms, and page table organizations
Bruce L. Jacob, Trevor N. Mudge
Pages: 295 - 306 doi>10.1145/291069.291065
Virtual memory is a staple in modern systems, though there is little agreement on how its functionality is to be implemented on either the hardware or software side of the interface. The myriad of design choices and incompatible hardware mechanisms suggests ...

Performance of database workloads on shared-memory systems with out-of-order processors
Parthasarathy Ranganathan, Kourosh Gharachorloo, Sarita V. Adve, Luiz André Barroso
Pages: 307 - 318 doi>10.1145/291069.291067
Database applications such as online transaction processing (OLTP) and decision support systems (DSS) constitute the largest and fastest-growing segment of the market for multiprocessor servers. However, most current system designs have been optimized ...

2000

Designing computer systems with MEMS-based storage
Steven W. Schlosser, John Linwood Griffin, David F. Nagle, Gregory R.
Ganger
Pages: 1 - 12 doi>10.1145/378993.378996
For decades the RAM-to-disk memory hierarchy gap has plagued computer architects. An exciting new storage technology based on microelectromechanical systems (MEMS) is poised to fill a large portion of this performance gap, significantly reduce system ...

Architecture and design of AlphaServer GS320
Kourosh Gharachorloo, Madhu Sharma, Simon Steely, Stephen Van Doren
Pages: 13 - 24 doi>10.1145/378993.378997
This paper describes the architecture and implementation of the AlphaServer GS320, a cache-coherent non-uniform memory access multiprocessor developed at Compaq. The AlphaServer GS320 architecture is specifically targeted at medium-scale multiprocessing ...

Timestamp snooping: an approach for extending SMPs
Milo M. K. Martin, Daniel J. Sorin, Anastassia Ailamaki, Alaa R. Alameldeen, Ross M. Dickson, Carl J. Mauer, Kevin E. Moore, Manoj Plakal, Mark D. Hill, David A. Wood
Pages: 25 - 36 doi>10.1145/378993.378998
Symmetric multiprocessor (SMP) servers provide superior performance for the commercial workloads that dominate the Internet. Our simulation results show that over one-third of cache misses by these applications result in cache-to-cache transfers, where ...

MemorIES3: a programmable, real-time hardware emulation tool for multiprocessor server design
Ashwini Nanda, Kwok-Ken Mak, Krishnan Sugavanam, Ramendra K. Sahoo, Vijayaraghavan Soundararajan, T. Basil Smith
Pages: 37 - 48 doi>10.1145/378993.378999
Modern system design often requires multiple levels of simulation for design validation and performance debugging. However, while machines have gotten faster, and simulators have become more detailed, simulation speeds have not tracked machine speeds, ...

FLASH vs.
(Simulated) FLASH: closing the simulation loop
Jeff Gibson, Robert Kunz, David Ofelt, Mark Horowitz, John Hennessy, Mark Heinrich
Pages: 49 - 58 doi>10.1145/378993.379000
Simulation is the primary method for evaluating computer systems during all phases of the design process. One significant problem with simulation is that it rarely models the system exactly, and quantifying the resulting simulator error can be difficult. ...

Using meta-level compilation to check FLASH protocol code
Andy Chou, Benjamin Chelf, Dawson Engler, Mark Heinrich
Pages: 59 - 70 doi>10.1145/378993.379002
Building systems such as OS kernels and embedded software is difficult. An important source of this difficulty is the numerous rules they must obey: interrupts cannot be disabled for "too long," global variables must be protected by locks, user pointers ...

Evaluating design alternatives for reliable communication on high-speed networks
Raoul A. F. Bhoedjang, Kees Verstoep, Tim Rühl, Henri E. Bal, Rutger F. H. Hofman
Pages: 71 - 81 doi>10.1145/378993.379004
We systematically evaluate the performance of five implementations of a single, user-level communication interface. Each implementation makes different architectural assumptions about the reliability of the network hardware and the capabilities of the ...

Communication scheduling
Peter Mattson, William J. Dally, Scott Rixner, Ujval J. Kapasi, John D. Owens
Pages: 82 - 92 doi>10.1145/378993.379005
The high arithmetic rates of media processing applications require architectures with tens to hundreds of functional units, multiple register files, and explicit interconnect between functional units and register files. Communication scheduling enables ...

System architecture directions for networked sensors
Jason Hill, Robert Szewczyk, Alec Woo, Seth Hollar, David Culler, Kristofer Pister
Pages: 93 - 104 doi>10.1145/378993.379006
Technological progress in integrated, low-power, CMOS communication devices and sensors makes a rich design space of networked sensors viable. They can be deeply embedded in the physical world and spread throughout our environment like smart dust. The ...

Power aware page allocation
Alvin R. Lebeck, Xiaobo Fan, Heng Zeng, Carla Ellis
Pages: 105 - 116 doi>10.1145/378993.379007
One of the major challenges of post-PC computing is the need to reduce energy consumption, thereby extending the lifetime of the batteries that power these mobile devices. Memory is a particularly important target for efforts to improve energy efficiency. ...

Hoard: a scalable memory allocator for multithreaded applications
Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, Paul R. Wilson
Pages: 117 - 128 doi>10.1145/378993.379232
Parallel, multithreaded C and C++ programs such as web servers, database managers, news servers, and scientific applications are becoming increasingly prevalent. For these applications, the memory allocator is often a bottleneck that severely limits ...

Thread-level parallelism and interactive performance of desktop applications
Kristián Flautner, Rich Uhlig, Steve Reinhardt, Trevor Mudge
Pages: 129 - 138 doi>10.1145/378993.379233
Multiprocessing is already prevalent in servers where multiple clients present an obvious source of thread-level parallelism. However, the case for multiprocessing is less clear for desktop applications. Nevertheless, architects are designing processors ...

Effective null pointer check elimination utilizing hardware trap
Motohiro Kawahito, Hideaki Komatsu, Toshio Nakatani
Pages: 139 - 149 doi>10.1145/378993.379234
We present a new algorithm for eliminating null pointer checks from programs written in Java™. Our new algorithm is split into two phases. In the first phase, it moves null checks backward, and it is iterated a few times with other optimizations ...

Frequent value locality and value-centric data cache design
Youtao Zhang, Jun Yang, Rajiv Gupta
Pages: 150 - 159 doi>10.1145/378993.379235
By studying the behavior of programs in the SPECint95 suite we observed that six out of eight programs exhibit a new kind of value locality, the frequent value locality, according to which a few values appear very frequently in memory locations ...

Efficient and flexible value sampling
M. Burrows, U. Erlingsson, S-T. A. Leung, M. T. Vandevoorde, C. A. Waldspurger, K. Walker, W. E. Weihl
Pages: 160 - 167 doi>10.1145/378993.379236
This paper presents novel sampling-based techniques for collecting statistical profiles of register contents, data values, and other information associated with instructions, such as memory latencies. Values of interest are sampled in response to periodic ...

Architectural support for copy and tamper resistant software
David Lie, Chandramohan Thekkath, Mark Mitchell, Patrick Lincoln, Dan Boneh, John Mitchell, Mark Horowitz
Pages: 168 - 177 doi>10.1145/378993.379237
Although there have been attempts to develop code transformations that yield tamper-resistant software, no reliable software-only methods are known. This paper studies the hardware implementation of a form of execute-only memory (XOM) that allows instructions ...

Architectural support for fast symmetric-key cryptography
Jerome Burke, John McDonald, Todd Austin
Pages: 178 - 189 doi>10.1145/378993.379238
The emergence of the Internet as a trusted medium for commerce and communication has made cryptography an essential component of modern information systems. Cryptography provides the mechanisms necessary to implement accountability, accuracy, and confidentiality ...

OceanStore: an architecture for global-scale persistent storage
John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Chris Wells, Ben Zhao
Pages: 190 - 201 doi>10.1145/378993.379239
OceanStore is a utility infrastructure designed to span the globe and provide continuous access to persistent information. Since this infrastructure is comprised of untrusted servers, data is protected through redundancy and cryptographic techniques. ...

Software profiling for hot path prediction: less is more
Evelyn Duesterwald, Vasanth Bala
Pages: 202 - 211 doi>10.1145/378993.379241
Recently, there has been a growing interest in exploiting profile information in adaptive systems such as just-in-time compilers, dynamic optimizers, and binary translators. In this paper, we show that sophisticated software profiling schemes that provide ...

OS and compiler considerations in the design of the IA-64 architecture
Rumi Zahir, Jonathan Ross, Dale Morris, Drew Hess
Pages: 212 - 221 doi>10.1145/378993.379242
Increasing demands for processor performance have outstripped the pace of process and frequency improvements, pushing designers to find ways of increasing the amount of work that can be processed in parallel. Traditional RISC architectures use hardware ...

Hardware support for dynamic activation of compiler-directed computation reuse
Daniel A. Connors, Hillery C. Hunter, Ben-Chung Cheng, Wen-mei W.
Hwu
Pages: 222 - 233 doi>10.1145/378993.379243
Compiler-directed Computation Reuse (CCR) enhances program execution speed and efficiency by eliminating dynamic computation redundancy. In this approach, the compiler designates large program regions for potential reuse. During run time, the execution ...

Symbiotic jobscheduling for a simultaneous multithreaded processor
Allan Snavely, Dean M. Tullsen
Pages: 234 - 244 doi>10.1145/378993.379244
Simultaneous Multithreading machines fetch and execute instructions from multiple instruction streams to increase system utilization and speed up the execution of jobs. When there are more jobs in the system than there is hardware to support simultaneous ...

An analysis of operating system behavior on a simultaneous multithreaded architecture
Joshua A. Redstone, Susan J. Eggers, Henry M. Levy
Pages: 245 - 256 doi>10.1145/378993.379245
This paper presents the first analysis of operating system execution on a simultaneous multithreaded (SMT) processor. While SMT has been studied extensively over the past 6 years, previous research has focused entirely on user-mode execution. However, ...

Slipstream processors: improving both performance and fault tolerance
Karthik Sundaramoorthy, Zach Purser, Eric Rotenberg
Pages: 257 - 268 doi>10.1145/378993.379247
Processors execute the full dynamic instruction stream to arrive at the final output of a program, yet there exist shorter instruction streams that produce the same overall effect. We propose creating a shorter but otherwise equivalent version of the ...

2002

Keynote address: Sensor network research: emerging challenges for architecture, systems, and languages
Deborah Estrin
Pages: 1 - 4 doi>10.1145/605397.1090192

SESSION: Multiprocessor synchronization and speculation

Transactional lock-free execution of lock-based programs
Ravi Rajwar, James R.
Goodman
Pages: 5 - 17 doi>10.1145/605397.605399
This paper is motivated by the difficulty in writing correct high-performance programs. Writing shared-memory multi-threaded programs imposes a complex trade-off between programming ease and performance, largely due to subtleties in coordinating access ...

Speculative synchronization: applying thread-level speculation to explicitly parallel applications
José F. Martínez, Josep Torrellas
Pages: 18 - 29 doi>10.1145/605397.605400
Barriers, locks, and flags are synchronizing operations widely used by programmers and parallelizing compilers to produce race-free parallel programs. Oftentimes, these operations are placed suboptimally, either because of conservative assumptions about ...

Temporally silent stores
Kevin M. Lepak, Mikko H. Lipasti
Pages: 30 - 41 doi>10.1145/605397.605401
Recent work has shown that silent stores--stores which write a value matching the one already stored at the memory location--occur quite frequently and can be exploited to reduce memory traffic and improve performance. This paper extends the definition ...

SESSION: System performance and optimization

Automatically characterizing large scale program behavior
Timothy Sherwood, Erez Perelman, Greg Hamerly, Brad Calder
Pages: 45 - 57 doi>10.1145/605397.605403
Understanding program behavior is at the foundation of computer architecture and program optimization. Many programs have wildly different behavior on even the very largest of scales (over the complete execution of the program). This realization has ...

Bytecode fetch optimization for a Java interpreter
Kazunori Ogata, Hideaki Komatsu, Toshio Nakatani
Pages: 58 - 67 doi>10.1145/605397.605404
Interpreters play an important role in many languages, and their performance is critical particularly for the popular language Java.
The performance of the interpreter is important even for high-performance virtual machines that employ just-in-time compiler ...

Understanding and improving operating system effects in control flow prediction
Tao Li, Lizy Kurian John, Anand Sivasubramaniam, N. Vijaykrishnan, Juan Rubio
Pages: 68 - 80 doi>10.1145/605397.605405
Many modern applications result in a significant operating system (OS) component. The OS component has several implications including affecting the control flow transfer in the execution environment. This paper focuses on understanding the operating ...

SESSION: Emerging systems

Maté: a tiny virtual machine for sensor networks
Philip Levis, David Culler
Pages: 85 - 95 doi>10.1145/605397.605407
Composed of tens of thousands of tiny devices with very limited resources ("motes"), sensor networks are subject to novel systems problems and constraints. The large number of motes in a sensor network means that there will often be some failing nodes; ...

Energy-efficient computing for wildlife tracking: design tradeoffs and early experiences with ZebraNet
Philo Juang, Hidekazu Oki, Yong Wang, Margaret Martonosi, Li Shiuan Peh, Daniel Rubenstein
Pages: 96 - 107 doi>10.1145/605397.605408
Over the past decade, mobile computing and wireless communication have become increasingly important drivers of many new computing applications. The field of wireless sensor networks particularly focuses on applications involving autonomous use of compute, ...

Enabling trusted software integrity
Darko Kirovski, Milenko Drinić, Miodrag Potkonjak
Pages: 108 - 120 doi>10.1145/605397.605409
Preventing execution of unauthorized software on a given computer plays a pivotal role in system security. The key problem is that although a program at the beginning of its execution can be verified as authentic, while running, its execution flow can ...

SESSION: Energy efficient systems

ECOSystem: managing energy as a first class operating system resource
Heng Zeng, Carla S. Ellis, Alvin R. Lebeck, Amin Vahdat
Pages: 123 - 132 doi>10.1145/605397.605411
Energy consumption has recently been widely recognized as a major challenge of computer systems design. This paper explores how to support energy as a first-class operating system resource. Energy, because of its global system nature, presents challenges ...

Cool-Mem: combining statically speculative memory accessing with selective address translation for energy efficiency
Raksit Ashok, Saurabh Chheda, Csaba Andras Moritz
Pages: 133 - 143 doi>10.1145/605397.605412
This paper presents Cool-Mem, a family of memory system architectures that integrate conventional memory system mechanisms, energy-aware address translation, and compiler-enabled cache disambiguation techniques, to reduce energy consumption in general ...

Joint local and global hardware adaptations for energy
Ruchira Sasanka, Christopher J. Hughes, Sarita V. Adve
Pages: 144 - 155 doi>10.1145/605397.605413
This work concerns algorithms to control energy-driven architecture adaptations for multimedia applications, without and with dynamic voltage scaling (DVS). We identify a broad design space for adaptation control algorithms based on two attributes: (1) ...

SESSION: Speculative threads

Design and evaluation of compiler algorithms for pre-execution
Dongkeun Kim, Donald Yeung
Pages: 159 - 170 doi>10.1145/605397.605415
Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of ...

Compiler optimization of scalar value communication between speculative threads
Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan, Todd C.
Mowry
Pages: 171 - 183 doi>10.1145/605397.605416
While there have been many recent proposals for hardware that supports Thread-Level Speculation (TLS), there has been relatively little work on compiler optimizations to fully exploit this potential for parallelizing programs optimistically. In ...

Enhancing software reliability with speculative threads
Jeffrey Oplinger, Monica S. Lam
Pages: 184 - 196 doi>10.1145/605397.605417
This paper advocates the use of a monitor-and-recover programming paradigm to enhance the reliability of software, and proposes an architectural design that allows software and hardware to cooperate in making this paradigm more efficient and easier to ...

SESSION: Computer architecture

Dynamic dead-instruction detection and elimination
J. Adam Butts, Guri Sohi
Pages: 199 - 210 doi>10.1145/605397.605419
We observe a non-negligible fraction--3 to 16% in our benchmarks--of dynamically dead instructions, dynamic instruction instances that generate unused results. The majority of these instructions arise from static instructions that also produce ...

An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
Changkyu Kim, Doug Burger, Stephen W. Keckler
Pages: 211 - 222 doi>10.1145/605397.605420
Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the ...

A comparative study of arbitration algorithms for the Alpha 21364 pipelined router
Shubhendu S.
Mukherjee, Federico Silla, Peter Bannon, Joel Emer, Steve Lang, David Webb
Pages: 223 - 234 doi>10.1145/605397.605421
Interconnection networks usually consist of a fabric of interconnected routers, which receive packets arriving at their input ports and forward them to appropriate output ports. Unfortunately, network packets moving through these routers are often delayed ...

SESSION: Communication abstractions and optimizations

Increasing web server throughput with network interface data caching
Hyong-youb Kim, Vijay S. Pai, Scott Rixner
Pages: 239 - 250 doi>10.1145/605397.605423
This paper introduces network interface data caching, a new technique to reduce local interconnect traffic on networking servers by caching frequently-requested content on a programmable network interface. The operating system on the host CPU determines ...

Programming language optimizations for modular router configurations
Eddie Kohler, Robert Morris, Benjie Chen
Pages: 251 - 263 doi>10.1145/605397.605424
Networking systems such as Ensemble, the x-kernel, Scout, and Click achieve flexibility by building routers and other packet processors from modular components. Unfortunately, component designs are often slower than purpose-built code, and routers ...

Evolving RPC for active storage
Muthian Sivathanu, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Pages: 264 - 276 doi>10.1145/605397.605425
We introduce Scriptable RPC (SRPC), an RPC-based framework that enables distributed system services to take advantage of active components. Technology trends point to a world where each component in a system (whether disk, network interface, or memory) ...

SESSION: Coordinating memory

A stateless, content-directed data prefetching mechanism
Robert Cooksey, Stephan Jourdan, Dirk Grunwald
Pages: 279 - 290 doi>10.1145/605397.605427
Although central processor speeds continue to improve, improvements in overall system performance are increasingly hampered by memory latency, especially for pointer-intensive applications. To counter this loss of performance, numerous data and instruction ...

A stream compiler for communication-exposed architectures
Michael I. Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali S. Meli, Andrew A. Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, Saman Amarasinghe
Pages: 291 - 303 doi>10.1145/605397.605428
With the increasing miniaturization of transistors, wire delays are becoming a dominant factor in microprocessor performance. To address this issue, a number of emerging architectures contain replicated processing units with software-exposed communication ...

Mondrian memory protection
Emmett Witchel, Josh Cates, Krste Asanović
Pages: 304 - 316 doi>10.1145/605397.605429
Mondrian memory protection (MMP) is a fine-grained protection scheme that allows multiple protection domains to flexibly share memory and export protected services. In contrast to earlier page-based systems, MMP allows arbitrary permissions control at ...

2004

Programming with transactional coherence and consistency (TCC)
Lance Hammond, Brian D. Carlstrom, Vicky Wong, Ben Hertzberg, Mike Chen, Christos Kozyrakis, Kunle Olukotun
Pages: 1 - 13 doi>10.1145/1024393.1024395
Transactional Coherence and Consistency (TCC) offers a way to simplify parallel programming by executing all code within transactions. In TCC systems, transactions serve as the fundamental unit of parallel work, communication and coherence. As each transaction ...

Spatial computation
Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein
Pages: 14 - 26 doi>10.1145/1024393.1024396
This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized ...

An ultra low-power processor for sensor networks
Virantha Ekanayake, Clinton Kelly, IV, Rajit Manohar
Pages: 27 - 36 doi>10.1145/1024393.1024397
We present a novel processor architecture designed specifically for use in low-power wireless sensor-network nodes. Our sensor network asynchronous processor (SNAP/LE) is based on an asynchronous data-driven 16-bit RISC core with an extremely low-power ...

SESSION: Storage

D-SPTF: decentralized request distribution in brick-based storage systems
Christopher R. Lumb, Richard Golding
Pages: 37 - 47 doi>10.1145/1024393.1024399
Distributed Shortest-Positioning Time First (D-SPTF) is a request distribution protocol for decentralized systems of storage servers. D-SPTF exploits high-speed interconnects to dynamically select which server, among those with a replica, should service ...

FAB: building distributed enterprise disk arrays from commodity components
Yasushi Saito, Svend Frølund, Alistair Veitch, Arif Merchant, Susan Spence
Pages: 48 - 58 doi>10.1145/1024393.1024400
This paper describes the design, implementation, and evaluation of a Federated Array of Bricks (FAB), a distributed disk array that provides the reliability of traditional enterprise arrays with lower cost and better scalability. FAB is built from a ...

Deconstructing storage arrays
Timothy E. Denehy, John Bent, Florentina I. Popovici, Andrea C. Arpaci-Dusseau, Remzi H.
Arpaci-Dusseau
Pages: 59 - 71 doi>10.1145/1024393.1024401
We introduce Shear, a user-level software tool that characterizes RAID storage arrays. Shear employs a set of controlled algorithms combined with statistical techniques to automatically determine the important properties of a RAID system, including the ...

SESSION: Security

HIDE: an infrastructure for efficiently protecting information leakage on the address bus
Xiaotong Zhuang, Tao Zhang, Santosh Pande
Pages: 72 - 84 doi>10.1145/1024393.1024403
XOM-based secure processor has recently been introduced as a mechanism to provide copy and tamper resistant execution. XOM provides support for encryption/decryption and integrity checking. However, neither XOM nor any other current approach adequately ...

Secure program execution via dynamic information flow tracking
G. Edward Suh, Jae W. Lee, David Zhang, Srinivas Devadas
Pages: 85 - 96 doi>10.1145/1024393.1024404
We present a simple architectural mechanism called dynamic information flow tracking that can significantly improve the security of computing systems with negligible performance overhead. Dynamic information flow tracking protects programs against malicious ...

SESSION: Architecture

Coherence decoupling: making use of incoherence
Jaehyuk Huh, Jichuan Chang, Doug Burger, Gurindar S. Sohi
Pages: 97 - 106 doi>10.1145/1024393.1024406
This paper explores a new technique called coherence decoupling, which breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces ...

Continual flow pipelines
Srikanth T.
Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, Mike Upton
Pages: 107 - 119 doi>10.1145/1024393.1024407
Increased integration in the form of multiple processor cores on a single die, relatively constant die sizes, shrinking power envelopes, and emerging applications create a new challenge for processor architects. How to build a processor that provides ...

Scalable selective re-execution for EDGE architectures
Rajagopalan Desikan, Simha Sethumadhavan, Doug Burger, Stephen W. Keckler
Pages: 120 - 132 doi>10.1145/1024393.1024408
Pipeline flushes are becoming increasingly expensive in modern microprocessors with large instruction windows and deep pipelines. Selective re-execution is a technique that can reduce the penalty of mis-speculations by re-executing only instructions ...

SESSION: Potpourri

HOIST: a system for automatically deriving static analyzers for embedded systems
John Regehr, Alastair Reid
Pages: 133 - 143 doi>10.1145/1024393.1024410
Embedded software must meet conflicting requirements such as being highly reliable, running on resource-constrained platforms, and being developed rapidly. Static program analysis can help meet all of these goals. People developing analyzers for embedded ...

Helper threads via virtual multithreading on an experimental Itanium® 2 processor-based platform
Perry H. Wang, Jamison D. Collins, Hong Wang, Dongkeun Kim, Bill Greene, Kai-Ming Chan, Aamir B. Yunus, Terry Sych, Stephen F. Moore, John P. Shen
Pages: 144 - 155 doi>10.1145/1024393.1024411
Helper threading is a technology to accelerate a program by exploiting a processor's multithreading capability to run "assist" threads. Previous experiments on hyper-threaded processors have demonstrated significant speedups by using helper threads ...

Low-overhead memory leak detection using adaptive statistical profiling
Matthias Hauswirth, Trishul M.
Chilimbi
Pages: 156 - 164 doi>10.1145/1024393.1024412
Sampling has been successfully used to identify performance optimization opportunities. We would like to apply similar techniques to check program correctness. Unfortunately, sampling provides poor coverage of infrequently executed code, where bugs often ...

SESSION: Memory system analysis and optimization

Locality phase prediction
Xipeng Shen, Yutao Zhong, Chen Ding
Pages: 165 - 176 doi>10.1145/1024393.1024414
As computer memory hierarchy becomes adaptive, its performance increasingly depends on forecasting the dynamic program locality. This paper presents a method that predicts the locality phases of a program by a combination of locality profiling and run-time ...

Dynamic tracking of page miss ratio curve for memory management
Pin Zhou, Vivek Pandey, Jagadeesan Sundaresan, Anand Raghuraman, Yuanyuan Zhou, Sanjeev Kumar
Pages: 177 - 188 doi>10.1145/1024393.1024415
Memory can be efficiently utilized if the dynamic memory demands of applications can be determined and analyzed at run-time. The page miss ratio curve (MRC), i.e. page miss rate vs. memory size curve, is a good performance-directed metric to serve this ...

Compiler orchestrated prefetching via speculation and predication
Rodric M. Rabbah, Hariharan Sandanagobalane, Mongkol Ekpanyapong, Weng-Fai Wong
Pages: 189 - 198 doi>10.1145/1024393.1024416
This paper introduces a compiler orchestrated prefetching system as a unified framework geared toward ameliorating the gap between processing speeds and memory access latencies. We focus the scope of the optimization on specific subsets of the program ...

Software prefetching for mark-sweep garbage collection: hardware analysis and software redesign
Chen-Yong Cher, Antony L. Hosking, T. N.
Vijaykumar
Pages: 199 - 210 doi>10.1145/1024393.1024417
Tracing garbage collectors traverse references from live program variables, transitively tracing out the closure of live objects. Memory accesses incurred during tracing are essentially random: a given object may contain references to any other object. ...

SESSION: Reliability

Devirtualizable virtual machines enabling general, single-node, online maintenance
David E. Lowell, Yasushi Saito, Eileen J. Samberg
Pages: 211 - 223 doi>10.1145/1024393.1024419
Maintenance is the dominant source of downtime at high availability sites. Unfortunately, the dominant mechanism for reducing this downtime, cluster rolling upgrade, has two shortcomings that have prevented its broad acceptance. First, cluster-style ...

Fingerprinting: bounding soft-error detection latency and bandwidth
Jared C. Smolens, Brian T. Gold, Jangwoo Kim, Babak Falsafi, James C. Hoe, Andreas G. Nowatzyk
Pages: 224 - 234 doi>10.1145/1024393.1024420
Recent studies have suggested that the soft-error rate in microprocessor logic will become a reliability concern by 2010. This paper proposes an efficient error detection technique, called fingerprinting, that detects differences in execution ...

Application-level checkpointing for shared memory programs
Greg Bronevetsky, Daniel Marques, Keshav Pingali, Peter Szwed, Martin Schulz
Pages: 235 - 247 doi>10.1145/1024393.1024421
Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR) - the state of the computation is saved periodically on disk, and ...

SESSION: Power

Formal online methods for voltage/frequency control in multiple clock domain microprocessors
Qiang Wu, Philo Juang, Margaret Martonosi, Douglas W.
Clark Pages: 248 - 259 doi>10.1145/1024393.1024423 Full text: Pdf Multiple Clock Domain (MCD) processors are a promising future alternative to today's fully synchronous designs. Dynamic Voltage and Frequency Scaling (DVFS) in an MCD processor has the extra flexibility to adjust the voltage and frequency in each domain ... expand Heat-and-run: leveraging SMT and CMP to manage power density through the operating system Mohamed Gomaa, Michael D. Powell, T. N. Vijaykumar Pages: 260 - 270 doi>10.1145/1024393.1024424 Full text: Pdf Power density in high-performance processors continues to increase with technology generations as scaling of current, clock speed, and device density outpaces the downscaling of supply voltage and thermal ability of packages to dissipate heat. Power ... expand Performance directed energy management for main memory and disks Xiaodong Li, Zhenmin Li, Francis David, Pin Zhou, Yuanyuan Zhou, Sarita Adve, Sanjeev Kumar Pages: 271 - 283 doi>10.1145/1024393.1024425 Full text: Pdf Much research has been conducted on energy management for memory and disks. Most studies use control algorithms that dynamically transition devices to low power modes after they are idle for a certain threshold period of time. The control algorithms ... expand 2006 Impact of virtualization on computer architecture and operating systems Mendel Rosenblum Pages: 1 - 1 doi>10.1145/1168857.1168858 Full text: Pdf Abstract This talk describes how virtualization is changing the way computing is done in the industry today and how it is causing users to rethink how they view hardware, operating systems, and application programs. The talk will describe this new view ... expand SESSION: Virtualization A comparison of software and hardware techniques for x86 virtualization Keith Adams, Ole Agesen Pages: 2 - 13 doi>10.1145/1168857.1168860 Full text: Pdf Until recently, the x86 architecture has not permitted classical trap-and-emulate virtualization. 
Virtual Machine Monitors for x86, such as VMware Workstation and Virtual PC, have instead used binary translation of the guest kernel code. However, ... expand Geiger: monitoring the buffer cache in a virtual machine environment Stephen T. Jones, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Pages: 14 - 24 doi>10.1145/1168857.1168861 Full text: Pdf Virtualization is increasingly being used to address server management and administration issues like flexible resource allocation, service isolation and workload migration. In a virtualized environment, the virtual machine monitor (VMM) is the primary ... expand Temporal search: detecting hidden malware timebombs with virtual machines Jedidiah R. Crandall, Gary Wassermann, Daniela A. S. de Oliveira, Zhendong Su, S. Felix Wu, Frederic T. Chong Pages: 25 - 36 doi>10.1145/1168857.1168862 Full text: Pdf Worms, viruses, and other malware can be ticking bombs counting down to a specific time, when they might, for example, delete files or download new instructions from a public web server. We propose a novel virtual-machine-based analysis technique to ... expand SESSION: Races and memory debugging I AVIO: detecting atomicity violations via access interleaving invariants Shan Lu, Joseph Tucek, Feng Qin, Yuanyuan Zhou Pages: 37 - 48 doi>10.1145/1168857.1168864 Full text: Pdf Concurrency bugs are among the most difficult to test and diagnose of all software bugs. The multicore technology trend worsens this problem. Most previous concurrency bug detection work focuses on one bug subclass, data races, and neglects many other ... expand A regulated transitive reduction (RTR) for longer memory race recording Min Xu, Mark D. Hill, Rastislav Bodik Pages: 49 - 60 doi>10.1145/1168857.1168865 Full text: Pdf Now at VMware. Multithreaded deterministic replay has important applications in cyclic debugging, fault tolerance and intrusion analysis. Memory race recording is a key technology for multithreaded deterministic replay. 
In this paper, we considerably ... expand Bell: bit-encoding online memory leak detection Michael D. Bond, Kathryn S. McKinley Pages: 61 - 72 doi>10.1145/1168857.1168866 Full text: Pdf Memory leaks compromise availability and security by crippling performance and crashing programs. Leaks are difficult to diagnose because they have no immediate symptoms. Online leak detection tools benefit from storing and reporting per-object sites ... expand SESSION: Hardware reliability and fault tolerance Ultra low-cost defect protection for microprocessor pipelines Smitha Shyam, Kypros Constantinides, Sujay Phadke, Valeria Bertacco, Todd Austin Pages: 73 - 82 doi>10.1145/1168857.1168868 Full text: Pdf The sustained push toward smaller and smaller technology sizes has reached a point where device reliability has moved to the forefront of concerns for next-generation designs. Silicon failure mechanisms, such as transistor wearout and manufacturing defects, ... expand Understanding prediction-based partial redundant threading for low-overhead, high-coverage fault tolerance Vimal K. Reddy, Eric Rotenberg, Sailashri Parthasarathy Pages: 83 - 94 doi>10.1145/1168857.1168869 Full text: Pdf Redundant threading architectures duplicate all instructions to detect and possibly recover from transient faults. Several lighter weight Partial Redundant Threading (PRT) architectures have been proposed recently. (i) Opportunistic Fault Tolerance duplicates ... expand SlicK: slice-based locality exploitation for efficient redundant multithreading Angshuman Parashar, Anand Sivasubramaniam, Sudhanva Gurumurthi Pages: 95 - 105 doi>10.1145/1168857.1168870 Full text: Pdf Transient faults are expected to be a major design consideration in future microprocessors. Recent proposals for transient fault detection in processor cores have revolved around the idea of redundant threading, which involves redundant execution of a ... 
expand SESSION: Energy efficiency Mercury and freon: temperature emulation and management for server systems Taliver Heath, Ana Paula Centeno, Pradeep George, Luiz Ramos, Yogesh Jaluria, Ricardo Bianchini Pages: 106 - 116 doi>10.1145/1168857.1168872 Full text: Pdf Power densities have been increasing rapidly at all levels of server systems. To counter the high temperatures resulting from these densities, systems researchers have recently started work on software-based thermal management. Unfortunately, research ... expand PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor Taeho Kgil, Shaun D'Souza, Ali Saidi, Nathan Binkert, Ronald Dreslinski, Trevor Mudge, Steven Reinhardt, Krisztian Flautner Pages: 117 - 128 doi>10.1145/1168857.1168873 Full text: Pdf In this paper, we show how 3D stacking technology can be used to implement a simple, low-power, high-performance chip multiprocessor suitable for throughput processing. Our proposed architecture, PicoServer, employs 3D technology to bond one die containing ... expand SESSION: Scheduling and spatial programming A spatial path scheduling algorithm for EDGE architectures Katherine E. Coons, Xia Chen, Doug Burger, Kathryn S. McKinley, Sundeep K. Kushwaha Pages: 129 - 140 doi>10.1145/1168857.1168875 Full text: Pdf Growing on-chip wire delays are motivating architectural features that expose on-chip communication to the compiler. EDGE architectures are one example of communication-exposed microarchitectures in which the compiler forms dataflow graphs that specify ... expand Instruction scheduling for a tiled dataflow architecture Martha Mercaldi, Steven Swanson, Andrew Petersen, Andrew Putnam, Andrew Schwerin, Mark Oskin, Susan J. Eggers Pages: 141 - 150 doi>10.1145/1168857.1168876 Full text: Pdf This paper explores hierarchical instruction scheduling for a tiled processor. 
Our results show that at the top level of the hierarchy, a simple profile-driven algorithm effectively minimizes operand latency. After this schedule has been partitioned ... expand Exploiting coarse-grained task, data, and pipeline parallelism in stream programs Michael I. Gordon, William Thies, Saman Amarasinghe Pages: 151 - 162 doi>10.1145/1168857.1168877 Full text: Pdf As multicore architectures enter the mainstream, there is a pressing demand for high-level programming models that can effectively map to them. Stream programming offers an attractive way to expose coarse-grained parallelism, as streaming applications ... expand Tartan: evaluating spatial computation for whole program execution Mahim Mishra, Timothy J. Callahan, Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein, Mihai Budiu Pages: 163 - 174 doi>10.1145/1168857.1168878 Full text: Pdf Spatial Computing (SC) has been shown to be an energy-efficient model for implementing program kernels. In this paper we explore the feasibility of using SC for more than small kernels. To this end, we evaluate the performance and energy efficiency of ... expand SESSION: Estimation and prediction of power and performance A performance counter architecture for computing accurate CPI components Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, James E. Smith Pages: 175 - 184 doi>10.1145/1168857.1168880 Full text: Pdf A common way of representing processor performance is to use Cycles per Instruction (CPI) `stacks' which break performance into a baseline CPI plus a number of individual miss event CPI components. CPI stacks can be very helpful in gaining insight into ... expand Accurate and efficient regression modeling for microarchitectural performance and power prediction Benjamin C. Lee, David M. 
Brooks Pages: 185 - 194 doi>10.1145/1168857.1168881 Full text: Pdf We propose regression modeling as an efficient approach for accurately predicting performance and power for various applications executing on any microprocessor configuration in a large microarchitectural design space. This paper addresses fundamental ... expand Efficiently exploring architectural design spaces via predictive modeling Engin İpek, Sally A. McKee, Rich Caruana, Bronis R. de Supinski, Martin Schulz Pages: 195 - 206 doi>10.1145/1168857.1168882 Full text: Pdf Architects use cycle-by-cycle simulation to evaluate design choices and understand tradeoffs and interactions among design parameters. Efficiently exploring exponential-size design spaces with many interacting parameters remains an open problem: the ... expand SESSION: Races and memory debugging II Comprehensively and efficiently protecting the heap Mazen Kharbutli, Xiaowei Jiang, Yan Solihin, Guru Venkataramani, Milos Prvulovic Pages: 207 - 218 doi>10.1145/1168857.1168884 Full text: Pdf The goal of this paper is to propose a scheme that provides comprehensive security protection for the heap. Heap vulnerabilities are increasingly being exploited for attacks on computer programs. In most implementations, the heap management library keeps ... expand HeapMD: identifying heap-based bugs using anomaly detection Trishul M. Chilimbi, Vinod Ganapathy Pages: 219 - 228 doi>10.1145/1168857.1168885 Full text: Pdf We present the design, implementation, and evaluation of HeapMD, a dynamic analysis tool that finds heap-based bugs using anomaly detection. HeapMD is based upon the observation that, in spite of the evolving nature of the heap, several of its properties ... expand Recording shared memory dependencies using strata Satish Narayanasamy, Cristiano Pereira, Brad Calder Pages: 229 - 240 doi>10.1145/1168857.1168886 Full text: Pdf Significant time is spent by companies trying to reproduce and fix bugs. 
BugNet and FDR are recent architecture proposals that provide architecture support for deterministic replay debugging. They focus on continuously recording information about the ... expand SESSION: Emerging technologies A defect tolerant self-organizing nanoscale SIMD architecture Jaidev P. Patwardhan, Vijeta Johri, Chris Dwyer, Alvin R. Lebeck Pages: 241 - 251 doi>10.1145/1168857.1168888 Full text: Pdf The continual decrease in transistor size (through either scaled CMOS or emerging nano-technologies) promises to usher in an era of tera to peta-scale integration. However, this decrease in size is also likely to increase defect densities, contributing ... expand A program transformation and architecture support for quantum uncomputation Ethan Schuchman, T. N. Vijaykumar Pages: 252 - 263 doi>10.1145/1168857.1168889 Full text: Pdf Quantum computing's power comes from new algorithms that exploit quantum mechanical phenomena for computation. Quantum algorithms are different from their classical counterparts in that quantum algorithms rely on algorithmic structures that are simply ... expand Introspective 3D chips Shashidhar Mysore, Banit Agrawal, Navin Srivastava, Sheng-Chih Lin, Kaustav Banerjee, Tim Sherwood Pages: 264 - 273 doi>10.1145/1168857.1168890 Full text: Pdf While the number of transistors on a chip increases exponentially over time, the productivity that can be realized from these systems has not kept pace. To deal with the complexity of modern systems, software developers are increasingly dependent on ... expand SESSION: Memory and locality issues Stealth prefetching Jason F. Cantin, Mikko H. Lipasti, James E. Smith Pages: 274 - 282 doi>10.1145/1168857.1168892 Full text: Pdf Prefetching in shared-memory multiprocessor systems is an increasingly difficult problem. As system designs grow to incorporate larger numbers of faster processors, memory latency and interconnect traffic increase. While aggressive prefetching techniques ... 
expand Computation spreading: employing hardware migration to specialize CMP cores on-the-fly Koushik Chakraborty, Philip M. Wells, Gurindar S. Sohi Pages: 283 - 292 doi>10.1145/1168857.1168893 Full text: Pdf In canonical parallel processing, the operating system (OS) assigns a processing core to a single thread from a multithreaded server application. Since different threads from the same application often carry out similar computation, albeit at different ... expand Software-based instruction caching for embedded processors Jason E. Miller, Anant Agarwal Pages: 293 - 302 doi>10.1145/1168857.1168894 Full text: Pdf While hardware instruction caches are present in virtually all general-purpose and high-performance microprocessors today, many embedded processors use SRAM or scratchpad memories instead. These are simple array memory structures that are directly addressed ... expand SESSION: Embedded and special-purpose systems Mapping esterel onto a multi-threaded embedded processor Xin Li, Marian Boldt, Reinhard von Hanxleden Pages: 303 - 314 doi>10.1145/1168857.1168896 Full text: Pdf The synchronous language Esterel is well-suited for programming control-dominated reactive systems at the system level. It provides non-traditional control structures, in particular concurrency and various forms of preemption, which allow to concisely ... expand Integrated network interfaces for high-bandwidth TCP/IP Nathan L. Binkert, Ali G. Saidi, Steven K. Reinhardt Pages: 315 - 324 doi>10.1145/1168857.1168897 Full text: Pdf This paper proposes new network interface controller (NIC) designs that take advantage of integration with the host CPU to provide increased flexibility for operating system kernel-based performance optimization. We believe that this approach is more ... 
expand Accelerator: using data parallelism to program GPUs for general-purpose uses David Tarditi, Sidd Puri, Jose Oglesby Pages: 325 - 335 doi>10.1145/1168857.1168898 Full text: Pdf GPUs are difficult to program for general-purpose uses. Programmers can either learn graphics APIs and convert their applications to use graphics pipeline operations or they can use stream programming abstractions of GPUs. We describe Accelerator, a ... expand SESSION: Transactional memory Hybrid transactional memory Peter Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir, Daniel Nussbaum Pages: 336 - 346 doi>10.1145/1168857.1168900 Full text: Pdf Transactional memory (TM) promises to substantially reduce the difficulty of writing correct, efficient, and scalable concurrent programs. But "bounded" and "best-effort" hardware TM proposals impose unreasonable constraints on programmers, while more ... expand Unbounded page-based transactional memory Weihaw Chuang, Satish Narayanasamy, Ganesh Venkatesh, Jack Sampson, Michael Van Biesbrouck, Gilles Pokam, Brad Calder, Osvaldo Colavin Pages: 347 - 358 doi>10.1145/1168857.1168901 Full text: Pdf Exploiting thread level parallelism is paramount in the multicore era. Transactions enable programmers to expose such parallelism by greatly simplifying the multi-threaded programming model. Virtualized transactions (unbounded in space and time) are ... expand Supporting nested transactional memory in LogTM Michelle J. Moravan, Jayaram Bobba, Kevin E. Moore, Luke Yen, Mark D. Hill, Ben Liblit, Michael M. Swift, David A. Wood Pages: 359 - 370 doi>10.1145/1168857.1168902 Full text: Pdf Nested transactional memory (TM) facilitates software composition by letting one module invoke another without either knowing whether the other uses transactions. Closed nested transactions extend isolation of an inner transaction until the top-level ... 
expand Tradeoffs in transactional memory virtualization JaeWoong Chung, Chi Cao Minh, Austen McDonald, Travis Skare, Hassan Chafi, Brian D. Carlstrom, Christos Kozyrakis, Kunle Olukotun Pages: 371 - 381 doi>10.1145/1168857.1168903 Full text: Pdf For transactional memory (TM) to achieve widespread acceptance, transactions should not be limited to the physical resources of any specific hardware implementation. TM systems should guarantee correct execution even when transactions exceed scheduling ... expand SESSION: Compilation A new idiom recognition framework for exploiting hardware-assist instructions Motohiro Kawahito, Hideaki Komatsu, Takao Moriyama, Hiroshi Inoue, Toshio Nakatani Pages: 382 - 393 doi>10.1145/1168857.1168905 Full text: Pdf Modern processors support hardware-assist instructions (such as TRT and TROT instructions on IBM zSeries) to accelerate certain functions such as delimiter search and character conversion. Such special instructions have often been used in high performance ... expand Automatic generation of peephole superoptimizers Sorav Bansal, Alex Aiken Pages: 394 - 403 doi>10.1145/1168857.1168906 Full text: Pdf Peephole optimizers are typically constructed using human-written pattern matching rules, an approach that requires expertise and time, as well as being less than systematic at exploiting all opportunities for optimization. We explore fully automatic ... expand Combinatorial sketching for finite programs Armando Solar-Lezama, Liviu Tancau, Rastislav Bodik, Sanjit Seshia, Vijay Saraswat Pages: 404 - 415 doi>10.1145/1168857.1168907 Full text: Pdf Sketching is a software synthesis approach where the programmer develops a partial implementation - a sketch - and a separate specification of the desired functionality. The synthesizer then completes the sketch to behave like the specification. The ... expand A probabilistic pointer analysis for speculative optimizations Jeff Da Silva, J. 
Gregory Steffan Pages: 416 - 425 doi>10.1145/1168857.1168908 Full text: Pdf Pointer analysis is a critical compiler analysis used to disambiguate the indirect memory references that result from the use of pointers and pointer-based data structures. A conventional pointer analysis deduces for every pair of pointers, at any program ... expand 2008 Toward molecular programming with DNA Erik Winfree Pages: 1-1 doi>10.1145/1346281.1346282 Full text: Flv Mp3 Audio Only Biological organisms are beautiful examples of programming. The program and data are stored in biological molecules such as DNA, RNA, and proteins; the algorithms are carried out by molecular and biochemical processes; and the end result is the creation ... expand SESSION: Virtualization Overshadow: a virtualization-based approach to retrofitting protection in commodity operating systems Xiaoxin Chen, Tal Garfinkel, E. Christopher Lewis, Pratap Subrahmanyam, Carl A. Waldspurger, Dan Boneh, Jeffrey Dwoskin, Dan R.K. Ports Pages: 2-13 doi>10.1145/1346281.1346284 Full text: PDF Other formats: Avi Flv Mp3 Audio Only Commodity operating systems entrusted with securing sensitive data are remarkably large and complex, and consequently, frequently prone to compromise. To address this limitation, we introduce a virtual-machine-based system called Overshadow that protects ... expand How low can you go?: recommendations for hardware-supported minimal TCB code execution Jonathan M. McCune, Bryan Parno, Adrian Perrig, Michael K. Reiter, Arvind Seshadri Pages: 14-25 doi>10.1145/1346281.1346285 Full text: PDF Other formats: Flv Mp3 Audio Only We explore the extent to which newly available CPU-based security technology can reduce the Trusted Computing Base (TCB) for security-sensitive applications. We find that although this new technology represents a step in the right direction, significant ... 
expand Accelerating two-dimensional page walks for virtualized systems Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, Srilatha Manne Pages: 26-35 doi>10.1145/1346281.1346286 Full text: PDF Other formats: Flv Mp3 Audio Only Nested paging is a hardware solution for alleviating the software memory management overhead imposed by system virtualization. Nested paging complements existing page walk hardware to form a two-dimensional (2D) page walk, which reduces the need for ... expand SESSION: Power Efficiency trends and limits from comprehensive microarchitectural adaptivity Benjamin C. Lee, David Brooks Pages: 36-47 doi>10.1145/1346281.1346288 Full text: PDF Other formats: Flv Mp3 Audio Only Increasing demand for power-efficient, high-performance computing requires tuning applications and/or the underlying hardware to improve the mapping between workload heterogeneity and computational resources. To assess the potential benefits of hardware ... expand No "power" struggles: coordinated multi-level power management for the data center Ramya Raghavendra, Parthasarathy Ranganathan, Vanish Talwar, Zhikui Wang, Xiaoyun Zhu Pages: 48-59 doi>10.1145/1346281.1346289 Full text: PDF Other formats: Flv Mp3 Audio Only Power delivery, electricity consumption, and heat management are becoming key challenges in data center environments. Several past solutions have individually evaluated different techniques to address separate aspects of this problem, in hardware and ... expand Exploiting access semantics and program behavior to reduce snoop power in chip multiprocessors Chinnakrishnan S. Ballapuram, Ahmad Sharif, Hsien-Hsin S. Lee Pages: 60-69 doi>10.1145/1346281.1346290 Full text: PDF Other formats: Flv Mp3 Audio Only Integrating more processor cores on-die has become the unanimous trend in the microprocessor industry. Most of the current research thrusts using chip multiprocessors (CMPs) as the baseline to analyze problems in various domains. One of the main design ... 
expand PICSEL: measuring user-perceived performance to control dynamic frequency scaling Arindam Mallik, Jack Cosgrove, Robert P. Dick, Gokhan Memik, Peter Dinda Pages: 70-79 doi>10.1145/1346281.1346291 Full text: PDF Other formats: Flv Mp3 Audio Only The ultimate goal of a computer system is to satisfy its users. The success of architectural or system-level optimizations depends largely on having accurate metrics for user satisfaction. We propose to derive such metrics from information that is "close ... expand SESSION: Programming Improving the performance of object-oriented languages with dynamic predication of indirect jumps Jose A. Joao, Onur Mutlu, Hyesoon Kim, Rishi Agarwal, Yale N. Patt Pages: 80-90 doi>10.1145/1346281.1346293 Full text: PDF Other formats: Flv Mp3 Audio Only Indirect jump instructions are used to implement increasingly-common programming constructs such as virtual function calls, switch-case statements, jump tables, and interface calls. The performance impact of indirect jumps is likely to increase because ... expand The mapping collector: virtual memory support for generational, parallel, and concurrent compaction Michal Wegiel, Chandra Krintz Pages: 91-102 doi>10.1145/1346281.1346294 Full text: PDF Other formats: Flv Mp3 Audio Only Parallel and concurrent garbage collectors are increasingly employed by managed runtime environments (MREs) to maintain scalability, as multi-core architectures and multi-threaded applications become pervasive. Moreover, state-of-the-art MREs commonly ... expand Hardbound: architectural support for spatial safety of the C programming language Joe Devietti, Colin Blundell, Milo M. K. Martin, Steve Zdancewic Pages: 103-114 doi>10.1145/1346281.1346295 Full text: PDF Other formats: Flv Mp3 Audio Only The C programming language is at least as well known for its absence of spatial memory safety guarantees (i.e., lack of bounds checking) as it is for its high performance. 
C's unchecked pointer arithmetic and array indexing allow simple programming mistakes ... expand Archipelago: trading address space for reliability and security Vitaliy B. Lvin, Gene Novark, Emery D. Berger, Benjamin G. Zorn Pages: 115-124 doi>10.1145/1346281.1346296 Full text: PDF Other formats: Flv Mp3 Audio Only Memory errors are a notorious source of security vulnerabilities that can lead to service interruptions, information leakage and unauthorized access. Because such errors are also difficult to debug, the absence of timely patches can leave users vulnerable ... expand SESSION: Microarchitecture Accurate branch prediction for short threads Bumyong Choi, Leo Porter, Dean M. Tullsen Pages: 125-134 doi>10.1145/1346281.1346298 Full text: PDF Other formats: Flv Mp3 Audio Only Multi-core processors, with low communication costs and high availability of execution cores, will increase the use of execution and compilation models that use short threads to expose parallelism. Current branch predictors seek to incorporate large ... expand Adaptive set pinning: managing shared caches in chip multiprocessors Shekhar Srikantaiah, Mahmut Kandemir, Mary Jane Irwin Pages: 135-144 doi>10.1145/1346281.1346299 Full text: PDF Other formats: Flv Mp3 Audio Only As part of the trend towards Chip Multiprocessors (CMPs) for the next leap in computing performance, many architectures have explored sharing the last level of cache among different processors for better performance-cost ratio and improved resource allocation. ... expand SoftSig: software-exposed hardware signatures for code analysis and optimization James Tuck, Wonsun Ahn, Luis Ceze, Josep Torrellas Pages: 145-156 doi>10.1145/1346281.1346300 Full text: PDF Other formats: Flv Mp3 Audio Only Many code analysis techniques for optimization, debugging, or parallelization need to perform runtime disambiguation of sets of addresses. Such operations can be supported efficiently and with low complexity with hardware signatures. 
To enable flexible ... expand Predictor virtualization Ioana Burcea, Stephen Somogyi, Andreas Moshovos, Babak Falsafi Pages: 157-167 doi>10.1145/1346281.1346301 Full text: PDF Other formats: Flv Mp3 Audio Only Many hardware optimizations rely on collecting information about program behavior at runtime. This information is stored in lookup tables. To be accurate and effective, these optimizations usually require large dedicated on-chip tables. Although technology ... expand SESSION: Performance The design and implementation of microdrivers Vinod Ganapathy, Matthew J. Renzelmann, Arini Balakrishnan, Michael M. Swift, Somesh Jha Pages: 168-178 doi>10.1145/1346281.1346303 Full text: PDF Other formats: Flv Mp3 Audio Only Device drivers commonly execute in the kernel to achieve high performance and easy access to kernel services. However, this comes at the price of decreased reliability and increased programming difficulty. Driver programmers are unable to use user-mode ... expand Tapping into the fountain of CPUs: on operating system support for programmable devices Yaron Weinsberg, Danny Dolev, Tal Anker, Muli Ben-Yehuda, Pete Wyckoff Pages: 179-188 doi>10.1145/1346281.1346304 Full text: PDF Other formats: Flv Mp3 Audio Only The constant race for faster and more powerful CPUs is drawing to a close. No longer is it feasible to significantly increase the speed of the CPU without paying a crushing penalty in power consumption and production costs. Instead of increasing single ... expand SESSION: OS Hardware counter driven on-the-fly request signatures Kai Shen, Ming Zhong, Sandhya Dwarkadas, Chuanpeng Li, Christopher Stewart, Xiao Zhang Pages: 189-200 doi>10.1145/1346281.1346306 Full text: PDF Other formats: Flv Mp3 Audio Only Today's processors provide a rich source of statistical information on application execution through hardware counters. 
In this paper, we explore the utilization of these statistics as request signatures in server applications for identifying requests ... expand Dispersing proprietary applications as benchmarks through code mutation Luk Van Ertvelde, Lieven Eeckhout Pages: 201-210 doi>10.1145/1346281.1346307 Full text: PDF Other formats: Flv Mp3 Audio Only Industry vendors hesitate to disseminate proprietary applications to academia and third party vendors. By consequence, the benchmarking process is typically driven by standardized, open-source benchmarks which may be very different from and likely not ... expand Understanding and visualizing full systems with data flow tomography Shashidhar Mysore, Bita Mazloom, Banit Agrawal, Timothy Sherwood Pages: 211-221 doi>10.1145/1346281.1346308 Full text: PDF Other formats: Flv Mp3 Audio Only It is not uncommon for modern systems to be composed of a variety of interacting services, running across multiple machines in such a way that most developers do not really understand the whole system. As abstraction is layered atop abstraction, developers ... expand SESSION: Compiler Communication optimizations for global multi-threaded instruction scheduling Guilherme Ottoni, David I. August Pages: 222-232 doi>10.1145/1346281.1346310 Full text: PDF Other formats: Flv Mp3 Audio Only The recent shift in the industry towards chip multiprocessor (CMP) designs has brought the need for multi-threaded applications to mainstream computing. As observed in several limit studies, most of the parallelization opportunities require looking for ... expand Optimistic parallelism benefits from data partitioning Milind Kulkarni, Keshav Pingali, Ganesh Ramanarayanan, Bruce Walter, Kavita Bala, L. 
Paul Chew Pages: 233-243 doi>10.1145/1346281.1346311 Full text: PDF Other formats: Flv Mp3 Audio Only Recent studies of irregular applications such as finite-element mesh generators and data-clustering codes have shown that these applications have a generalized data parallelism arising from the use of iterative algorithms that perform computations on ... expand Xoc, an extension-oriented compiler for systems programming Russ Cox, Tom Bergan, Austin T. Clements, Frans Kaashoek, Eddie Kohler Pages: 244-254 doi>10.1145/1346281.1346312 Full text: PDF Other formats: Flv Mp3 Audio Only Today's system programmers go to great lengths to extend the languages in which they program. For instance, system-specific compilers find errors in Linux and other systems, and add support for specialized control flow to Qt and event-based programs. ... expand SESSION: Fault tolerance Adapting to intermittent faults in multicore systems Philip M. Wells, Koushik Chakraborty, Gurindar S. Sohi Pages: 255-264 doi>10.1145/1346281.1346314 Full text: PDF Other formats: Flv Mp3 Audio Only Future multicore processors will be more susceptible to a variety of hardware failures. In particular, intermittent faults, caused in part by manufacturing, thermal, and voltage variations, can cause bursts of frequent faults that last from several ... expand Understanding the propagation of hard errors to software and implications for resilient system design Man-Lap Li, Pradeep Ramachandran, Swarup Kumar Sahoo, Sarita V. Adve, Vikram S. Adve, Yuanyuan Zhou Pages: 265-276 doi>10.1145/1346281.1346315 Full text: PDF Other formats: Flv Mp3 Audio Only With continued CMOS scaling, future shipped hardware will be increasingly vulnerable to in-the-field faults. To be broadly deployable, the hardware reliability solution must incur low overheads, precluding use of expensive redundancy. We explore a cooperative ... 
expand SESSION: Parallelism Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs M. Aater Suleman, Moinuddin K. Qureshi, Yale N. Patt Pages: 277-286 doi>10.1145/1346281.1346317 Full text: PDF Other formats: Flv Mp3 Audio Only Extracting high-performance from the emerging Chip Multiprocessors (CMPs) requires that the application be divided into multiple threads. Each thread executes on a separate core thereby increasing concurrency and improving performance. As the number ... expand Merge: a programming model for heterogeneous multi-core systems Michael D. Linderman, Jamison D. Collins, Hong Wang, Teresa H. Meng Pages: 287-296 doi>10.1145/1346281.1346318 Full text: PDF Other formats: Flv Mp3 Audio Only In this paper we propose the Merge framework, a general purpose programming model for heterogeneous multi-core systems. The Merge framework replaces current ad hoc approaches to parallel programming on heterogeneous platforms with a rigorous, library-based ... expand Streamware: programming general-purpose multicore processors using streams Jayanth Gummaraju, Joel Coburn, Yoshio Turner, Mendel Rosenblum Pages: 297-307 doi>10.1145/1346281.1346319 Full text: PDF Other formats: Flv Mp3 Audio Only Recently, the number of cores on general-purpose processors has been increasing rapidly. Using conventional programming models, it is challenging to effectively exploit these cores for maximal performance. An interesting alternative candidate for programming ... expand SESSION: Security & bugs Parallelizing security checks on commodity hardware Edmund B. Nightingale, Daniel Peek, Peter M. Chen, Jason Flinn Pages: 308-318 doi>10.1145/1346281.1346321 Full text: PDF Other formats: Flv Mp3 Audio Only Speck (Speculative Parallel Check) is a system that accelerates powerful security checks on commodity hardware by executing them in parallel on multiple cores. 
Speck provides an infrastructure that allows sequential invocations of a particular ... expand Better bug reporting with better privacy Miguel Castro, Manuel Costa, Jean-Philippe Martin Pages: 319-328 doi>10.1145/1346281.1346322 Full text: PDF Other formats: Flv Mp3 Audio Only Software vendors collect bug reports from customers to improve the quality of their software. These reports should include the inputs that make the software fail, to enable vendors to reproduce the bug. However, vendors rarely include these inputs in ... expand Learning from mistakes: a comprehensive study on real world concurrency bug characteristics Shan Lu, Soyeon Park, Eunsoo Seo, Yuanyuan Zhou Pages: 329-339 doi>10.1145/1346281.1346323 Full text: PDF Other formats: Flv Mp3 Audio Only The reality of multi-core hardware has made concurrent programs pervasive. Unfortunately, writing correct concurrent programs is difficult. Addressing this challenge requires advances in multiple directions, including concurrency bug detection, concurrent ... expand 2009 An evaluation of the TRIPS computer system Mark Gebhart, Bertrand A. Maher, Katherine E. Coons, Jeff Diamond, Paul Gratz, Mario Marino, Nitya Ranganathan, Behnam Robatmili, Aaron Smith, James Burrill, Stephen W. Keckler, Doug Burger, Kathryn S. McKinley Pages: 1-12 doi>10.1145/1508244.1508246 Full text: Pdf The TRIPS system employs a new instruction set architecture (ISA) called Explicit Data Graph Execution (EDGE) that renegotiates the boundary between hardware and software to expose and exploit concurrency. EDGE ISAs use a block-atomic execution model ... expand Architectural implications of nanoscale integrated sensing and computing Constantin Pistol, Wutichai Chongchitmate, Christopher Dwyer, Alvin R. Lebeck Pages: 13-24 doi>10.1145/1508244.1508247 Full text: Pdf This paper explores the architectural implications of integrating computation and molecular probes to form nanoscale sensor processors (nSP). 
We show how nSPs may enable new computing domains and automate tasks that currently require expert scientific ... expand SESSION: Reliable systems I CTrigger: exposing atomicity violation bugs from their hiding places Soyeon Park, Shan Lu, Yuanyuan Zhou Pages: 25-36 doi>10.1145/1508244.1508249 Full text: Pdf Multicore hardware is making concurrent programs pervasive. Unfortunately, concurrent programs are prone to bugs. Among different types of concurrency bugs, atomicity violation bugs are common and important. Existing techniques to detect atomicity violation ... expand ASSURE: automatic software self-healing using rescue points Stelios Sidiroglou, Oren Laadan, Carlos Perez, Nicolas Viennot, Jason Nieh, Angelos D. Keromytis Pages: 37-48 doi>10.1145/1508244.1508250 Full text: Pdf Software failures in server applications are a significant problem for preserving system availability. We present ASSURE, a system that introduces rescue points that recover software from unknown faults while maintaining both system integrity and availability, ... expand Recovery domains: an organizing principle for recoverable operating systems Andrew Lenharth, Vikram S. Adve, Samuel T. King Pages: 49-60 doi>10.1145/1508244.1508251 Full text: Pdf We describe a strategy for enabling existing commodity operating systems to recover from unexpected run-time errors in nearly any part of the kernel, including core kernel components. Our approach is dynamic and request-oriented; it isolates the effects ... expand Anomaly-based bug prediction, isolation, and validation: an automated approach for software debugging Martin Dimitrov, Huiyang Zhou Pages: 61-72 doi>10.1145/1508244.1508252 Full text: Pdf Software defects, commonly known as bugs, present a serious challenge for system reliability and dependability. Once a program failure is observed, the debugging activities to locate the defects are typically nontrivial and time consuming. In this paper, ... 
expand SESSION: Deterministic multiprocessing Capo: a software-hardware interface for practical deterministic multiprocessor replay Pablo Montesinos, Matthew Hicks, Samuel T. King, Josep Torrellas Pages: 73-84 doi>10.1145/1508244.1508254 Full text: Pdf While deterministic replay of parallel programs is a powerful technique, current proposals have shortcomings. Specifically, software-based replay systems have high overheads on multiprocessors, while hardware-based proposals focus only on basic hardware-level ... expand DMP: deterministic shared memory multiprocessing Joseph Devietti, Brandon Lucia, Luis Ceze, Mark Oskin Pages: 85-96 doi>10.1145/1508244.1508255 Full text: Pdf Current shared memory multicore and multiprocessor systems are nondeterministic. Each time these systems execute a multithreaded application, even if supplied with the same input, they can produce a different output. This frustrates debugging and limits ... expand Kendo: efficient deterministic multithreading in software Marek Olszewski, Jason Ansel, Saman Amarasinghe Pages: 97-108 doi>10.1145/1508244.1508256 Full text: Pdf Although chip-multiprocessors have become the industry standard, developing parallel applications that target them remains a daunting task. Non-determinism, inherent in threaded applications, causes significant challenges for parallel programmers by ... expand SESSION: Prediction and accounting Complete information flow tracking from the gates up Mohit Tiwari, Hassan M.G. Wassel, Bita Mazloom, Shashidhar Mysore, Frederic T. Chong, Timothy Sherwood Pages: 109-120 doi>10.1145/1508244.1508258 Full text: Pdf For many mission-critical tasks, tight guarantees on the flow of information are desirable, for example, when handling important cryptographic keys or sensitive financial data. We present a novel architecture capable of tracking all information flow ... expand RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations David K. 
Tam, Reza Azimi, Livio B. Soares, Michael Stumm Pages: 121-132 doi>10.1145/1508244.1508259 Full text: Pdf Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been assumed ... expand Per-thread cycle accounting in SMT processors Stijn Eyerman, Lieven Eeckhout Pages: 133-144 doi>10.1145/1508244.1508260 Full text: Pdf This paper proposes a cycle accounting architecture for Simultaneous Multithreading (SMT) processors that estimates the execution times for each of the threads had they been executed alone, while they are running simultaneously on the SMT processor. ... expand SESSION: Transactional memories Maximum benefit from a minimal HTM Owen S. Hofmann, Christopher J. Rossbach, Emmett Witchel Pages: 145-156 doi>10.1145/1508244.1508262 Full text: Pdf A minimal, bounded hardware transactional memory implementation significantly improves synchronization performance when used in an operating system kernel. We add HTM to Linux 2.4, a kernel with a simple, coarse-grained synchronization structure. The ... expand Early experience with a commercial hardware transactional memory implementation Dave Dice, Yossi Lev, Mark Moir, Daniel Nussbaum Pages: 157-168 doi>10.1145/1508244.1508263 Full text: Pdf We report on our experience with the hardware transactional memory (HTM) feature of two pre-production revisions of a new commercial multicore processor. Our experience includes a number of promising results using HTM to improve performance in a variety ... expand SESSION: Reliable systems II Mixed-mode multicore reliability Philip M. Wells, Koushik Chakraborty, Gurindar S. Sohi Pages: 169-180 doi>10.1145/1508244.1508265 Full text: Pdf Future processors are expected to observe increasing rates of hardware faults. 
Using Dual-Modular Redundancy (DMR), two cores of a multicore can be loosely coupled to redundantly execute a single software thread, providing very high coverage from many ... expand ISOLATOR: dynamically ensuring isolation in comcurrent programs Sriram Rajamani, G. Ramalingam, Venkatesh Prasad Ranganath, Kapil Vaswani Pages: 181-192 doi>10.1145/1508244.1508266 Full text: Pdf In this paper, we focus on concurrent programs that use locks to achieve isolation of data accessed by critical sections of code. We present ISOLATOR, an algorithm that guarantees isolation for well-behaved threads of a program that obey a locking discipline ... expand Efficient online validation with delta execution Joseph Tucek, Weiwei Xiong, Yuanyuan Zhou Pages: 193-204 doi>10.1145/1508244.1508267 Full text: Pdf Software systems are constantly changing. Patches to fix bugs and patches to add features are all too common. Every change risks breaking a previously working system. Hence administrators loathe change, and are willing to delay even critical security ... expand SESSION: Power and storage in enterprise systems PowerNap: eliminating server idle power David Meisner, Brian T. Gold, Thomas F. Wenisch Pages: 205-216 doi>10.1145/1508244.1508269 Full text: Pdf Data center power consumption is growing to unprecedented levels: the EPA estimates U.S. data centers will consume 100 billion kilowatt hours annually by 2011. Much of this energy is wasted in idle systems: in typical deployments, server utilization ... expand Gordon: using flash memory to build fast, power-efficient clusters for data-intensive applications Adrian M. Caulfield, Laura M. Grupp, Steven Swanson Pages: 217-228 doi>10.1145/1508244.1508270 Full text: Pdf As our society becomes more information-driven, we have begun to amass data at an astounding and accelerating rate. At the same time, power concerns have made it difficult to bring the necessary processing power to bear on querying, processing, and understanding ... 
expand DFTL: a flash translation layer employing demand-based selective caching of page-level address mappings Aayush Gupta, Youngjae Kim, Bhuvan Urgaonkar Pages: 229-240 doi>10.1145/1508244.1508271 Full text: Pdf Recent technological advances in the development of flash-memory based devices have consolidated their leadership position as the preferred storage media in the embedded systems market and opened new vistas for deployment in enterprise-scale storage ... expand SESSION: Potpourri Commutativity analysis for software parallelization: letting program transformations see the big picture Farhana Aleen, Nathan Clark Pages: 241-252 doi>10.1145/1508244.1508273 Full text: Pdf Extracting performance from many-core architectures requires software engineers to create multi-threaded applications, which significantly complicates the already daunting task of software development. One solution to this problem is automatic compile-time ... expand Accelerating critical section execution with asymmetric multi-core architectures M. Aater Suleman, Onur Mutlu, Moinuddin K. Qureshi, Yale N. Patt Pages: 253-264 doi>10.1145/1508244.1508274 Full text: Pdf To improve the performance of a single application on Chip Multiprocessors (CMPs), the application must be split into threads which execute concurrently on multiple cores. In multi-threaded applications, critical sections are used to ensure that only ... expand Producing wrong data without doing anything obviously wrong! Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, Peter F. Sweeney Pages: 265-276 doi>10.1145/1508244.1508275 Full text: Pdf This paper presents a surprising result: changing a seemingly innocuous aspect of an experimental setup can cause a systems researcher to draw wrong conclusions from an experiment. What appears to be an innocuous aspect in the experimental setup may ... expand SESSION: Managed systems Leak pruning Michael D. Bond, Kathryn S. 
McKinley Pages: 277-288 doi>10.1145/1508244.1508277 Full text: Pdf Managed languages improve programmer productivity with type safety and garbage collection, which eliminate memory errors such as dangling pointers, double frees, and buffer overflows. However, because garbage collection uses reachability to over-approximate ... expand Dynamic prediction of collection yield for managed runtimes Michal Wegiel, Chandra Krintz Pages: 289-300 doi>10.1145/1508244.1508278 Full text: Pdf The growth in complexity of modern systems makes it increasingly difficult to extract high-performance. The software stacks for such systems typically consist of multiple layers and include managed runtime environments (MREs). In this paper, we investigate ... expand TwinDrivers: semi-automatic derivation of fast and safe hypervisor network drivers from guest OS drivers Aravind Menon, Simon Schubert, Willy Zwaenepoel Pages: 301-312 doi>10.1145/1508244.1508279 Full text: Pdf In a virtualized environment, device drivers are often run inside a virtual machine (VM) rather than in the hypervisor, for reasons of safety and reduction in software engineering effort. Unfortunately, this approach results in poor performance for I/O-intensive ... expand SESSION: Architectures Phantom-BTB: a virtualized branch target buffer design Ioana Burcea, Andreas Moshovos Pages: 313-324 doi>10.1145/1508244.1508281 Full text: Pdf Modern processors use branch target buffers (BTBs) to predict the target address of branches such that they can fetch ahead in the instruction stream increasing concurrency and performance. Ideally, BTBs would be sufficiently large to capture the entire ... 
expand StreamRay: a stream filtering architecture for coherent ray tracing Karthik Ramani, Christiaan P. Gribble, Al Davis Pages: 325-336 doi>10.1145/1508244.1508282 Full text: Pdf The wide availability of commodity graphics processors has made real-time graphics an intrinsic component of the human/computer interface. These graphics cores accelerate the z-buffer algorithm and provide a highly interactive experience at a relatively ... expand Architectural support for SWAR text processing with parallel bit streams: the inductive doubling principle Robert D. Cameron, Dan Lin Pages: 337-348 doi>10.1145/1508244.1508283 Full text: Pdf Parallel bit stream algorithms exploit the SWAR (SIMD within a register) capabilities of commodity processors in high-performance text processing applications such as UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression matching. ... expand 2010 Technology for developing regions: Moore's law is not enough Eric A. Brewer Pages: 1-2 doi>10.1145/1736020.1736021 Full text: PDF The historic focus of development has rightfully been on macroeconomics and good governance, but technology has an increasingly large role to play. In this talk, I review several novel technologies that we have deployed in India and Africa, and discuss ... expand SESSION: Novel architectures Dynamically replicated memory: building reliable systems from nanoscale resistive memories Engin Ipek, Jeremy Condit, Edmund B. Nightingale, Doug Burger, Thomas Moscibroda Pages: 3-14 doi>10.1145/1736020.1736023 Full text: PDF DRAM is facing severe scalability challenges in sub-45nm tech- nology nodes due to precise charge placement and sensing hur- dles in deep-submicron geometries. Resistive memories, such as phase-change memory (PCM), already scale well beyond DRAM and ... expand A power-efficient all-optical on-chip interconnect using wavelength-based oblivious routing Nevin Kirman, JosŽ F. 
Martínez
Pages: 15-28 doi>10.1145/1736020.1736024
We present an all-optical approach to constructing data networks on chip that combines the following key features: (1) Wavelength-based routing, where the route followed by a packet depends solely on the wavelength of its carrier signal, and not on information ...

SESSION: Compilers and run-time systems

A real system evaluation of hardware atomicity for software speculation
Naveen Neelakantam, David R. Ditzel, Craig Zilles
Pages: 29-38 doi>10.1145/1736020.1736026
In this paper we evaluate the atomic region compiler abstraction by incorporating it into a commercial system. We find that atomic regions are simple and intuitive to integrate into an x86 binary-translation system. Furthermore, doing so trivially enables ...

Dynamic filtering: multi-purpose architecture support for language runtime systems
Tim Harris, Saša Tomić, Adrián Cristal, Osman Unsal
Pages: 39-52 doi>10.1145/1736020.1736027
This paper introduces a new abstraction to accelerate the read-barriers and write-barriers used by language runtime systems. We exploit the fact that, dynamically, many barrier executions perform checks but no real work -- e.g., in generational garbage ...

SESSION: Parallel programming 1

CoreDet: a compiler and runtime system for deterministic multithreaded execution
Tom Bergan, Owen Anderson, Joseph Devietti, Luis Ceze, Dan Grossman
Pages: 53-64 doi>10.1145/1736020.1736029
The behavior of a multithreaded program does not depend only on its inputs. Scheduling, memory reordering, timing, and low-level hardware effects all introduce nondeterminism in the execution of multithreaded programs. This severely complicates many ...

Speculative parallelization using software multi-threaded transactions
Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, David I.
August
Pages: 65-76 doi>10.1145/1736020.1736030
With the right techniques, multicore architectures may be able to continue the exponential performance trend that elevated the performance of applications of all types for decades. While many scientific programs can be parallelized without speculative ...

Respec: efficient online multiprocessor replay via speculation and external determinism
Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, Jason Flinn
Pages: 77-90 doi>10.1145/1736020.1736031
Deterministic replay systems record and reproduce the execution of a hardware or software system. While it is well known how to replay uniprocessor systems, replaying shared memory multiprocessor systems at low overhead on commodity hardware is still ...

SESSION: Scheduling in parallel systems

Probabilistic job symbiosis modeling for SMT processor scheduling
Stijn Eyerman, Lieven Eeckhout
Pages: 91-102 doi>10.1145/1736020.1736033
Symbiotic job scheduling boosts simultaneous multithreading (SMT) processor performance by co-scheduling jobs that have 'compatible' demands on the processor's shared resources. Existing approaches however require a sampling phase, evaluate a limited ...

Request behavior variations
Kai Shen
Pages: 103-116 doi>10.1145/1736020.1736034
A large number of user requests execute (often concurrently) within a server system. A single request may exhibit fluctuating hardware characteristics (such as instruction completion rate and on-chip resource usage) over the course of its execution, ...

Decoupling contention management from scheduling
F. Ryan Johnson, Radu Stoica, Anastasia Ailamaki, Todd C. Mowry
Pages: 117-128 doi>10.1145/1736020.1736035
Many parallel applications exhibit unpredictable communication between threads, leading to contention for shared objects.
The choice of contention management strategy impacts strongly the performance and scalability of these applications: spinning provides ...

Addressing shared resource contention in multicore processors via scheduling
Sergey Zhuravlev, Sergey Blagodurov, Alexandra Fedorova
Pages: 129-142 doi>10.1145/1736020.1736036
Contention for shared resources on multicore processors remains an unsolved problem in existing systems despite significant research efforts dedicated to this problem in the past. Previous solutions focused primarily on hardware techniques and software ...

SESSION: Software reliability

SherLog: error diagnosis by connecting clues from run-time logs
Ding Yuan, Haohui Mai, Weiwei Xiong, Lin Tan, Yuanyuan Zhou, Shankar Pasupathy
Pages: 143-154 doi>10.1145/1736020.1736038
Computer systems often fail due to many factors such as software bugs or administrator errors. Diagnosing such production run failures is an important but challenging task since it is difficult to reproduce them in house due to various reasons: (1) unavailability ...

Analyzing multicore dumps to facilitate concurrency bug reproduction
Dasarath Weeratunge, Xiangyu Zhang, Suresh Jagannathan
Pages: 155-166 doi>10.1145/1736020.1736039
Debugging concurrent programs is difficult. This is primarily because the inherent non-determinism that arises because of scheduler interleavings makes it hard to easily reproduce bugs that may manifest only under certain interleavings. The problem is ...

A randomized scheduler with probabilistic guarantees of finding bugs
Sebastian Burckhardt, Pravesh Kothari, Madanlal Musuvathi, Santosh Nagarakatte
Pages: 167-178 doi>10.1145/1736020.1736040
This paper presents a randomized scheduler for finding concurrency bugs. Like current stress-testing methods, it repeatedly runs a given test program with supplied inputs.
However, it improves on stress-testing by finding buggy schedules more effectively ...

ConMem: detecting severe concurrency bugs through an effect-oriented approach
Wei Zhang, Chong Sun, Shan Lu
Pages: 179-192 doi>10.1145/1736020.1736041
Multicore technology is making concurrent programs increasingly pervasive. Unfortunately, it is difficult to deliver reliable concurrent programs, because of the huge and non-deterministic interleaving space. In reality, without the resources to thoroughly ...

SESSION: Hardware power and energy

Characterizing processor thermal behavior
Francisco Javier Mesa-Martinez, Ehsan K. Ardestani, Jose Renau
Pages: 193-204 doi>10.1145/1736020.1736043
Temperature is a dominant factor in the performance, reliability, and leakage power consumption of modern processors. As a result, increasing numbers of researchers evaluate thermal characteristics in their proposals. In this paper, we measure a real ...

Conservation cores: reducing the energy of mature computations
Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, Michael Bedford Taylor
Pages: 205-218 doi>10.1145/1736020.1736044
Growing transistor counts, limited power budgets, and the breakdown of voltage scaling are currently conspiring to create a utilization wall that limits the fraction of a chip that can run at full speed at one time. In this regime, specialized, ...

Micro-pages: increasing DRAM efficiency with locality-aware data placement
Kshitij Sudan, Niladrish Chatterjee, David Nellans, Manu Awasthi, Rajeev Balasubramonian, Al Davis
Pages: 219-230 doi>10.1145/1736020.1736045
Power consumption and DRAM latencies are serious concerns in modern chip-multiprocessor (CMP or multi-core) based compute systems. The management of the DRAM row buffer can significantly impact both power consumption and latency. Modern DRAM systems ...
SESSION: Data centers

Power routing: dynamic power provisioning in the data center
Steven Pelley, David Meisner, Pooya Zandevakili, Thomas F. Wenisch, Jack Underwood
Pages: 231-242 doi>10.1145/1736020.1736047
Data center power infrastructure incurs massive capital costs, which typically exceed energy costs over the life of the facility. To squeeze maximum value from the infrastructure, researchers have proposed over-subscribing power circuits, relying on ...

Joint optimization of idle and cooling power in data centers while maintaining response time
Faraz Ahmad, T. N. Vijaykumar
Pages: 243-256 doi>10.1145/1736020.1736048
Server power and cooling power amount to a significant fraction of modern data centers' recurring costs. While data centers provision enough servers to guarantee response times under the maximum loading, data centers operate under much less loading most ...

SESSION: Hardware monitoring

Butterfly analysis: adapting dataflow analysis to dynamic parallel monitoring
Michelle L. Goodstein, Evangelos Vlachos, Shimin Chen, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry
Pages: 257-270 doi>10.1145/1736020.1736050
Online program monitoring is an effective technique for detecting bugs and security attacks in running applications. Extending these tools to monitor parallel programs is challenging because the tools must account for inter-thread dependences and relaxed ...

ParaLog: enabling and accelerating online parallel monitoring of multithreaded applications
Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Babak Falsafi, Phillip B. Gibbons, Todd C. Mowry
Pages: 271-284 doi>10.1145/1736020.1736051
Instruction-grain lifeguards monitor the events of a running application at the level of individual instructions in order to identify and help mitigate application bugs and security exploits. Because such lifeguards impose a 10-100X slowdown on ...
SESSION: Parallel programming 2

MacroSS: macro-SIMDization of streaming applications
Amir H. Hormati, Yoonseo Choi, Mark Woh, Manjunath Kudlur, Rodric Rabbah, Trevor Mudge, Scott Mahlke
Pages: 285-296 doi>10.1145/1736020.1736053
SIMD (Single Instruction, Multiple Data) engines are an essential part of the processors in various computing markets, from servers to the embedded domain. Although SIMD-enabled architectures have the capability of boosting the performance of many application ...

COMPASS: a programmable data prefetcher using idle GPU shaders
Dong Hyuk Woo, Hsien-Hsin S. Lee
Pages: 297-310 doi>10.1145/1736020.1736054
A traditional fixed-function graphics accelerator has evolved into a programmable general-purpose graphics processing unit over the last few years. These powerful computing cores are mainly used for accelerating graphics applications or enabling low-cost ...

Flexible architectural support for fine-grain scheduling
Daniel Sanchez, Richard M. Yoo, Christos Kozyrakis
Pages: 311-322 doi>10.1145/1736020.1736055
To make efficient use of CMPs with tens to hundreds of cores, it is often necessary to exploit fine-grain parallelism. However, managing tasks of a few thousand instructions is particularly challenging, as the runtime must ensure load balance without ...

SESSION: Parallel memory systems

Specifying and dynamically verifying address translation-aware memory consistency
Bogdan F. Romanescu, Alvin R. Lebeck, Daniel J. Sorin
Pages: 323-334 doi>10.1145/1736020.1736057
Computer systems with virtual memory are susceptible to design bugs and runtime faults in their address translation (AT) systems. Detecting bugs and faults requires a clear specification of correct behavior. To address this need, we develop a framework ...
Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems
Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, Yale N. Patt
Pages: 335-346 doi>10.1145/1736020.1736058
Cores in a chip-multiprocessor (CMP) system share multiple hardware resources in the memory subsystem. If resource sharing is unfair, some applications can be delayed significantly while others are unfairly prioritized. Previous research proposed separate ...

An asymmetric distributed shared memory model for heterogeneous parallel systems
Isaac Gelado, John E. Stone, Javier Cabezas, Sanjay Patel, Nacho Navarro, Wen-mei W. Hwu
Pages: 347-358 doi>10.1145/1736020.1736059
Heterogeneous computing combines general purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to ...

Inter-core cooperative TLB for chip multiprocessors
Abhishek Bhattacharjee, Margaret Martonosi
Pages: 359-370 doi>10.1145/1736020.1736060
Translation Lookaside Buffers (TLBs) are commonly employed in modern processor designs and have considerable impact on overall system performance. A number of past works have studied TLB designs to lower access times and miss rates, specifically for ...

SESSION: Security and hardware reliability

Orthrus: efficient software integrity protection on multi-cores
Ruirui Huang, Daniel Y. Deng, G. Edward Suh
Pages: 371-384 doi>10.1145/1736020.1736062
This paper proposes an efficient hardware/software system that significantly enhances software security through diversified replication on multi-cores. Recent studies show that a large class of software attacks can be detected by running multiple versions ...
Shoestring: probabilistic soft error reliability on the cheap
Shuguang Feng, Shantanu Gupta, Amin Ansari, Scott Mahlke
Pages: 385-396 doi>10.1145/1736020.1736063
Aggressive technology scaling provides designers with an ever increasing budget of cheaper and faster transistors. Unfortunately, this trend is accompanied by a decline in individual device reliability as transistors become increasingly susceptible to ...

Virtualized and flexible ECC for main memory
Doe Hyun Yoon, Mattan Erez
Pages: 397-408 doi>10.1145/1736020.1736064
We present a general scheme for virtualizing main memory error-correction mechanisms, which map redundant information needed to correct errors into the memory namespace itself. We rely on this basic idea, which increases flexibility to increase error ...

2011

The cloud will change everything
James R. Larus
Pages: 1-2 doi>10.1145/1950365.1950367
Cloud computing is fast on its way to becoming a meaningless, oversold marketing slogan. In the midst of this hype, it is easy to overlook the fundamental change that is occurring. Computation, which used to be confined to the machine beside your desk, ...

SESSION: Better logging support for software debugging Michael Swift

Improving software diagnosability via log enhancement
Ding Yuan, Jing Zheng, Soyeon Park, Yuanyuan Zhou, Stefan Savage
Pages: 3-14 doi>10.1145/1950365.1950369
Diagnosing software failures in the field is notoriously difficult, in part due to the fundamental complexity of trouble-shooting any complex software system, but further exacerbated by the paucity of information that is typically available in the production ...

DoublePlay: parallelizing sequential logging and replay
Kaushik Veeraraghavan, Dongyoon Lee, Benjamin Wester, Jessica Ouyang, Peter M.
Chen, Jason Flinn, Satish Narayanasamy
Pages: 15-26 doi>10.1145/1950365.1950370
Deterministic replay systems record and reproduce the execution of a hardware or software system. In contrast to replaying execution on uniprocessors, deterministic replay on multiprocessors is very challenging to implement efficiently because of the ...

SESSION: Understanding and improving transactional memory Michael Swift

Hardware acceleration of transactional memory on commodity systems
Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan G. Bronson, Christos Kozyrakis, Kunle Olukotun
Pages: 27-38 doi>10.1145/1950365.1950372
The adoption of transactional memory is hindered by the high overhead of software transactional memory and the intrusive design changes required by previously proposed TM hardware. We propose that hardware to accelerate software transactional memory ...

Hybrid NOrec: a case study in the effectiveness of best effort hardware transactional memory
Luke Dalessandro, François Carouge, Sean White, Yossi Lev, Mark Moir, Michael L. Scott, Michael F. Spear
Pages: 39-52 doi>10.1145/1950365.1950373
Transactional memory (TM) is a promising synchronization mechanism for the next generation of multicore processors. Best-effort Hardware Transactional Memory (HTM) designs, such as Sun's prototype Rock processor and AMD's proposed Advanced Synchronization ...

SESSION: Innovations in memory ordering models for parallel machines James Laudon

Efficient processor support for DRFx, a memory model with exceptions
Abhayendra Singh, Daniel Marino, Satish Narayanasamy, Todd Millstein, Madan Musuvathi
Pages: 53-66 doi>10.1145/1950365.1950375
A longstanding challenge of shared-memory concurrency is to provide a memory model that allows for efficient implementation while providing strong and simple guarantees to programmers. The C++0x and Java memory models admit a wide variety of compiler ...
RCDC: a relaxed consistency deterministic computer
Joseph Devietti, Jacob Nelson, Tom Bergan, Luis Ceze, Dan Grossman
Pages: 67-78 doi>10.1145/1950365.1950376
Providing deterministic execution significantly simplifies the debugging, testing, replication, and deployment of multithreaded programs. Recent work has developed deterministic multiprocessor architectures as well as compiler and runtime systems that ...

Specifying and checking semantic atomicity for multithreaded programs
Jacob Burnim, George Necula, Koushik Sen
Pages: 79-90 doi>10.1145/1950365.1950377
In practice, it is quite difficult to write correct multithreaded programs due to the potential for unintended and nondeterministic interference between parallel threads. A fundamental correctness property for such programs is atomicity---a block of ...

SESSION: Programming for persistent memory Thomas F. Wenisch

Mnemosyne: lightweight persistent memory
Haris Volos, Andres Jaan Tack, Michael M. Swift
Pages: 91-104 doi>10.1145/1950365.1950379
New storage-class memory (SCM) technologies, such as phase-change memory, STT-RAM, and memristors, promise user-level access to non-volatile storage through regular memory instructions. These memory devices enable fast user-mode access to persistence, ...

NV-Heaps: making persistent objects fast and safe with next-generation, non-volatile memories
Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, Steven Swanson
Pages: 105-118 doi>10.1145/1950365.1950380
Persistent, user-defined objects present an attractive abstraction for working with non-volatile program state. However, the slow speed of persistent storage (i.e., disk) has restricted their design and limited their performance. Fast, byte-addressable, ...
SESSION: Enhancing device driver reliability Yuanyuan Zhou

A declarative language approach to device configuration
Adrian Schüpbach, Andrew Baumann, Timothy Roscoe, Simon Peter
Pages: 119-132 doi>10.1145/1950365.1950382
C remains the language of choice for hardware programming (device drivers, bus configuration, etc.): it is fast, allows low-level access, and is trusted by OS developers. However, the algorithms required to configure and reconfigure hardware devices ...

Improved device driver reliability through hardware verification reuse
Leonid Ryzhyk, John Keys, Balachandra Mirla, Arun Raghunath, Mona Vij, Gernot Heiser
Pages: 133-144 doi>10.1145/1950365.1950383
Faulty device drivers are a major source of operating system failures. We argue that the underlying cause of many driver faults is the separation of two highly-related tasks: device verification and driver development. These two tasks have a lot in common, ...

SESSION: Novel computing platforms Luis Ceze

A case for neuromorphic ISAs
Atif Hashmi, Andrew Nere, James Jamal Thomas, Mikko Lipasti
Pages: 145-158 doi>10.1145/1950365.1950385
The desire to create novel computing systems, paired with recent advances in neuroscientific understanding of the brain, has led researchers to develop neuromorphic architectures that emulate the brain. To date, such models are developed, trained, and ...

Mementos: system support for long-running computation on RFID-scale devices
Benjamin Ransford, Jacob Sorber, Kevin Fu
Pages: 159-170 doi>10.1145/1950365.1950386
Transiently powered computing devices such as RFID tags, kinetic energy harvesters, and smart cards typically rely on programs that complete a task under tight time constraints before energy starvation leads to complete loss of volatile memory. Mementos ...
Pocket cloudlets
Emmanouil Koukoumidis, Dimitrios Lymberopoulos, Karin Strauss, Jie Liu, Doug Burger
Pages: 171-184 doi>10.1145/1950365.1950387
Cloud services accessed through mobile devices suffer from high network access latencies and are constrained by energy budgets dictated by the devices' batteries. Radio and battery technologies will improve over time, but are still expected to be the ...

SESSION: Saving power and energy Jim Larus

Blink: managing server clusters on intermittent power
Navin Sharma, Sean Barker, David Irwin, Prashant Shenoy
Pages: 185-198 doi>10.1145/1950365.1950389
Reducing the energy footprint of data centers continues to receive significant attention due to both its financial and environmental impact. There are numerous methods that limit the impact of both factors, such as expanding the use of renewable energy ...

Dynamic knobs for responsive power-aware computing
Henry Hoffmann, Stelios Sidiroglou, Michael Carbin, Sasa Misailovic, Anant Agarwal, Martin Rinard
Pages: 199-212 doi>10.1145/1950365.1950390
We present PowerDial, a system for dynamically adapting application behavior to execute successfully in the face of load and power fluctuations. PowerDial transforms static configuration parameters into dynamic knobs that the PowerDial control system ...

Flikker: saving DRAM refresh-power through critical data partitioning
Song Liu, Karthik Pattabiraman, Thomas Moscibroda, Benjamin G. Zorn
Pages: 213-224 doi>10.1145/1950365.1950391
Energy has become a first-class design constraint in computer systems. Memory is a significant contributor to total system power. This paper introduces Flikker, an application-level technique to reduce refresh power in DRAM memories. Flikker enables ...

MemScale: active low-power modes for main memory
Qingyuan Deng, David Meisner, Luiz Ramos, Thomas F.
Wenisch, Ricardo Bianchini
Pages: 225-238 doi>10.1145/1950365.1950392
Main memory is responsible for a large and increasing fraction of the energy consumed by servers. Prior work has focused on exploiting DRAM low-power states to conserve energy. However, these states require entire DRAM ranks to be idled, which is difficult ...

SESSION: Recognizing software and concurrency bugs Emery Berger

2ndStrike: toward manifesting hidden concurrency typestate bugs
Qi Gao, Wenbin Zhang, Zhezhe Chen, Mai Zheng, Feng Qin
Pages: 239-250 doi>10.1145/1950365.1950394
Concurrency bugs are becoming increasingly prevalent in the multi-core era. Recently, much research has focused on data races and atomicity violation bugs, which are related to low-level memory accesses. However, a large number of concurrency typestate ...

ConSeq: detecting concurrency bugs through sequential errors
Wei Zhang, Junghee Lim, Ramya Olichandran, Joel Scherpelz, Guoliang Jin, Shan Lu, Thomas Reps
Pages: 251-264 doi>10.1145/1950365.1950395
Concurrency bugs are caused by non-deterministic interleavings between shared memory accesses. Their effects propagate through data and control dependences until they cause software to crash, hang, produce incorrect output, etc. The lifecycle of a bug ...

S2E: a platform for in-vivo multi-path analysis of software systems
Vitaly Chipounov, Volodymyr Kuznetsov, George Candea
Pages: 265-278 doi>10.1145/1950365.1950396
This paper presents S2E, a platform for analyzing the properties and behavior of software systems. We demonstrate S2E's use in developing practical tools for comprehensive performance profiling, reverse engineering of proprietary software, and bug finding ...

SESSION: Rethinking and protecting operating systems Orran Krieger

Ensuring operating system kernel integrity with OSck
Owen S. Hofmann, Alan M.
Dunn, Sangman Kim, Indrajit Roy, Emmett Witchel
Pages: 279-290 doi>10.1145/1950365.1950398
Kernel rootkits that modify operating system state to avoid detection are a dangerous threat to system security. This paper presents OSck, a system that discovers kernel rootkits by detecting malicious modifications to operating system data. OSck integrates ...

Rethinking the library OS from the top down
Donald E. Porter, Silas Boyd-Wickizer, Jon Howell, Reuben Olinsky, Galen C. Hunt
Pages: 291-304 doi>10.1145/1950365.1950399
This paper revisits an old approach to operating system construction, the library OS, in a new context. The idea of the library OS is that the personality of the OS on which an application depends runs in the address space of the application. A small, ...

SESSION: Learning from the past: drawing conclusions from extensive measurement studies Orran Krieger

Faults in Linux: ten years later
Nicolas Palix, Gaël Thomas, Suman Saha, Christophe Calvès, Julia Lawall, Gilles Muller
Pages: 305-318 doi>10.1145/1950365.1950401
In 2001, Chou et al. published a study of faults found by applying a static analyzer to Linux versions 1.0 through 2.4.1. A major result of their work was that the drivers directory contained up to 7 times more of certain kinds of faults than other directories. ...

Looking back on the language and hardware revolutions: measured power, performance, and scaling
Hadi Esmaeilzadeh, Ting Cao, Yang Xi, Stephen M. Blackburn, Kathryn S. McKinley
Pages: 319-332 doi>10.1145/1950365.1950402
This paper reports and analyzes measured chip power and performance on five process technology generations executing 61 diverse benchmarks with a rigorous methodology. We measure representative Intel IA32 processors with technologies ranging from 130nm ...
SESSION: New compiler optimizations Scott Mahlke

Synthesizing concurrent schedulers for irregular algorithms
Donald Nguyen, Keshav Pingali
Pages: 333-344 doi>10.1145/1950365.1950404
Scheduling is the assignment of tasks or activities to processors for execution, and it is an important concern in parallel programming. Most prior work on scheduling has focused either on static scheduling of applications in which the dependence graph ...

Exploring circuit timing-aware language and compilation
Giang Hoang, Robby Bruce Findler, Russ Joseph
Pages: 345-356 doi>10.1145/1950365.1950405
By adjusting the design of the ISA and enabling circuit timing-sensitive optimizations in a compiler, we can more effectively exploit timing speculation. While there has been growing interest in systems that leverage circuit-level timing speculation ...

Orchestration by approximation: mapping stream programs onto multicore architectures
Sardar M. Farhad, Yousun Ko, Bernd Burgstaller, Bernhard Scholz
Pages: 357-368 doi>10.1145/1950365.1950406
We present a novel 2-approximation algorithm for deploying stream graphs on multicore computers and a stream graph transformation that eliminates bottlenecks. The key technical insight is a data rate transfer model that enables the computation of a "closed ...

SESSION: Exploiting parallelism on GPUs Kunle Olukotun

On-the-fly elimination of dynamic irregularities for GPU computing
Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, Xipeng Shen
Pages: 369-380 doi>10.1145/1950365.1950408
The power-efficient massively parallel Graphics Processing Units (GPUs) have become increasingly influential for general-purpose computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control ...

Sponge: portable stream programming on graphics engines
Amir H.
Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, Scott Mahlke
Pages: 381-392 doi>10.1145/1950365.1950409
Graphics processing units (GPUs) provide a low cost platform for accelerating high performance computations. The introduction of new programming languages, such as CUDA and OpenCL, makes GPU programming attractive to a wide variety of programmers. However, ...

SESSION: Novel performance improvements Kunle Olukotun

Inter-core prefetching for multicore processors using migrating helper threads
Md Kamruzzaman, Steven Swanson, Dean M. Tullsen
Pages: 393-404 doi>10.1145/1950365.1950411
Multicore processors have become ubiquitous in today's systems, but exploiting the parallelism they offer remains difficult, especially for legacy application and applications with large serial components. The challenge, then, is to develop techniques ...

Improving the performance of trace-based systems by false loop filtering
Hiroshige Hayashizaki, Peng Wu, Hiroshi Inoue, Mauricio J. Serrano, Toshio Nakatani
Pages: 405-418 doi>10.1145/1950365.1950412
Trace-based compilation is a promising technique for language compilers and binary translators. It offers the potential to expand the compilation scopes that have traditionally been limited by method boundaries. Detecting repeating cyclic execution paths ...