1987

Hardware architectures for programming languages and programming languages for hardware architectures
Niklaus Wirth
Pages: 2 - 8
doi>10.1145/36206.36178

Programming Languages and Operating Systems introduce abstractions which allow the programmer to ignore details of an implementation. Support of an abstraction must not only concentrate on promoting the efficiency of an implementation, but also on providing ...
VLSI assist for a multiprocessor
Bob Beck, Bob Kasten, Shreekant Thakkar
Pages: 10 - 20
doi>10.1145/36206.36179

Multiprocessors have long been of interest to computer community. They provide the potential for accelerating applications through parallelism and increased throughput for large multi-user system. Three factors have limited the commercial success of ...
Architectural support for multilanguage parallel programming on heterogeneous systems
Roberto Bisiani, Alessandro Forin
Pages: 21 - 30
doi>10.1145/36206.36180

We have designed and implemented a software facility, called Agora, that supports the development of parallel applications written in multiple languages. At the core of Agora there is a mechanism that allows concurrent computations to share data structures ...
Machine-independent virtual memory management for paged uniprocessor and multiprocessor architectures
Richard Rashid, Avadis Tevanian, Michael Young, David Golub, Robert Baron, David Black, William Bolosky, Jonathan Chew
Pages: 31 - 39
doi>10.1145/36206.36181

This paper describes the design and implementation of virtual memory management within the CMU Mach Operating System and the experiences gained by the Mach kernel group in porting that system to a variety of architectures. As of this writing, Mach runs ...
An architecture for the direct execution of the Forth programming language
John R. Hayes, Martin E. Fraeman, Robert L. Williams, Thomas Zaremba
Pages: 42 - 49
doi>10.1145/36206.36182

We have developed a simple direct execution architecture for a 32 bit Forth microprocessor. The processor can directly access a linear address space of over 4 gigawords. Two instruction types are defined; a subroutine call, and a user defined microcode ...
Tags and type checking in LISP: hardware and software approaches
Peter Steenkiste, John Hennessy
Pages: 50 - 59
doi>10.1145/36206.36183

One of the major factors that distinguishes LISP from many other languages (Pascal, C, Fortran, etc.) is the need for run-time type checking. Run-time type checking is implemented by adding to each data object a tag that encodes type information. Tags ...
The effect of instruction set complexity on program size and memory performance
Jack W. Davidson, Richard A. Vaughan
Pages: 60 - 64
doi>10.1145/36206.36184

One potential disadvantage of a machine with a reduced instruction set is that object programs may be substantially larger than those for a machine with a richer, more complex instruction set. The main reason is that a small instruction set will require ...
The dragon processor
Russell R. Atkinson, Edward M. McCreight
Pages: 65 - 69
doi>10.1145/36206.36185

The Xerox PARC Dragon is a VLSI research computer that uses several techniques to achieve dense code and fast procedure calls in a system that can support multiple processors on a central high bandwidth memory bus.
Coherency for multiprocessor virtual address caches
James R. Goodman
Pages: 72 - 81
doi>10.1145/36206.36186

A multiprocessor cache memory system is described that supplies data to the processor based on virtual addresses, but maintains consistency in the main memory, both across caches and across virtual address spaces. Pages in the same or different address ...
Cheap hardware support for software debugging and profiling
T. A. Cargill, B. N. Locanthi
Pages: 82 - 83
doi>10.1145/36206.36187

We wish to determine the effectiveness of some simple hardware for debugging and profiling compiled programs on a conventional processor. The hardware cost is small -- a counter decremented on each instruction that raises an exception when its value ...
An experimental coprocessor for implementing persistent objects on an IBM 4381
C. J. Georgiou, S. L. Palmer, P. L. Rosenfeld
Pages: 84 - 87
doi>10.1145/36206.36188

In this paper we describe an experimental coprocessor for an IBM 4381 that is designed to facilitate the exploration of persistent objects.
Integer multiplication and division on the HP precision architecture
Daniel J. Magenheimer, Liz Peters, Karl Pettis, Dan Zuras
Pages: 90 - 99
doi>10.1145/36206.36189

In recent years, many architectural design efforts have focused on maximizing performance for frequently executed, simple instructions. Although these efforts have resulted in machines with better average price/performance ratios, certain complex instructions ...
The Mahler experience: using an intermediate language as the machine description
David W. Wall, Michael L. Powell
Pages: 100 - 104
doi>10.1145/36206.36190

Division of a compiler into a front end and a back end that communicate via an intermediate language is a well-known technique. We go farther and use the intermediate language as the official description of a family of machines with simple instruction ...
A study of scalar compilation techniques for pipelined supercomputers
Shlomo Weiss, James E. Smith
Pages: 105 - 109
doi>10.1145/36206.36191

This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture (size of the register file) and of the hardware (size ...
Compiling Smalltalk-80 to a RISC
William R. Bush, A. Dain Samples, David Ungar, Paul N. Hilfinger
Pages: 112 - 116
doi>10.1145/36206.36192

The Smalltalk On A RISC project at U. C. Berkeley proves that a high-level object-oriented language can attain high performance on a modified reduced instruction set architecture. The single most important optimization is the removal of a layer of interpretation, ...
How many addressing modes are enough?
F. Chow, S. Correll, M. Himelstein, E. Killian, L. Weber
Pages: 117 - 121
doi>10.1145/36206.36193

Programs naturally require a variety of memory-addressing modes. It isn't necessary to provide them in hardware, however, if a compiler can synthesize them from a few primitive modes. This not only simplifies the hardware, but also permits the compiler ...
Superoptimizer: a look at the smallest program
Henry Massalin
Pages: 122 - 126
doi>10.1145/36206.36194

Given an instruction set, the superoptimizer finds the shortest program to compute a function. Startling programs have been generated, many of them engaging in convoluted bit-fiddling bearing little resemblance to the source programs which defined the ...
Performance and architectural evaluation of the PSI machine
Kazuo Taki, Katsuto Nakajima, Hiroshi Nakashima, Morihiro Ikeda
Pages: 128 - 135
doi>10.1145/36206.36195

We evaluated a Prolog machine PSI (Personal Sequential Inference machine) for the purpose of improving and redesigning it. In this evaluation, we measured the execution speed and the dynamic characteristics of cache memory, register file, and branching ...
RISCs vs. CISCs for Prolog: a case study
Gaetano Borriello, Andrew R. Cherenson, Peter B. Danzig, Michael N. Nelson
Pages: 136 - 145
doi>10.1145/36206.36196

This paper compares the performance of executing compiled Prolog code on two different architectures under development at U. C. Berkeley. The first is the PLM, a special-purpose CISC architecture intended as a coprocessor for a host machine. The second ...
A RISC architecture for symbolic computation
Richard B. Kieburtz
Pages: 146 - 155
doi>10.1145/36206.36197

The G-machine is a language-directed processor architecture designed to support graph reduction as a model of computation. It can carry out lazy evaluation of functional language programs and can evaluate programs in which logical variables are used. ...
Design tradeoffs to support the C programming language in the CRISP microprocessor
David R. Ditzel, Hubert R. McLellan, Alan D. Berenbaum
Pages: 158 - 163
doi>10.1145/36206.36198
Firefly: a multiprocessor workstation
Charles P. Thacker, Lawrence C. Stewart
Pages: 164 - 172
doi>10.1145/36206.36199

Firefly is a shared-memory multiprocessor workstation that contains from one to seven MicroVAX 78032 processors, each with a floating point unit and a sixteen kilobyte cache. The caches are coherent, so that all processors see a consistent view of main ...
Pipelining and performance in the VAX 8800 processor
Douglas W. Clark
Pages: 173 - 177
doi>10.1145/36206.36200

The VAX 8800 family (models 8800, 8700, 8550), currently the fastest computers in the VAX product line, achieve their speed through a combination of fast cycle time and deep pipelining. Rather than pipeline highly variable VAX instructions as such, the ...
A VLIW architecture for a trace scheduling compiler
Robert P. Colwell, Robert P. Nix, John J. O'Donnell, David B. Papworth, Paul K. Rodman
Pages: 180 - 192
doi>10.1145/36206.36201

Very Long Instruction Word (VLIW) architectures were promised to deliver far more than the factor of two or three that current architectures achieve from overlapped execution. Using a new type of compiler which compacts ordinary sequential code into ...
Parallel computers for graphics applications
Adam Levinthal, Pat Hanrahan, Mike Paquette, Jim Lawson
Pages: 193 - 198
doi>10.1145/36206.36202

Specialized computer architectures can provide better price/performance for executing image processing and graphics applications than general purpose designs. Two processors are presented that use parallel SIMD data paths to support common graphics data ...
The ZS-1 central processor
J. E. Smith, G. E. Dermer, B. D. Vanderwarn, S. D. Klinger, C. M. Rozewski
Pages: 199 - 204
doi>10.1145/36206.36203

The Astronautics ZS-1 is a high speed, 64-bit computer system designed for scientific and engineering applications. The ZS-1 central processor uses a decoupled architecture, which splits instructions into two streams---one for fixed point/memory address ...

1989

Architecture and compiler tradeoffs for a long instruction word processor
Robert Cohn, Thomas Gross, Monica Lam
Pages: 2 - 14
doi>10.1145/70082.68183

A very long instruction word (VLIW) processor exploits parallelism by controlling multiple operations in a single instruction word. This paper describes the architecture and compiler tradeoffs in the design of iWarp, a VLIW single-chip microprocessor ...
Tradeoffs in instruction format design for horizontal architectures
Gurindar S. Sohi, Sriram Vajapeyam
Pages: 15 - 25
doi>10.1145/70082.68184

With recent improvements in software techniques and the enhanced level of fine grain parallelism made available by such techniques, there has been an increased interest in horizontal architectures and large instruction words that are capable of issuing ...
Overlapped loop support in the Cydra 5
James C. Dehnert, Peter Y.-T. Hsu, Joseph P. Bratt
Pages: 26 - 38
doi>10.1145/70082.68185

The Cydra™ 5 architecture adds unique support for overlapping successive iterations of a loop to a very long instruction word (VLIW) base. This architecture allows highly parallel loop execution for a much larger ...
Architectural support for synchronous task communication
F. J. Burkowski, G. V. Cormack, G. D. P. Dueck
Pages: 40 - 53
doi>10.1145/70082.68186

This paper describes the motivation for a set of intertask communication primitives, the hardware support of these primitives, the architecture used in the Sylvan project which studies these issues, and the experience gained from various experiments ...
The fuzzy barrier: a mechanism for high speed synchronization of processors
Rajiv Gupta
Pages: 54 - 63
doi>10.1145/70082.68187

Parallel programs are commonly written using barriers to synchronize parallel processes. Upon reaching a barrier, a processor must stall until all participating processors reach the barrier. A software implementation of the barrier mechanism using shared ...
Efficient synchronization primitives for large-scale cache-coherent multiprocessors
James R. Goodman, Mary K. Vernon, Philip J. Woest
Pages: 64 - 75
doi>10.1145/70082.68188

This paper proposes a set of efficient primitives for process synchronization in multiprocessors. The only assumptions made in developing the set of primitives are that hardware combining is not implemented in the inter-connect, and (in one case) that ...
A software instruction counter
J. M. Mellor-Crummey, T. J. LeBlanc
Pages: 78 - 86
doi>10.1145/70082.68189

Although several recent papers have proposed architectural support for program debugging and profiling, most processors do not yet provide even basic facilities, such as an instruction counter. As a result, system developers have been forced to invent ...
Efficient debugging primitives for multiprocessors
Z. Aral, I. Gertner, G. Schaffer
Pages: 87 - 95
doi>10.1145/70082.68190

Existing kernel-level debugging primitives are inappropriate for instrumenting complex sequential or parallel programs. These functions incur a heavy overhead in their use of system calls and process switches. Context switches are used to alternately ...
Sheaved memory: architectural support for state saving and restoration in paged systems
M. E. Staknis
Pages: 96 - 102
doi>10.1145/70082.68191

The concept of read-one/write-many paged memory is introduced and given the name sheaved memory. It is shown that sheaved memory is useful for efficiently maintaining checkpoints in main memory and for providing state saving and state ...
Reference history, page size, and migration daemons in local/remote architectures
M. A. Holliday
Pages: 104 - 112
doi>10.1145/70082.68192

We address the problem of paged main memory management in the local/remote architecture subclass of shared memory multiprocessors. We consider the case where the operating system has primary responsibility and uses page migration as its main tool. We ...
Translation lookaside buffer consistency: a software approach
D. L. Black, R. F. Rashid, D. B. Golub, C. R. Hill
Pages: 113 - 122
doi>10.1145/70082.68193

We discuss the translation lookaside buffer (TLB) consistency problem for multiprocessors, and introduce the Mach shootdown algorithm for maintaining TLB consistency in software. This algorithm has been implemented on several multiprocessors, and is ...
Failure correction techniques for large disk arrays
G. A. Gibson, L. Hellerstein, R. M. Karp, D. A. Patterson
Pages: 123 - 132
doi>10.1145/70082.68194

The ever increasing need for I/O bandwidth will be met with ever larger arrays of disks. These arrays require redundancy to protect against data loss. This paper examines alternative choices for encodings, or codes, that reliably store information ...
A unified vector/scalar floating-point architecture
N. P. Jouppi, J. Bertoni, D. W. Wall
Pages: 134 - 143
doi>10.1145/70082.68195

In this paper we present a unified approach to vector and scalar computation, using a single register file for both scalar operands and vector elements. The goal of this architecture is to yield improved scalar performance while broadening the range ...
Data buffering: run-time versus compile-time support
H. Mulder
Pages: 144 - 151
doi>10.1145/70082.68196

Data-dependency, branch, and memory-access penalties are main constraints on the performance of high-speed microprocessors. The memory-access penalties concern both penalties imposed by external memory (e.g. cache) or by under utilization of the local ...
An analysis of 8086 instruction set usage in MS DOS programs
T. L. Adams, R. E. Zimmerman
Pages: 152 - 160
doi>10.1145/70082.68197
A real-time support processor for Ada tasking
J. Roos
Pages: 162 - 171
doi>10.1145/70082.68198

Task synchronization in Ada causes excessive run-time overhead due to the complex semantics of the rendezvous. To demonstrate that the speed can be increased by two orders of magnitude by using special purpose hardware, a single chip VLSI support processor ...
The runtime environment for Scheme, a Scheme implementation on the 88000
Steven R. Vegdahl, Uwe F. Pleban
Pages: 172 - 182
doi>10.1145/70082.68199

We are implementing a Scheme development system for the Motorola 88000. The core of the implementation is an optimizing native code compiler, together with a carefully designed runtime system. This paper describes our experiences with the 88000 as a ...
Program optimization for instruction caches
S. McFarling
Pages: 183 - 191
doi>10.1145/70082.68200

This paper presents an optimization algorithm for reducing instruction cache misses. The algorithm uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and ...
Using registers to optimize cross-domain call performance
Paul A. Karger
Pages: 194 - 204
doi>10.1145/70082.68201

This paper describes a new technique to improve the performance of cross-domain calls and returns in a capability-based computer system. Using register optimization information obtained from the compiler, a trusted linker can minimize the number of registers ...
The design of Nectar: a network backplane for heterogeneous multicomputers
Emmanuel Arnould, H. T. Kung, Francois Bitz, Robert D. Sansom, Eric C. Cooper
Pages: 205 - 216
doi>10.1145/70082.68202

Nectar is a "network backplane" for use in heterogeneous multicomputers. The initial system consists of a star-shaped fiber-optic network with an aggregate bandwidth of 1.6 gigabits/second and a switching latency of 700 nanoseconds. The system ...
A message driven OR-parallel machine
S. A. Delgado-Rannauro, T. J. Reynolds
Pages: 217 - 228
doi>10.1145/70082.68203

A message driven architecture for the execution of OR-parallel logic languages is proposed. The computational model is based on well known compilation techniques for Logic Languages. We present first the multiple binding mechanism for the OR-parallel ...
Evaluating the performance of software cache coherence
S. Owicki, A. Agarwal
Pages: 230 - 242
doi>10.1145/70082.68204

In a shared-memory multiprocessor with private caches, cached copies of a data item must be kept consistent. This is called cache coherence. Both hardware and software coherence schemes have been proposed. Software techniques are attractive because they ...
Analysis of cache invalidation patterns in multiprocessors
W. Weber, A. Gupta
Pages: 243 - 256
doi>10.1145/70082.68205

To make shared-memory multiprocessors scalable, researchers are now exploring cache coherence protocols that do not rely on broadcast, but instead send invalidation messages to individual caches that contain stale data. The feasibility of such directory-based ...
The effect of sharing on the cache and bus performance of parallel programs
S. J. Eggers, R. H. Katz
Pages: 257 - 270
doi>10.1145/70082.68206

Bus bandwidth ultimately limits the performance, and therefore the scale, of bus-based, shared memory multiprocessors. Previous studies have extrapolated from uniprocessor measurements and simulations to estimate the performance of these machines. ...
Available instruction-level parallelism for superscalar and superpipelined machines
N. P. Jouppi, D. W. Wall
Pages: 272 - 282
doi>10.1145/70082.68207

Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to ...
Micro-optimization of floating-point operations
W. J. Dally
Pages: 283 - 289
doi>10.1145/70082.68208

This paper describes micro-optimization, a technique for reducing the operation count and time required to perform floating-point calculations. Micro-optimization involves breaking floating-point operations into their constituent micro-operations and ...
Limits on multiple instruction issue
M. D. Smith, M. Johnson, M. A. Horowitz
Pages: 290 - 302
doi>10.1145/70082.68209

This paper investigates the limitations on designing a processor which can sustain an execution rate of greater than one instruction per cycle on highly-optimized, non-scientific applications. We have used trace-driven simulations to determine that these ...

1991


A variable instruction stream extension to the VLIW architecture
Andrew Wolfe, John P. Shen
Pages: 2 - 14
doi>10.1145/106972.106976
Reducing the branch penalty by rearranging instructions in a double-width memory
Manolis Katevenis, Nestoras Tzartzanis
Pages: 15 - 27
doi>10.1145/106972.106977
The floating point performance of a superscalar SPARC processor
Roland L. Lee, Alex Y. Kwok, Fayé A. Briggs
Pages: 28 - 37
doi>10.1145/106972.106978
Software prefetching
David Callahan, Ken Kennedy, Allan Porterfield
Pages: 40 - 52
doi>10.1145/106972.106979
High-bandwidth data memory systems for superscalar processors
Gurindar S. Sohi, Manoj Franklin
Pages: 53 - 62
doi>10.1145/106972.106980
The cache performance and optimizations of blocked algorithms
Monica D. Lam, Edward E. Rothberg, Michael E. Wolf
Pages: 63 - 74
doi>10.1145/106972.106981
The effect of context switches on cache performance
Jeffrey C. Mogul, Anita Borg
Pages: 75 - 84
doi>10.1145/106972.106982
A portable interface for on-the-fly instruction space modification
David Keppel
Pages: 86 - 95
doi>10.1145/106972.106983
Virtual memory primitives for user programs
Andrew W. Appel, Kai Li
Pages: 96 - 107
doi>10.1145/106972.106984
The interaction of architecture and operating system design
Thomas E. Anderson, Henry M. Levy, Brian N. Bershad, Edward D. Lazowska
Pages: 108 - 120
doi>10.1145/106972.106985
Integrating register allocation and instruction scheduling for RISCs
David G. Bradlee, Susan J. Eggers, Robert R. Henry
Pages: 122 - 131
doi>10.1145/106972.106986
Code generation for streaming: an access/execute mechanism
Manuel E. Benitez, Jack W. Davidson
Pages: 132 - 141
doi>10.1145/106972.106987
Efficient implementation of high-level parallel programs
Rajive Bagrodia, Sharad Mathur
Pages: 142 - 151
doi>10.1145/106972.376053
Vector register design for polycyclic vector scheduling
William Mangione-Smith, Santosh G. Abraham, Edward S. Davidson
Pages: 154 - 163
doi>10.1145/106972.328664
Fine-grain parallelism with minimal hardware support: a compiler-controlled threaded abstract machine
David E. Culler, Anurag Sah, Klaus E. Schauser, Thorsten von Eicken, John Wawrzynek
Pages: 164 - 175
doi>10.1145/106972.106990
Limits of instruction-level parallelism
David W. Wall
Pages: 176 - 188
doi>10.1145/106972.106991
Performance consequences of parity placement in disk arrays
Edward K. Lee, Randy H. Katz
Pages: 190 - 199
doi>10.1145/106972.106992
Combining the concepts of compression and caching for a two-level filesystem
Vincent Cate, Thomas Gross
Pages: 200 - 211
doi>10.1145/106972.106993
NUMA policies and their relation to memory architecture
William J. Bolosky, Michael L. Scott, Robert P. Fitzgerald, Robert J. Fowler, Alan L. Cox
Pages: 212 - 221
doi>10.1145/106972.106994
LimitLESS directories: A scalable cache coherence scheme
David Chaiken, John Kubiatowicz, Anant Agarwal
Pages: 224 - 234
doi>10.1145/106972.106995
An efficient cache-based access anomaly detection scheme
Sang L. Min, Jong-Deok Choi
Pages: 235 - 244
doi>10.1145/106972.106996
Performance evaluation of memory consistency models for shared-memory multiprocessors
Kourosh Gharachorloo, Anoop Gupta, John Hennessy
Pages: 245 - 257
doi>10.1145/106972.106997
Process coordination with fetch-and-increment
Eric Freudenthal, Allan Gottlieb
Pages: 260 - 268
doi>10.1145/106972.106998
Synchronization without contention
John M. Mellor-Crummey, Michael L. Scott
Pages: 269 - 278
doi>10.1145/106972.106999
The case for a read barrier
Douglas Johnson
Pages: 279 - 287
doi>10.1145/106972.107000
An analysis of MIPS and SPARC instruction set utilization on the SPEC benchmarks
Robert F. Cmelik, Shing I. Kong, David R. Ditzel, Edmund J. Kelly
Pages: 290 - 302
doi>10.1145/106972.107001
Performance characteristics of architectural features of the IBM RISC System/6000
C. Brian Hall, Kevin O'Brien
Pages: 303 - 309
doi>10.1145/106972.107002
Performance from architecture: comparing a RISC and a CISC with similar hardware organization
Dileep Bhandarkar, Douglas W. Clark
Pages: 310 - 319
doi>10.1145/106972.107003

1992

On-line data compression in a log-structured file system
Michael Burrows, Charles Jerian, Butler Lampson, Timothy Mann
Pages: 2 - 9
doi>10.1145/143365.143376
Non-volatile memory for fast, reliable file systems
Mary Baker, Satoshi Asami, Etienne Deprit, John Ousterhout, Margo Seltzer
Pages: 10 - 22
doi>10.1145/143365.143380
Parity declustering for continuous operation in redundant disk arrays
Mark Holland, Garth A. Gibson
Pages: 23 - 35
doi>10.1145/143365.143383
Software support for speculative loads
Anne Rogers, Kai Li
Pages: 38 - 50
doi>10.1145/143365.143484
Reducing memory latency via non-blocking and prefetching caches
Tien-Fu Chen, Jean-Loup Baer
Pages: 51 - 61
doi>10.1145/143365.143486
Design and evaluation of a compiler algorithm for prefetching
Todd C. Mowry, Monica S. Lam, Anoop Gupta
Pages: 62 - 73
doi>10.1145/143365.143488
Improving the accuracy of dynamic branch prediction using branch correlation
Shien-Tai Pan, Kimming So, Joseph T. Rahmeh
Pages: 76 - 84
doi>10.1145/143365.143490
Predicting conditional branch directions from previous runs of a program
Joseph A. Fisher, Stefan M. Freudenberger
Pages: 85 - 95
doi>10.1145/143365.143493
High speed switch scheduling for local area networks
Thomas E. Anderson, Susan S. Owicki, James B. Saxe, Charles P. Thacker
Pages: 98 - 110
doi>10.1145/143365.143495
A tightly-coupled processor-network interface
Dana S. Henry, Christopher F. Joerg
Pages: 111 - 122
doi>10.1145/143365.143497
Consistency management for virtually indexed caches
Bob Wheeler, Brian N. Bershad
Pages: 124 - 136
doi>10.1145/143365.143499
Eliminating the address translation bottleneck for physical address cache
Tzi-cker Chiueh, Randy H. Katz
Pages: 137 - 148
doi>10.1145/143365.143501
A performance evaluation of optimal hybrid cache coherency protocols
Jack E. Veenstra, Robert J. Fowler
Pages: 149 - 160
doi>10.1145/143365.143503
Characterizing the caching and synchronization performance of a multiprocessor operating system
Josep Torrellas, Anoop Gupta, John Hennessy
Pages: 162 - 174
doi>10.1145/143365.143506
Architecture support for single address space operating systems
Eric J. Koldinger, Jeffrey S. Chase, Susan J. Eggers
Pages: 175 - 186
doi>10.1145/143365.143508
Application-controlled physical memory using external page-cache management
Kieran Harty, David R. Cheriton
Pages: 187 - 197
doi>10.1145/143365.143511
Efficient data breakpoints
Robert Wahbe
Pages: 200 - 212
doi>10.1145/143365.143518
Migrating a CISC computer family onto RISC via object code translation
Kristy Andrews, Duane Sand
Pages: 213 - 222
doi>10.1145/143365.143520
Fast mutual exclusion for uniprocessors
Brian N. Bershad, David D. Redell, John R. Ellis
Pages: 223 - 233
doi>10.1145/143365.143523

In this paper we describe restartable atomic sequences, an optimistic mechanism for implementing simple atomic operations (such as Test-And-Set) on a uniprocessor. A thread that is suspended within a restartable atomic ...
Sentinel scheduling for VLIW and superscalar processors
Scott A. Mahlke, William Y. Chen, Wen-mei W. Hwu, B. Ramakrishna Rau, Michael S. Schlansker
Pages: 238 - 247
doi>10.1145/143365.143529

Speculative execution is an important source of parallelism for VLIW and superscalar processors. A serious challenge with compiler-controlled speculative execution is to accurately detect and report all program execution errors at the time of occurrence. ...
Efficient superscalar performance through boosting
Michael D. Smith, Mark Horowitz, Monica S. Lam
Pages: 248 - 259
doi>10.1145/143365.143534

The foremost goal of superscalar processor design is to increase performance through the exploitation of instruction-level parallelism (ILP). Previous studies have shown that speculative execution is required for high instruction per cycle (IPC) rates ...
Cooperative shared memory: software and hardware for scalable multiprocessor
Mark D. Hill, James R. Larus, Steven K. Reinhardt, David A. Wood
Pages: 262 - 273
doi>10.1145/143365.143537

We believe the absence of massively-parallel, shared-memory machines follows from the lack of a shared-memory programming performance model that can inform programmers of the cost of operations (so they can avoid expensive ones) and can tell hardware ...
Closing the window of vulnerability in multiphase memory transactions
John Kubiatowicz, David Chaiken, Anant Agarwal
Pages: 274 - 284
doi>10.1145/143365.143540

Multiprocessor architects have begun to explore several mechanisms such as prefetching, context-switching and software-assisted dynamic cache-coherence, which transform single-phase memory transactions in conventional memory systems into multiphase operations. ...
Access normalization: loop restructuring for NUMA compilers
Wei Li, Keshav Pingali
Pages: 285 - 295
doi>10.1145/143365.143541

In scalable parallel machines, processors can make local memory accesses much faster than they can make remote memory accesses. In addition, when a number of remote accesses must be made, it is usually more efficient to use block transfers of data rather ...

1994


Separating data and control transfer in distributed operating systems
Chandramohan A. Thekkath, Henry M. Levy, Edward D. Lazowska
Pages: 2 - 11
doi>10.1145/195473.195481

Advances in processor architecture and technology have resulted in workstations in the 100+ MIPS range. As well, newer local-area networks such as ATM promise a ten- to hundred-fold increase in throughput, much reduced latency, greater scalability, and ...
Scheduling and page migration for multiprocessor compute servers
Rohit Chandra, Scott Devine, Ben Verghese, Anoop Gupta, Mendel Rosenblum
Pages: 12 - 24
doi>10.1145/195473.195485
Full text: Pdf

Several cache-coherent shared-memory multiprocessors have been developed that are scalable and offer a very tight coupling between the processing resources. They are therefore quite attractive for use as compute servers for multiprogramming and parallel ... expand
Reactive synchronization algorithms for multiprocessors
Beng-Hong Lim, Anant Agarwal
Pages: 25 - 35
doi>10.1145/195473.195490
Full text: Pdf

Synchronization algorithms that are efficient across a wide range of applications and operating conditions are hard to design because their performance depends on unpredictable run-time factors. The designer of a synchronization algorithm has a choice ... expand
Integration of message passing and shared memory in the Stanford FLASH multiprocessor
John Heinlein, Kourosh Gharachorloo, Scott Dresser, Anoop Gupta
Pages: 38 - 50
doi>10.1145/195473.195494
Full text: Pdf

The advantages of using message passing over shared memory for certain types of communication and synchronization have provided an incentive to integrate both models within a single architecture. A key goal of the FLASH (FLexible Architecture for SHared ... expand
Software overhead in messaging layers: where does the time go?
Vijay Karamcheti, Andrew A. Chien
Pages: 51 - 60
doi>10.1145/195473.195499
Full text: Pdf

Despite improvements in network interfaces and software messaging layers, software communication overhead still dominates the hardware routing cost in most systems. In this study, we identify the sources of this overhead by analyzing software costs of ... expand
Where is time spent in message-passing and shared-memory programs?
Satish Chandra, James R. Larus, Anne Rogers
Pages: 61 - 73
doi>10.1145/195473.195501
Full text: Pdf

Message passing and shared memory are two techniques parallel programs use for coordination and communication. This paper studies the strengths and weaknesses of these two mechanisms by comparing equivalent, well-written message-passing and shared-memory ... expand
Performance of a hardware-assisted real-time garbage collector
William J. Schmidt, Kelvin D. Nilsen
Pages: 76 - 85
doi>10.1145/195473.195504
Full text: Pdf

Hardware-assisted real-time garbage collection offers high throughput and small worst-case bounds on the times required to allocate dynamic objects and to access the memory contained within previously allocated objects. Whether the proposed technology ... expand
eNVy: a non-volatile, main memory storage system
Michael Wu, Willy Zwaenepoel
Pages: 86 - 97
doi>10.1145/195473.195506
Full text: Pdf

This paper describes the architecture of eNVy, a large non-volatile main memory storage system built primarily with Flash memory. eNVy presents its storage space as a linear, memory mapped array rather than as an emulated disk in order to provide an ... expand
Resource allocation in a high clock rate microprocessor
Michael Upton, Thomas Huff, Trevor Mudge, Richard Brown
Pages: 98 - 109
doi>10.1145/195473.195510
Full text: Pdf

This paper discusses the design of a high clock rate (300MHz) processor. The architecture is described, and the goals for the design are explained. The performance of three processor models is evaluated using trace-driven simulation. A cost model is ... expand
Hardware and software support for efficient exception handling
Chandramohan A. Thekkath, Henry M. Levy
Pages: 110 - 119
doi>10.1145/195473.195515
Full text: Pdf

Program-synchronous exceptions, for example, breakpoints, watchpoints, illegal opcodes, and memory access violations, provide information about exceptional conditions, interrupting the program and vectoring to an operating system handler. ... expand
A technique for monitoring run-time dynamics of an operating system and a microprocessor executing user applications
Pramod V. Argade, David K. Charles, Craig Taylor
Pages: 122 - 131
doi>10.1145/195473.195518
Full text: Pdf

In this paper, we present a non-invasive and efficient technique for simulating applications complete with their operating system interaction. The technique involves booting and initiating an application on a hardware development system, capturing the ... expand
Trap-driven simulation with Tapeworm II
Richard Uhlig, David Nagle, Trevor Mudge, Stuart Sechrest
Pages: 132 - 144
doi>10.1145/195473.195521
Full text: Pdf

Tapeworm II is a software-based simulation tool that evaluates the cache and TLB performance of multiple-task and operating system intensive workloads. Tapeworm resides in an OS kernel and causes a host machine's hardware to drive simulations with kernel ... expand
Contrasting characteristics and cache performance of technical and multi-user commercial workloads
Ann Marie Grizzaffi Maynard, Colette M. Donnelly, Bret R. Olszewski
Pages: 145 - 156
doi>10.1145/195473.195524
Full text: Pdf

Experience has shown that many widely used benchmarks are poor predictors of the performance of systems running commercial applications. Research into this anomaly has long been hampered by a lack of address traces from representative multi-user commercial ... expand
Avoiding conflict misses dynamically in large direct-mapped caches
Brian N. Bershad, Dennis Lee, Theodore H. Romer, J. Bradley Chen
Pages: 158 - 170
doi>10.1145/195473.195527
Full text: Pdf

This paper describes a method for improving the performance of a large direct-mapped cache by reducing the number of conflict misses. Our solution consists of two components: an inexpensive hardware device called a Cache Miss Lookaside (CML) buffer that ... expand
Surpassing the TLB performance of superpages with less operating system support
Madhusudhan Talluri, Mark D. Hill
Pages: 171 - 182
doi>10.1145/195473.195531
Full text: Pdf

Many commercial microprocessor architectures have added translation lookaside buffer (TLB) support for superpages. Superpages differ from segments because their size must be a power of two multiple of the base page size ... expand
Dynamic memory disambiguation using the memory conflict buffer
David M. Gallagher, William Y. Chen, Scott A. Mahlke, John C. Gyllenhaal, Wen-mei W. Hwu
Pages: 183 - 193
doi>10.1145/195473.195534
Full text: Pdf

To exploit instruction level parallelism, compilers for VLIW and superscalar processors often employ static code scheduling. However, the available code reordering may be severely restricted due to ambiguous dependences between memory instructions. This ... expand
AP1000+: architectural support of PUT/GET interface for parallelizing compiler
Kenichi Hayashi, Tsunehisa Doi, Takeshi Horie, Yoichi Koyanagi, Osamu Shiraki, Nobutaka Imamura, Toshiyuki Shimizu, Hiroaki Ishihata, Tatsuya Shindo
Pages: 196 - 207
doi>10.1145/195473.195538
Full text: Pdf

The scalability of distributed-memory parallel computers makes them attractive candidates for solving large-scale problems. New languages, such as HPF, FortranD, and VPP Fortran, have been developed to enable existing software to be easily ported to ... expand
LCM: memory system support for parallel language implementation
James R. Larus, Brad Richards, Guhan Viswanathan
Pages: 208 - 218
doi>10.1145/195473.195545
Full text: Pdf

Higher-level parallel programming languages can be difficult to implement efficiently on parallel machines. This paper shows how a flexible, compiler-controlled memory system can help achieve good performance for language constructs that previously appeared ... expand
The performance advantages of integrating block data transfer in cache-coherent multiprocessors
Steven Cameron Woo, Jaswinder Pal Singh, John L. Hennessy
Pages: 219 - 229
doi>10.1145/195473.195547
Full text: Pdf

Integrating support for block data transfer has become an important emphasis in recent cache-coherent shared address space multiprocessors. This paper examines the potential performance benefits of adding this support. A set of ambitious hardware mechanisms ... expand
Improving the accuracy of static branch prediction using branch correlation
Cliff Young, Michael D. Smith
Pages: 232 - 241
doi>10.1145/195473.195549
Full text: Pdf

Recent work in history-based branch prediction uses novel hardware structures to capture branch correlation and increase branch prediction accuracy. We present a profile-based code transformation that exploits branch correlation to improve the accuracy ... expand
Reducing branch costs via branch alignment
Brad Calder, Dirk Grunwald
Pages: 242 - 251
doi>10.1145/195473.195553
Full text: Pdf

Several researchers have proposed algorithms for basic block reordering. We call these branch alignment algorithms. The primary emphasis of these algorithms has been on improving instruction cache locality, and the few studies concerned ... expand
Compiler optimizations for improving data locality
Steve Carr, Kathryn S. McKinley, Chau-Wen Tseng
Pages: 252 - 262
doi>10.1145/195473.195557
Full text: Pdf

In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, ... expand
DCG: an efficient, retargetable dynamic code generation system
Dawson R. Engler, Todd A. Proebsting
Pages: 263 - 272
doi>10.1145/195473.195567
Full text: Pdf

Dynamic code generation allows aggressive optimization through the use of runtime information. Previous systems typically relied on ad hoc code generators that were not designed for retargetability, and did not shield the client from machine-specific ... expand
The performance impact of flexibility in the Stanford FLASH multiprocessor
Mark Heinrich, Jeffrey Kuskin, David Ofelt, John Heinlein, Joel Baxter, Jaswinder Pal Singh, Richard Simoni, Kourosh Gharachorloo, David Nakahira, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, John Hennessy
Pages: 274 - 285
doi>10.1145/195473.195569
Full text: Pdf

A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford ... expand
Simple compiler algorithms to reduce ownership overhead in cache coherence protocols
Jonas Skeppstedt, Per Stenström
Pages: 286 - 296
doi>10.1145/195473.195572
Full text: Pdf

We study in this paper the design and efficiency of compiler algorithms that remove ownership overhead in shared-memory multiprocessors with write-invalidate protocols. These algorithms detect loads followed by stores to the same address. Such loads ... expand
Fine-grain access control for distributed shared memory
Ioannis Schoinas, Babak Falsafi, Alvin R. Lebeck, Steven K. Reinhardt, James R. Larus, David A. Wood
Pages: 297 - 306
doi>10.1145/195473.195575
Full text: Pdf

This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper ... expand
Interleaving: a multithreading technique targeting multiprocessors and workstations
James Laudon, Anoop Gupta, Mark Horowitz
Pages: 308 - 318
doi>10.1145/195473.195576
Full text: Pdf

There is an increasing trend to use commodity microprocessors as the compute engines in large-scale multiprocessors. However, given that the majority of the microprocessors are sold in the workstation market, not in the multiprocessor market, it is only ... expand
Hardware support for fast capability-based addressing
Nicholas P. Carter, Stephen W. Keckler, William J. Dally
Pages: 319 - 327
doi>10.1145/195473.195579
Full text: Pdf

Traditional methods of providing protection in memory systems do so at the cost of increased context switch time and/or increased storage to record access permissions for processes. With the advent of computers that supported cycle-by-cycle multithreading, ... expand
The effectiveness of multiple hardware contexts
Radhika Thekkath, Susan J. Eggers
Pages: 328 - 337
doi>10.1145/195473.195583
Full text: Pdf

Multithreaded processors are used to tolerate long memory latencies. By executing threads loaded in multiple hardware contexts, an otherwise idle processor can keep busy, thus increasing its utilization. However, the larger size of a multi-thread working ... expand

1996

The case for a single-chip multiprocessor
Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, Kunyung Chang
Pages: 2 - 11
doi>10.1145/237090.237140
Full text: Pdf

Advances in IC processing allow for more microprocessor design options. The increasing gate density and cost of wires in advanced integrated circuit technologies require that we look for new ways to use their capabilities effectively. This paper shows ... expand
An evaluation of memory consistency models for shared-memory systems with ILP processors
Vijay S. Pai, Parthasarathy Ranganathan, Sarita V. Adve, Tracy Harton
Pages: 12 - 23
doi>10.1145/237090.237142
Full text: Pdf

Relaxed consistency models have been shown to significantly outperform sequential consistency for single-issue, statically scheduled processors with blocking reads. However, current microprocessors aggressively exploit instruction-level parallelism (ILP) ... expand
Synchronization and communication in the T3E multiprocessor
Steven L. Scott
Pages: 26 - 36
doi>10.1145/237090.237144
Full text: Pdf

This paper describes the synchronization and communication primitives of the Cray T3E multiprocessor, a shared memory system scalable to 2048 processors. We discuss what we have learned from the T3D project (the predecessor to the T3E) and the rationale ... expand
Evaluation of architectural support for global address-based communication in large-scale parallel machines
Arvind Krishnamurthy, Klaus E. Schauser, Chris J. Scheiman, Randolph Y. Wang, David E. Culler, Katherine Yelick
Pages: 37 - 48
doi>10.1145/237090.237147
Full text: Pdf

Large-scale parallel machines are incorporating increasingly sophisticated architectural support for user-level messaging and global memory access. We provide a systematic evaluation of a broad spectrum of current design alternatives based on our implementations ... expand
Whole-program optimization for time and space efficient threads
Dirk Grunwald, Richard Neves
Pages: 50 - 59
doi>10.1145/237090.237149
Full text: Pdf

Modern languages and operating systems often encourage programmers to use threads, or independent control streams, to mask the overhead of some operations and simplify program structure. Multitasking operating systems use threads to mask communication ... expand
Thread scheduling for cache locality
James Philbin, Jan Edler, Otto J. Anshus, Craig C. Douglas, Kai Li
Pages: 60 - 71
doi>10.1145/237090.237151
Full text: Pdf

This paper describes a method to improve the cache locality of sequential programs by scheduling fine-grained threads. The algorithm relies upon hints provided at the time of thread creation to determine a thread execution order likely to reduce cache ... expand
The Rio file cache: surviving operating system crashes
Peter M. Chen, Wee Teck Ng, Subhachandra Chandra, Christopher Aycock, Gurushankar Rajamani, David Lowell
Pages: 74 - 83
doi>10.1145/237090.237154
Full text: Pdf

One of the fundamental limits to high-performance, high-reliability file systems is memory's vulnerability to system crashes. Because memory is viewed as unsafe, systems periodically write data back to disk. The extra disk traffic lowers performance, ... expand
Petal: distributed virtual disks
Edward K. Lee, Chandramohan A. Thekkath
Pages: 84 - 92
doi>10.1145/237090.237157
Full text: Pdf

The ideal storage system is globally accessible, always available, provides unlimited performance and capacity for a large number of clients, and requires no management. This paper describes the design, implementation, and performance of Petal, a system ... expand
A quantitative analysis of loop nest locality
Kathryn S. McKinley, Olivier Temam
Pages: 94 - 104
doi>10.1145/237090.237161
Full text: Pdf

This paper analyzes and quantifies the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in nests, the vast majority ... expand
The intrinsic bandwidth requirements of ordinary programs
Andrew S. Huang, John Paul Shen
Pages: 105 - 114
doi>10.1145/237090.237163
Full text: Pdf

While there has been an abundance of recent papers on hardware and software approaches to improving the performance of memory accesses, few papers have addressed the problem from the program's point of view. There is a general notion that certain programs ... expand
Multiple-block ahead branch predictors
André Seznec, Stéphan Jourdan, Pascal Sainrat, Pierre Michaud
Pages: 116 - 127
doi>10.1145/237090.237169
Full text: Pdf

A basic rule in computer architecture is that a processor cannot execute an application faster than it fetches its instructions. This paper presents a novel cost-effective mechanism called the two-block ahead branch predictor. Information from the current ... expand
Analysis of branch prediction via data compression
I-Cheng K. Chen, John T. Coffey, Trevor N. Mudge
Pages: 128 - 137
doi>10.1145/237090.237171
Full text: Pdf

Branch prediction is an important mechanism in modern microprocessor design. The focus of research in this area has been on designing new branch prediction schemes. In contrast, very few studies address the theoretical basis behind these prediction schemes. ... expand
Value locality and load value prediction
Mikko H. Lipasti, Christopher B. Wilkerson, John Paul Shen
Pages: 138 - 147
doi>10.1145/237090.237173
Full text: Pdf

Since the introduction of virtual memory demand-paging and cache memories, computer systems have been exploiting spatial and temporal locality to reduce the average latency of a memory reference. In this paper, we introduce the notion of value locality, ... expand
The structure and performance of interpreters
Theodore H. Romer, Dennis Lee, Geoffrey M. Voelker, Alec Wolman, Wayne A. Wong, Jean-Loup Baer, Brian N. Bershad, Henry M. Levy
Pages: 150 - 159
doi>10.1145/237090.237175
Full text: Pdf

Interpreted languages have become increasingly popular due to demands for rapid program development, ease of use, portability, and safety. Beyond the general impression that they are "slow," however, little has been documented about the performance of ... expand
Adapting to network and client variability via on-demand dynamic distillation
Armando Fox, Steven D. Gribble, Eric A. Brewer, Elan Amir
Pages: 160 - 170
doi>10.1145/237090.237177
Full text: Pdf

The explosive growth of the Internet and the proliferation of smart cellular phones and handheld wireless devices is widening an already large gap between Internet clients. Clients vary in their hardware resources, software sophistication, and quality ... expand
Shasta: a low overhead, software-only approach for supporting fine-grain shared memory
Daniel J. Scales, Kourosh Gharachorloo, Chandramohan A. Thekkath
Pages: 174 - 185
doi>10.1145/237090.237179
Full text: Pdf

This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that shared ... expand
An integrated compile-time/run-time software distributed shared memory system
Sandhya Dwarkadas, Alan L. Cox, Willy Zwaenepoel
Pages: 186 - 197
doi>10.1145/237090.237181
Full text: Pdf

On a distributed memory machine, hand-coded message passing leads to the most efficient execution, but it is difficult to use. Parallelizing compilers can approach the performance of hand-coded message passing by translating data-parallel programs into ... expand
Hiding communication latency and coherence overhead in software DSMs
R. Bianchini, L. I. Kontothanassis, R. Pinto, M. De Maria, M. Abud, C. L. Amorim
Pages: 198 - 209
doi>10.1145/237090.237185
Full text: Pdf

In this paper we propose the use of a PCI-based programmable protocol controller for hiding communication and coherence overheads in software DSMs. Our protocol controller provides three different types of overhead tolerance: a) moving basic communication ... expand
SoftFLASH: analyzing the performance of clustered distributed virtual shared memory
Andrew Erlichson, Neal Nuckolls, Greg Chesson, John Hennessy
Pages: 210 - 220
doi>10.1145/237090.237187
Full text: Pdf

One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment ... expand
Compiler-based prefetching for recursive data structures
Chi-Keung Luk, Todd C. Mowry
Pages: 222 - 233
doi>10.1145/237090.237190
Full text: Pdf

Software-controlled data prefetching offers the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. While prefetching has enjoyed considerable success in array-based numeric codes, ... expand
Exploiting dual data-memory banks in digital signal processors
Mazen A. R. Saghir, Paul Chow, Corinna G. Lee
Pages: 234 - 243
doi>10.1145/237090.237193
Full text: Pdf

Over the past decade, digital signal processors (DSPs) have emerged as the processors of choice for implementing embedded applications in high-volume consumer products. Through their use of specialized hardware features and small chip areas, DSPs provide ... expand
Compiler-directed page coloring for multiprocessors
Edouard Bugnion, Jennifer M. Anderson, Todd C. Mowry, Mendel Rosenblum, Monica S. Lam
Pages: 244 - 255
doi>10.1145/237090.237195
Full text: Pdf

This paper presents a new technique, compiler-directed page coloring, that eliminates conflict misses in multiprocessor applications. It enables applications to make better use of the increased aggregate cache size available in a multiprocessor. ... expand
Reducing network latency using subpages in a global memory environment
Hervé A. Jamrozik, Michael J. Feeley, Geoffrey M. Voelker, James Evans, II, Anna R. Karlin, Henry M. Levy, Mary K. Vernon
Pages: 258 - 267
doi>10.1145/237090.237198
Full text: Pdf

New high-speed networks greatly encourage the use of network memory as a cache for virtual memory and file pages, thereby reducing the need for disk access. Because pages are the fundamental transfer and access units in remote memory systems, page size ... expand
Improving cache performance with balanced tag and data paths
Jih-Kwon Peir, Windsor W. Hsu, Honesty Young, Shauchi Ong
Pages: 268 - 278
doi>10.1145/237090.237202
Full text: Pdf

There are two concurrent paths in a typical cache access --- one through the data array and the other through the tag array. The path through the data array drives the selected set out of the array. The path through the tag array determines cache hit/miss ... expand
Operating system support for improving data locality on CC-NUMA compute servers
Ben Verghese, Scott Devine, Anoop Gupta, Mendel Rosenblum
Pages: 279 - 289
doi>10.1145/237090.237205
Full text: Pdf

The dominant architecture for the next generation of shared-memory multiprocessors is CC-NUMA (cache-coherent non-uniform memory architecture). These machines are attractive as compute servers because they provide transparent access to local and remote ... expand

1998


Compiler-controlled memory
Keith D. Cooper, Timothy J. Harvey
Pages: 2 - 11
doi>10.1145/291069.291010
Full text: Pdf

Optimizations aimed at reducing the impact of memory operations on execution speed have long concentrated on improving cache performance. These efforts achieve a reasonable level of success. The primary limit on the compiler's ability to improve memory ... expand
Segregating heap objects by reference behavior and lifetime
Matthew L. Seidl, Benjamin G. Zorn
Pages: 12 - 23
doi>10.1145/291069.291012
Full text: Pdf

Dynamic storage allocation has become increasingly important in many applications, in part due to the use of the object-oriented paradigm. At the same time, processor speeds are increasing faster than memory speeds and programs are increasing in size ... expand
Schedule-independent storage mapping for loops
Michelle Mills Strout, Larry Carter, Jeanne Ferrante, Beth Simon
Pages: 24 - 33
doi>10.1145/291069.291015
Full text: Pdf

This paper studies the relationship between storage requirements and performance. Storage-related dependences inhibit optimizations for locality and parallelism. Techniques such as renaming and array expansion can eliminate all storage-related dependences, ... expand
An empirical analysis of instruction repetition
Avinash Sodani, Gurindar S. Sohi
Pages: 35 - 45
doi>10.1145/291069.291016
Full text: Pdf

We study the phenomenon of instruction repetition, where the inputs and outputs of multiple dynamic instances of a static instruction are repeated. We observe that over 80% of the dynamic instructions executed in several programs are repeated and most ... expand
Space-time scheduling of instruction-level parallelism on a raw machine
Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek Sarkar, Saman Amarasinghe
Pages: 46 - 57
doi>10.1145/291069.291018
Full text: Pdf

Increasing demand for both greater parallelism and faster clocks dictate that future generation architectures will need to decentralize their resources and eliminate primitives that require single cycle global communication. A Raw microprocessor distributes ... expand
Data speculation support for a chip multiprocessor
Lance Hammond, Mark Willey, Kunle Olukotun
Pages: 58 - 69
doi>10.1145/291069.291020
Full text: Pdf

Thread-level speculation is a technique that enables parallel execution of sequential applications on a multiprocessor. This paper describes the complete implementation of the support for thread-level speculation on the Hydra chip multiprocessor (CMP). ... expand
VISA: Netstation's virtual Internet SCSI adapter
Rodney Van Meter, Gregory G. Finn, Steve Hotz
Pages: 71 - 80
doi>10.1145/291069.291023
Full text: Pdf

In this paper we describe the implementation of VISA, our Virtual Internet SCSI Adapter. VISA was built to evaluate the performance impact on the host operating system of using IP to communicate with peripherals, especially storage devices. We have built ... expand
Active disks: programming model, algorithms and evaluation
Anurag Acharya, Mustafa Uysal, Joel Saltz
Pages: 81 - 91
doi>10.1145/291069.291026
Full text: Pdf

Several application and technology trends indicate that it might be both profitable and feasible to move computation closer to the data that it processes. In this paper, we evaluate Active Disk architectures which integrate significant processing ... expand
A cost-effective, high-bandwidth storage architecture
Garth A. Gibson, David F. Nagle, Khalil Amiri, Jeff Butler, Fay W. Chang, Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, Jim Zelenka
Pages: 92 - 103
doi>10.1145/291069.291029
Full text: Pdf

This paper describes the Network-Attached Secure Disk (NASD) storage architecture, prototype implementations of NASD drives, array management for our architecture, and three filesystems built on our prototype. NASD provides scalable storage bandwidth ... expand
Hardware-software trade-offs in a direct Rambus implementation of the RAMpage memory hierarchy
Philip Machanick, Pierre Salverda, Lance Pompe
Pages: 105 - 114
doi>10.1145/291069.291032
Full text: Pdf

The RAMpage memory hierarchy is an alternative to the traditional division between cache and main memory: main memory is moved up a level and DRAM is used as a paging device. The idea behind RAMpage is to reduce hardware complexity, if at the cost of ... expand
Dependence based prefetching for linked data structures
Amir Roth, Andreas Moshovos, Gurindar S. Sohi
Pages: 115 - 126
doi>10.1145/291069.291034
Full text: Pdf

We introduce a dynamic scheme that captures the access patterns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses ... expand
Performance counters and state sharing annotations: a unified approach to thread locality
Boris Weissman
Pages: 127 - 138
doi>10.1145/291069.291035
Full text: Pdf

This paper describes a combined approach for improving thread locality that uses the hardware performance monitors of modern processors and program-centric code annotations to guide thread scheduling on SMPs. The approach relies on a shared state cache ... expand
Cache-conscious data placement
Brad Calder, Chandra Krintz, Simmi John, Todd Austin
Pages: 139 - 149
doi>10.1145/291069.291036
Full text: Pdf

As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction cache performance by mapping code with temporal ... expand
An out-of-order execution technique for runtime binary translators
Bich C. Le
Pages: 151 - 158
doi>10.1145/291069.291039
Full text: Pdf

A dynamic translator emulates an instruction set architecture by translating source instructions to native code during execution. On statically-scheduled hardware, higher performance can potentially be achieved by reordering the translated instructions; ... expand
Overlapping execution with transfer using non-strict execution for mobile programs
Chandra Krintz, Brad Calder, Han Bok Lee, Benjamin G. Zorn
Pages: 159 - 169
doi>10.1145/291069.291040
Full text: Pdf

In order to execute a program on a remote computer, it must first be transferred over a network. This transmission incurs the overhead of network latency before execution can begin. This latency can vary greatly depending upon the size of the program, ... expand
Variable length path branch prediction
Jared Stark, Marius Evers, Yale N. Patt
Pages: 170 - 179
doi>10.1145/291069.291042
Full text: Pdf

Accurate branch prediction is required to achieve high performance in deeply pipelined, wide-issue processors. Recent studies have shown that conditional and indirect (or computed) branch targets can be accurately predicted by recording the path, which ... expand
Performance isolation: sharing and isolation in shared-memory multiprocessors
Ben Verghese, Anoop Gupta, Mendel Rosenblum
Pages: 181 - 192
doi>10.1145/291069.291044
Full text: Pdf

Shared-memory multiprocessors (SMPs) are being extensively used as general-purpose servers. The tight coupling of multiple processors, memory, and I/O provides enormous computing power in a single system, and enables the efficient sharing of these resources. The ... expand
UTLB: a mechanism for address translation on network interfaces
Yuqun Chen, Angelos Bilas, Stefanos N. Damianakis, Cezary Dubnicki, Kai Li
Pages: 193 - 204
doi>10.1145/291069.291046
Full text: Pdf

An important aspect of a high-speed network system is the ability to transfer data directly between the network interface and application buffers. Such a direct data path requires the network interface to "know" the virtual-to-physical address ... expand
Locality-aware request distribution in cluster-based network servers
Vivek S. Pai, Mohit Aron, Gaurav Banga, Michael Svendsen, Peter Druschel, Willy Zwaenepoel, Erich Nahum
Pages: 205 - 216
doi>10.1145/291069.291048
Full text: Pdf

We consider cluster-based network servers in which a front-end directs incoming requests to one of a number of back-ends. Specifically, we consider content-based request distribution: the front-end uses the content requested, in addition to information ... expand
Investigating optimal local memory performance
Olivier Temam
Pages: 218 - 227
doi>10.1145/291069.291050
Full text: Pdf

Recent work has demonstrated that cache space is often poorly utilized. However, no previous work has yet demonstrated upper bounds on what a cache or local memory could achieve when exploiting both spatial and temporal locality. Belady's MIN algorithm ...
Precise miss analysis for program transformations with caches of arbitrary associativity
Somnath Ghosh, Margaret Martonosi, Sharad Malik
Pages: 228 - 239
doi>10.1145/291069.291051
Full text: Pdf

Analyzing and optimizing program memory performance is a pressing problem in high-performance computer architectures. Currently, software solutions addressing the processor-memory performance gap include compiler- or programmer-applied optimizations like ...
Capturing dynamic memory reference behavior with adaptive cache topology
Jih-Kwon Peir, Yongjoon Lee, Windsor W. Hsu
Pages: 240 - 250
doi>10.1145/291069.291053
Full text: Pdf

Memory references exhibit locality and are therefore not uniformly distributed across the sets of a cache. This skew reduces the effectiveness of a cache because it results in the caching of a considerable number of less-recently-used lines which are ...
Accelerating multi-media processing by implementing memoing in multiplication and division units
Daniel Citron, Dror Feitelson, Larry Rudolph
Pages: 252 - 261
doi>10.1145/291069.291056
Full text: Pdf

This paper proposes a technique that enables performing multi-cycle (multiplication, division, square-root …) computations in a single cycle. The technique is based on the notion of memoing: saving the input and output of previous calculations ...
Value speculation scheduling for high performance processors
Chao-Ying Fu, Matthew D. Jennings, Sergei Y. Larin, Thomas M. Conte
Pages: 262 - 271
doi>10.1145/291069.291058
Full text: Pdf

Recent research in value prediction shows a surprising amount of predictability for the values produced by register-writing instructions. Several hardware based value predictor designs have been proposed to exploit this predictability by eliminating ...
An empirical study of decentralized ILP execution models
Narayan Ranganathan, Manoj Franklin
Pages: 272 - 281
doi>10.1145/291069.291061
Full text: Pdf

Recent fascination for dynamic scheduling as a means for exploiting instruction-level parallelism has introduced significant interest in the scalability aspects of dynamic scheduling hardware. In order to overcome the scalability problems of centralized ...
Fast out-of-order processor simulation using memoization
Eric Schnarr, James R. Larus
Pages: 283 - 294
doi>10.1145/291069.291063
Full text: Pdf

Our new out-of-order processor simulator, FastSim, uses two innovations to speed up simulation 8--15 times (vs. Wisconsin SimpleScalar) with no loss in simulation accuracy. First, FastSim uses speculative direct-execution to accelerate the functional ...
A look at several memory management units, TLB-refill mechanisms, and page table organizations
Bruce L. Jacob, Trevor N. Mudge
Pages: 295 - 306
doi>10.1145/291069.291065
Full text: Pdf

Virtual memory is a staple in modern systems, though there is little agreement on how its functionality is to be implemented on either the hardware or software side of the interface. The myriad of design choices and incompatible hardware mechanisms suggests ...
Performance of database workloads on shared-memory systems with out-of-order processors
Parthasarathy Ranganathan, Kourosh Gharachorloo, Sarita V. Adve, Luiz André Barroso
Pages: 307 - 318
doi>10.1145/291069.291067
Full text: Pdf

Database applications such as online transaction processing (OLTP) and decision support systems (DSS) constitute the largest and fastest-growing segment of the market for multiprocessor servers. However, most current system designs have been optimized ...

2000

Designing computer systems with MEMS-based storage
Steven W. Schlosser, John Linwood Griffin, David F. Nagle, Gregory R. Ganger
Pages: 1 - 12
doi>10.1145/378993.378996
Full text: Pdf

For decades the RAM-to-disk memory hierarchy gap has plagued computer architects. An exciting new storage technology based on microelectromechanical systems (MEMS) is poised to fill a large portion of this performance gap, significantly reduce system ...
Architecture and design of AlphaServer GS320
Kourosh Gharachorloo, Madhu Sharma, Simon Steely, Stephen Van Doren
Pages: 13 - 24
doi>10.1145/378993.378997
Full text: Pdf

This paper describes the architecture and implementation of the AlphaServer GS320, a cache-coherent non-uniform memory access multiprocessor developed at Compaq. The AlphaServer GS320 architecture is specifically targeted at medium-scale multiprocessing ...
Timestamp snooping: an approach for extending SMPs
Milo M. K. Martin, Daniel J. Sorin, Anastassia Ailamaki, Alaa R. Alameldeen, Ross M. Dickson, Carl J. Mauer, Kevin E. Moore, Manoj Plakal, Mark D. Hill, David A. Wood
Pages: 25 - 36
doi>10.1145/378993.378998
Full text: Pdf

Symmetric multiprocessor (SMP) servers provide superior performance for the commercial workloads that dominate the Internet. Our simulation results show that over one-third of cache misses by these applications result in cache-to-cache transfers, where ...
MemorIES: a programmable, real-time hardware emulation tool for multiprocessor server design
Ashwini Nanda, Kwok-Ken Mak, Krishnan Sugavanam, Ramendra K. Sahoo, Vijayaraghavan Soundararajan, T. Basil Smith
Pages: 37 - 48
doi>10.1145/378993.378999
Full text: Pdf

Modern system design often requires multiple levels of simulation for design validation and performance debugging. However, while machines have gotten faster, and simulators have become more detailed, simulation speeds have not tracked machine speeds, ...
FLASH vs. (Simulated) FLASH: closing the simulation loop
Jeff Gibson, Robert Kunz, David Ofelt, Mark Horowitz, John Hennessy, Mark Heinrich
Pages: 49 - 58
doi>10.1145/378993.379000
Full text: Pdf

Simulation is the primary method for evaluating computer systems during all phases of the design process. One significant problem with simulation is that it rarely models the system exactly, and quantifying the resulting simulator error can be difficult. ...
Using meta-level compilation to check FLASH protocol code
Andy Chou, Benjamin Chelf, Dawson Engler, Mark Heinrich
Pages: 59 - 70
doi>10.1145/378993.379002
Full text: Pdf

Building systems such as OS kernels and embedded software is difficult. An important source of this difficulty is the numerous rules they must obey: interrupts cannot be disabled for "too long," global variables must be protected by locks, user pointers ...
Evaluating design alternatives for reliable communication on high-speed networks
Raoul A. F. Bhoedjang, Kees Verstoep, Tim Rühl, Henri E. Bal, Rutger F. H. Hofman
Pages: 71 - 81
doi>10.1145/378993.379004
Full text: Pdf

We systematically evaluate the performance of five implementations of a single, user-level communication interface. Each implementation makes different architectural assumptions about the reliability of the network hardware and the capabilities of the ...
Communication scheduling
Peter Mattson, William J. Dally, Scott Rixner, Ujval J. Kapasi, John D. Owens
Pages: 82 - 92
doi>10.1145/378993.379005
Full text: Pdf

The high arithmetic rates of media processing applications require architectures with tens to hundreds of functional units, multiple register files, and explicit interconnect between functional units and register files. Communication scheduling enables ...
System architecture directions for networked sensors
Jason Hill, Robert Szewczyk, Alec Woo, Seth Hollar, David Culler, Kristofer Pister
Pages: 93 - 104
doi>10.1145/378993.379006
Full text: Pdf

Technological progress in integrated, low-power, CMOS communication devices and sensors makes a rich design space of networked sensors viable. They can be deeply embedded in the physical world and spread throughout our environment like smart dust. The ...
Power aware page allocation
Alvin R. Lebeck, Xiaobo Fan, Heng Zeng, Carla Ellis
Pages: 105 - 116
doi>10.1145/378993.379007
Full text: Pdf

One of the major challenges of post-PC computing is the need to reduce energy consumption, thereby extending the lifetime of the batteries that power these mobile devices. Memory is a particularly important target for efforts to improve energy efficiency. ...
Hoard: a scalable memory allocator for multithreaded applications
Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, Paul R. Wilson
Pages: 117 - 128
doi>10.1145/378993.379232
Full text: Pdf

Parallel, multithreaded C and C++ programs such as web servers, database managers, news servers, and scientific applications are becoming increasingly prevalent. For these applications, the memory allocator is often a bottleneck that severely limits ...
Thread-level parallelism and interactive performance of desktop applications
Kristián Flautner, Rich Uhlig, Steve Reinhardt, Trevor Mudge
Pages: 129 - 138
doi>10.1145/378993.379233
Full text: Pdf

Multiprocessing is already prevalent in servers where multiple clients present an obvious source of thread-level parallelism. However, the case for multiprocessing is less clear for desktop applications. Nevertheless, architects are designing processors ...
Effective null pointer check elimination utilizing hardware trap
Motohiro Kawahito, Hideaki Komatsu, Toshio Nakatani
Pages: 139 - 149
doi>10.1145/378993.379234
Full text: Pdf

We present a new algorithm for eliminating null pointer checks from programs written in Java™. Our new algorithm is split into two phases. In the first phase, it moves null checks backward, and it is iterated for a few times with other optimizations ...
Frequent value locality and value-centric data cache design
Youtao Zhang, Jun Yang, Rajiv Gupta
Pages: 150 - 159
doi>10.1145/378993.379235
Full text: Pdf

By studying the behavior of programs in the SPECint95 suite we observed that six out of eight programs exhibit a new kind of value locality, the frequent value locality, according to which a few values appear very frequently in memory locations ...
Efficient and flexible value sampling
M. Burrows, U. Erlingsson, S-T. A. Leung, M. T. Vandevoorde, C. A. Waldspurger, K. Walker, W. E. Weihl
Pages: 160 - 167
doi>10.1145/378993.379236
Full text: Pdf

This paper presents novel sampling-based techniques for collecting statistical profiles of register contents, data values, and other information associated with instructions, such as memory latencies. Values of interest are sampled in response to periodic ...
Architectural support for copy and tamper resistant software
David Lie, Chandramohan Thekkath, Mark Mitchell, Patrick Lincoln, Dan Boneh, John Mitchell, Mark Horowitz
Pages: 168 - 177
doi>10.1145/378993.379237
Full text: Pdf

Although there have been attempts to develop code transformations that yield tamper-resistant software, no reliable software-only methods are known. This paper studies the hardware implementation of a form of execute-only memory (XOM) that allows instructions ...
Architectural support for fast symmetric-key cryptography
Jerome Burke, John McDonald, Todd Austin
Pages: 178 - 189
doi>10.1145/378993.379238
Full text: Pdf

The emergence of the Internet as a trusted medium for commerce and communication has made cryptography an essential component of modern information systems. Cryptography provides the mechanisms necessary to implement accountability, accuracy, and confidentiality ...
OceanStore: an architecture for global-scale persistent storage
John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Chris Wells, Ben Zhao
Pages: 190 - 201
doi>10.1145/378993.379239
Full text: Pdf

OceanStore is a utility infrastructure designed to span the globe and provide continuous access to persistent information. Since this infrastructure is comprised of untrusted servers, data is protected through redundancy and cryptographic techniques. ...
Software profiling for hot path prediction: less is more
Evelyn Duesterwald, Vasanth Bala
Pages: 202 - 211
doi>10.1145/378993.379241
Full text: Pdf

Recently, there has been a growing interest in exploiting profile information in adaptive systems such as just-in-time compilers, dynamic optimizers, and binary translators. In this paper, we show that sophisticated software profiling schemes that provide ...
OS and compiler considerations in the design of the IA-64 architecture
Rumi Zahir, Jonathan Ross, Dale Morris, Drew Hess
Pages: 212 - 221
doi>10.1145/378993.379242
Full text: Pdf

Increasing demands for processor performance have outstripped the pace of process and frequency improvements, pushing designers to find ways of increasing the amount of work that can be processed in parallel. Traditional RISC architectures use hardware ...
Hardware support for dynamic activation of compiler-directed computation reuse
Daniel A. Connors, Hillery C. Hunter, Ben-Chung Cheng, Wen-mei W. Hwu
Pages: 222 - 233
doi>10.1145/378993.379243
Full text: Pdf

Compiler-directed Computation Reuse (CCR) enhances program execution speed and efficiency by eliminating dynamic computation redundancy. In this approach, the compiler designates large program regions for potential reuse. During run time, the execution ...
Symbiotic jobscheduling for a simultaneous multithreaded processor
Allan Snavely, Dean M. Tullsen
Pages: 234 - 244
doi>10.1145/378993.379244
Full text: Pdf

Simultaneous Multithreading machines fetch and execute instructions from multiple instruction streams to increase system utilization and speed up the execution of jobs. When there are more jobs in the system than there is hardware to support simultaneous ...
An analysis of operating system behavior on a simultaneous multithreaded architecture
Joshua A. Redstone, Susan J. Eggers, Henry M. Levy
Pages: 245 - 256
doi>10.1145/378993.379245
Full text: Pdf

This paper presents the first analysis of operating system execution on a simultaneous multithreaded (SMT) processor. While SMT has been studied extensively over the past 6 years, previous research has focused entirely on user-mode execution. However, ...
Slipstream processors: improving both performance and fault tolerance
Karthik Sundaramoorthy, Zach Purser, Eric Rotenberg
Pages: 257 - 268
doi>10.1145/378993.379247
Full text: Pdf

Processors execute the full dynamic instruction stream to arrive at the final output of a program, yet there exist shorter instruction streams that produce the same overall effect. We propose creating a shorter but otherwise equivalent version of the ...

2002

Keynote address: Sensor network research: emerging challenges for architecture, systems, and languages
Deborah Estrin
Pages: 1 - 4
doi>10.1145/605397.1090192
SESSION: Multiprocessor synchronization and speculation
Transactional lock-free execution of lock-based programs
Ravi Rajwar, James R. Goodman
Pages: 5 - 17
doi>10.1145/605397.605399
Full text: Pdf

This paper is motivated by the difficulty in writing correct high-performance programs. Writing shared-memory multi-threaded programs imposes a complex trade-off between programming ease and performance, largely due to subtleties in coordinating access ...
Speculative synchronization: applying thread-level speculation to explicitly parallel applications
José F. Martínez, Josep Torrellas
Pages: 18 - 29
doi>10.1145/605397.605400
Full text: Pdf

Barriers, locks, and flags are synchronizing operations widely used by programmers and parallelizing compilers to produce race-free parallel programs. Oftentimes, these operations are placed suboptimally, either because of conservative assumptions about ...
Temporally silent stores
Kevin M. Lepak, Mikko H. Lipasti
Pages: 30 - 41
doi>10.1145/605397.605401
Full text: Pdf

Recent work has shown that silent stores--stores which write a value matching the one already stored at the memory location--occur quite frequently and can be exploited to reduce memory traffic and improve performance. This paper extends the definition ...
SESSION: System performance and optimization
Automatically characterizing large scale program behavior
Timothy Sherwood, Erez Perelman, Greg Hamerly, Brad Calder
Pages: 45 - 57
doi>10.1145/605397.605403
Full text: Pdf

Understanding program behavior is at the foundation of computer architecture and program optimization. Many programs have wildly different behavior on even the very largest of scales (over the complete execution of the program). This realization has ...
Bytecode fetch optimization for a Java interpreter
Kazunori Ogata, Hideaki Komatsu, Toshio Nakatani
Pages: 58 - 67
doi>10.1145/605397.605404
Full text: Pdf

Interpreters play an important role in many languages, and their performance is critical particularly for the popular language Java. The performance of the interpreter is important even for high-performance virtual machines that employ just-in-time compiler ...
Understanding and improving operating system effects in control flow prediction
Tao Li, Lizy Kurian John, Anand Sivasubramaniam, N. Vijaykrishnan, Juan Rubio
Pages: 68 - 80
doi>10.1145/605397.605405
Full text: Pdf

Many modern applications result in a significant operating system (OS) component. The OS component has several implications including affecting the control flow transfer in the execution environment. This paper focuses on understanding the operating ...
SESSION: Emerging systems
Maté: a tiny virtual machine for sensor networks
Philip Levis, David Culler
Pages: 85 - 95
doi>10.1145/605397.605407
Full text: Pdf

Composed of tens of thousands of tiny devices with very limited resources ("motes"), sensor networks are subject to novel systems problems and constraints. The large number of motes in a sensor network means that there will often be some failing nodes; ...
Energy-efficient computing for wildlife tracking: design tradeoffs and early experiences with ZebraNet
Philo Juang, Hidekazu Oki, Yong Wang, Margaret Martonosi, Li Shiuan Peh, Daniel Rubenstein
Pages: 96 - 107
doi>10.1145/605397.605408
Full text: Pdf

Over the past decade, mobile computing and wireless communication have become increasingly important drivers of many new computing applications. The field of wireless sensor networks particularly focuses on applications involving autonomous use of compute, ...
Enabling trusted software integrity
Darko Kirovski, Milenko Drinić, Miodrag Potkonjak
Pages: 108 - 120
doi>10.1145/605397.605409
Full text: Pdf

Preventing execution of unauthorized software on a given computer plays a pivotal role in system security. The key problem is that although a program at the beginning of its execution can be verified as authentic, while running, its execution flow can ...
SESSION: Energy efficient systems
ECOSystem: managing energy as a first class operating system resource
Heng Zeng, Carla S. Ellis, Alvin R. Lebeck, Amin Vahdat
Pages: 123 - 132
doi>10.1145/605397.605411
Full text: Pdf

Energy consumption has recently been widely recognized as a major challenge of computer systems design. This paper explores how to support energy as a first-class operating system resource. Energy, because of its global system nature, presents challenges ...
Cool-Mem: combining statically speculative memory accessing with selective address translation for energy efficiency
Raksit Ashok, Saurabh Chheda, Csaba Andras Moritz
Pages: 133 - 143
doi>10.1145/605397.605412
Full text: Pdf

This paper presents Cool-Mem, a family of memory system architectures that integrate conventional memory system mechanisms, energy-aware address translation, and compiler-enabled cache disambiguation techniques, to reduce energy consumption in general ...
Joint local and global hardware adaptations for energy
Ruchira Sasanka, Christopher J. Hughes, Sarita V. Adve
Pages: 144 - 155
doi>10.1145/605397.605413
Full text: Pdf

This work concerns algorithms to control energy-driven architecture adaptations for multimedia applications, without and with dynamic voltage scaling (DVS). We identify a broad design space for adaptation control algorithms based on two attributes: (1) ...
SESSION: Speculative threads
Design and evaluation of compiler algorithms for pre-execution
Dongkeun Kim, Donald Yeung
Pages: 159 - 170
doi>10.1145/605397.605415
Full text: Pdf

Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of ...
Compiler optimization of scalar value communication between speculative threads
Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan, Todd C. Mowry
Pages: 171 - 183
doi>10.1145/605397.605416
Full text: Pdf

While there have been many recent proposals for hardware that supports Thread-Level Speculation (TLS), there has been relatively little work on compiler optimizations to fully exploit this potential for parallelizing programs optimistically. In ...
Enhancing software reliability with speculative threads
Jeffrey Oplinger, Monica S. Lam
Pages: 184 - 196
doi>10.1145/605397.605417
Full text: Pdf

This paper advocates the use of a monitor-and-recover programming paradigm to enhance the reliability of software, and proposes an architectural design that allows software and hardware to cooperate in making this paradigm more efficient and easier to ...
SESSION: Computer architecture
Dynamic dead-instruction detection and elimination
J. Adam Butts, Guri Sohi
Pages: 199 - 210
doi>10.1145/605397.605419
Full text: Pdf

We observe a non-negligible fraction--3 to 16% in our benchmarks--of dynamically dead instructions, dynamic instruction instances that generate unused results. The majority of these instructions arise from static instructions that also produce ...
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
Changkyu Kim, Doug Burger, Stephen W. Keckler
Pages: 211 - 222
doi>10.1145/605397.605420
Full text: Pdf

Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the ...
A comparative study of arbitration algorithms for the Alpha 21364 pipelined router
Shubhendu S. Mukherjee, Federico Silla, Peter Bannon, Joel Emer, Steve Lang, David Webb
Pages: 223 - 234
doi>10.1145/605397.605421
Full text: Pdf

Interconnection networks usually consist of a fabric of interconnected routers, which receive packets arriving at their input ports and forward them to appropriate output ports. Unfortunately, network packets moving through these routers are often delayed ...
SESSION: Communication abstractions and optimizations
Increasing web server throughput with network interface data caching
Hyong-youb Kim, Vijay S. Pai, Scott Rixner
Pages: 239 - 250
doi>10.1145/605397.605423
Full text: Pdf

This paper introduces network interface data caching, a new technique to reduce local interconnect traffic on networking servers by caching frequently-requested content on a programmable network interface. The operating system on the host CPU determines ...
Programming language optimizations for modular router configurations
Eddie Kohler, Robert Morris, Benjie Chen
Pages: 251 - 263
doi>10.1145/605397.605424
Full text: Pdf

Networking systems such as Ensemble, the x-kernel, Scout, and Click achieve flexibility by building routers and other packet processors from modular components. Unfortunately, component designs are often slower than purpose-built code, and routers ...
Evolving RPC for active storage
Muthian Sivathanu, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Pages: 264 - 276
doi>10.1145/605397.605425
Full text: Pdf

We introduce Scriptable RPC (SRPC), an RPC-based framework that enables distributed system services to take advantage of active components. Technology trends point to a world where each component in a system (whether disk, network interface, or memory) ...
SESSION: Coordinating memory
A stateless, content-directed data prefetching mechanism
Robert Cooksey, Stephan Jourdan, Dirk Grunwald
Pages: 279 - 290
doi>10.1145/605397.605427
Full text: Pdf

Although central processor speeds continue to improve, improvements in overall system performance are increasingly hampered by memory latency, especially for pointer-intensive applications. To counter this loss of performance, numerous data and instruction ...
A stream compiler for communication-exposed architectures
Michael I. Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali S. Meli, Andrew A. Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, Saman Amarasinghe
Pages: 291 - 303
doi>10.1145/605397.605428
Full text: Pdf

With the increasing miniaturization of transistors, wire delays are becoming a dominant factor in microprocessor performance. To address this issue, a number of emerging architectures contain replicated processing units with software-exposed communication ...
Mondrian memory protection
Emmett Witchel, Josh Cates, Krste Asanović
Pages: 304 - 316
doi>10.1145/605397.605429
Full text: Pdf

Mondrian memory protection (MMP) is a fine-grained protection scheme that allows multiple protection domains to flexibly share memory and export protected services. In contrast to earlier page-based systems, MMP allows arbitrary permissions control at ...

2004

Programming with transactional coherence and consistency (TCC)
Lance Hammond, Brian D. Carlstrom, Vicky Wong, Ben Hertzberg, Mike Chen, Christos Kozyrakis, Kunle Olukotun
Pages: 1 - 13
doi>10.1145/1024393.1024395
Full text: Pdf

Transactional Coherence and Consistency (TCC) offers a way to simplify parallel programming by executing all code within transactions. In TCC systems, transactions serve as the fundamental unit of parallel work, communication and coherence. As each transaction ...
Spatial computation
Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein
Pages: 14 - 26
doi>10.1145/1024393.1024396
Full text: Pdf

This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized ...
An ultra low-power processor for sensor networks
Virantha Ekanayake, Clinton Kelly IV, Rajit Manohar
Pages: 27 - 36
doi>10.1145/1024393.1024397
Full text: Pdf

We present a novel processor architecture designed specifically for use in low-power wireless sensor-network nodes. Our sensor network asynchronous processor (SNAP/LE) is based on an asynchronous data-driven 16-bit RISC core with an extremely low-power ...
SESSION: Storage
D-SPTF: decentralized request distribution in brick-based storage systems
Christopher R. Lumb, Richard Golding
Pages: 37 - 47
doi>10.1145/1024393.1024399
Full text: Pdf

Distributed Shortest-Positioning Time First (D-SPTF) is a request distribution protocol for decentralized systems of storage servers. D-SPTF exploits high-speed interconnects to dynamically select which server, among those with a replica, should service ...
FAB: building distributed enterprise disk arrays from commodity components
Yasushi Saito, Svend Frølund, Alistair Veitch, Arif Merchant, Susan Spence
Pages: 48 - 58
doi>10.1145/1024393.1024400
Full text: Pdf

This paper describes the design, implementation, and evaluation of a Federated Array of Bricks (FAB), a distributed disk array that provides the reliability of traditional enterprise arrays with lower cost and better scalability. FAB is built from a ...
Deconstructing storage arrays
Timothy E. Denehy, John Bent, Florentina I. Popovici, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Pages: 59 - 71
doi>10.1145/1024393.1024401
Full text: Pdf

We introduce Shear, a user-level software tool that characterizes RAID storage arrays. Shear employs a set of controlled algorithms combined with statistical techniques to automatically determine the important properties of a RAID system, including the ...
SESSION: Security
HIDE: an infrastructure for efficiently protecting information leakage on the address bus
Xiaotong Zhuang, Tao Zhang, Santosh Pande
Pages: 72 - 84
doi>10.1145/1024393.1024403
Full text: Pdf

XOM-based secure processor has recently been introduced as a mechanism to provide copy and tamper resistant execution. XOM provides support for encryption/decryption and integrity checking. However, neither XOM nor any other current approach adequately ...
Secure program execution via dynamic information flow tracking
G. Edward Suh, Jae W. Lee, David Zhang, Srinivas Devadas
Pages: 85 - 96
doi>10.1145/1024393.1024404
Full text: Pdf

We present a simple architectural mechanism called dynamic information flow tracking that can significantly improve the security of computing systems with negligible performance overhead. Dynamic information flow tracking protects programs against malicious ...
SESSION: Architecture
Coherence decoupling: making use of incoherence
Jaehyuk Huh, Jichuan Chang, Doug Burger, Gurindar S. Sohi
Pages: 97 - 106
doi>10.1145/1024393.1024406
Full text: Pdf

This paper explores a new technique called coherence decoupling, which breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces ...
Continual flow pipelines
Srikanth T. Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, Mike Upton
Pages: 107 - 119
doi>10.1145/1024393.1024407
Full text: Pdf

Increased integration in the form of multiple processor cores on a single die, relatively constant die sizes, shrinking power envelopes, and emerging applications create a new challenge for processor architects. How to build a processor that provides ...
Scalable selective re-execution for EDGE architectures
Rajagopalan Desikan, Simha Sethumadhavan, Doug Burger, Stephen W. Keckler
Pages: 120 - 132
doi>10.1145/1024393.1024408
Full text: Pdf

Pipeline flushes are becoming increasingly expensive in modern microprocessors with large instruction windows and deep pipelines. Selective re-execution is a technique that can reduce the penalty of mis-speculations by re-executing only instructions ...
SESSION: Potpourri
HOIST: a system for automatically deriving static analyzers for embedded systems
John Regehr, Alastair Reid
Pages: 133 - 143
doi>10.1145/1024393.1024410
Full text: Pdf

Embedded software must meet conflicting requirements such as being highly reliable, running on resource-constrained platforms, and being developed rapidly. Static program analysis can help meet all of these goals. People developing analyzers for embedded ...
Helper threads via virtual multithreading on an experimental Itanium® 2 processor-based platform
Perry H. Wang, Jamison D. Collins, Hong Wang, Dongkeun Kim, Bill Greene, Kai-Ming Chan, Aamir B. Yunus, Terry Sych, Stephen F. Moore, John P. Shen
Pages: 144 - 155
doi>10.1145/1024393.1024411
Full text: Pdf

Helper threading is a technology to accelerate a program by exploiting a processor's multithreading capability to run "assist" threads. Previous experiments on hyper-threaded processors have demonstrated significant speedups by using helper threads ...
Low-overhead memory leak detection using adaptive statistical profiling
Matthias Hauswirth, Trishul M. Chilimbi
Pages: 156 - 164
doi>10.1145/1024393.1024412
Full text: Pdf

Sampling has been successfully used to identify performance optimization opportunities. We would like to apply similar techniques to check program correctness. Unfortunately, sampling provides poor coverage of infrequently executed code, where bugs often ... expand
SESSION: Memory system analysis and optimization
Locality phase prediction
Xipeng Shen, Yutao Zhong, Chen Ding
Pages: 165 - 176
doi>10.1145/1024393.1024414
Full text: Pdf

As computer memory hierarchy becomes adaptive, its performance increasingly depends on forecasting the dynamic program locality. This paper presents a method that predicts the locality phases of a program by a combination of locality profiling and run-time ... expand
Dynamic tracking of page miss ratio curve for memory management
Pin Zhou, Vivek Pandey, Jagadeesan Sundaresan, Anand Raghuraman, Yuanyuan Zhou, Sanjeev Kumar
Pages: 177 - 188
doi>10.1145/1024393.1024415
Full text: Pdf

Memory can be efficiently utilized if the dynamic memory demands of applications can be determined and analyzed at run-time. The page miss ratio curve (MRC), i.e., page miss rate vs. memory size curve, is a good performance-directed metric to serve this ... expand
Compiler orchestrated prefetching via speculation and predication
Rodric M. Rabbah, Hariharan Sandanagobalane, Mongkol Ekpanyapong, Weng-Fai Wong
Pages: 189 - 198
doi>10.1145/1024393.1024416
Full text: Pdf

This paper introduces a compiler orchestrated prefetching system as a unified framework geared toward ameliorating the gap between processing speeds and memory access latencies. We focus the scope of the optimization on specific subsets of the program ... expand
Software prefetching for mark-sweep garbage collection: hardware analysis and software redesign
Chen-Yong Cher, Antony L. Hosking, T. N. Vijaykumar
Pages: 199 - 210
doi>10.1145/1024393.1024417
Full text: Pdf

Tracing garbage collectors traverse references from live program variables, transitively tracing out the closure of live objects. Memory accesses incurred during tracing are essentially random: a given object may contain references to any other object. ... expand
SESSION: Reliability
Devirtualizable virtual machines enabling general, single-node, online maintenance
David E. Lowell, Yasushi Saito, Eileen J. Samberg
Pages: 211 - 223
doi>10.1145/1024393.1024419
Full text: Pdf

Maintenance is the dominant source of downtime at high availability sites. Unfortunately, the dominant mechanism for reducing this downtime, cluster rolling upgrade, has two shortcomings that have prevented its broad acceptance. First, cluster-style ... expand
Fingerprinting: bounding soft-error detection latency and bandwidth
Jared C. Smolens, Brian T. Gold, Jangwoo Kim, Babak Falsafi, James C. Hoe, Andreas G. Nowatzyk
Pages: 224 - 234
doi>10.1145/1024393.1024420
Full text: Pdf

Recent studies have suggested that the soft-error rate in microprocessor logic will become a reliability concern by 2010. This paper proposes an efficient error detection technique, called fingerprinting, that detects differences in execution ... expand
Application-level checkpointing for shared memory programs
Greg Bronevetsky, Daniel Marques, Keshav Pingali, Peter Szwed, Martin Schulz
Pages: 235 - 247
doi>10.1145/1024393.1024421
Full text: Pdf

Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR) - the state of the computation is saved periodically on disk, and ... expand
SESSION: Power
Formal online methods for voltage/frequency control in multiple clock domain microprocessors
Qiang Wu, Philo Juang, Margaret Martonosi, Douglas W. Clark
Pages: 248 - 259
doi>10.1145/1024393.1024423
Full text: Pdf

Multiple Clock Domain (MCD) processors are a promising future alternative to today's fully synchronous designs. Dynamic Voltage and Frequency Scaling (DVFS) in an MCD processor has the extra flexibility to adjust the voltage and frequency in each domain ... expand
Heat-and-run: leveraging SMT and CMP to manage power density through the operating system
Mohamed Gomaa, Michael D. Powell, T. N. Vijaykumar
Pages: 260 - 270
doi>10.1145/1024393.1024424
Full text: Pdf

Power density in high-performance processors continues to increase with technology generations as scaling of current, clock speed, and device density outpaces the downscaling of supply voltage and thermal ability of packages to dissipate heat. Power ... expand
Performance directed energy management for main memory and disks
Xiaodong Li, Zhenmin Li, Francis David, Pin Zhou, Yuanyuan Zhou, Sarita Adve, Sanjeev Kumar
Pages: 271 - 283
doi>10.1145/1024393.1024425
Full text: Pdf

Much research has been conducted on energy management for memory and disks. Most studies use control algorithms that dynamically transition devices to low power modes after they are idle for a certain threshold period of time. The control algorithms ... expand

2006

	Impact of virtualization on computer architecture and operating systems
Mendel Rosenblum
Pages: 1 - 1
doi>10.1145/1168857.1168858
Full text: Pdf

This talk describes how virtualization is changing the way computing is done in the industry today and how it is causing users to rethink how they view hardware, operating systems, and application programs. The talk will describe this new view ... expand
SESSION: Virtualization
A comparison of software and hardware techniques for x86 virtualization
Keith Adams, Ole Agesen
Pages: 2 - 13
doi>10.1145/1168857.1168860
Full text: Pdf

Until recently, the x86 architecture has not permitted classical trap-and-emulate virtualization. Virtual Machine Monitors for x86, such as VMware® Workstation and Virtual PC, have instead used binary translation of the guest kernel code. However, ... expand
Geiger: monitoring the buffer cache in a virtual machine environment
Stephen T. Jones, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Pages: 14 - 24
doi>10.1145/1168857.1168861
Full text: Pdf

Virtualization is increasingly being used to address server management and administration issues like flexible resource allocation, service isolation and workload migration. In a virtualized environment, the virtual machine monitor (VMM) is the primary ... expand
Temporal search: detecting hidden malware timebombs with virtual machines
Jedidiah R. Crandall, Gary Wassermann, Daniela A. S. de Oliveira, Zhendong Su, S. Felix Wu, Frederic T. Chong
Pages: 25 - 36
doi>10.1145/1168857.1168862
Full text: Pdf

Worms, viruses, and other malware can be ticking bombs counting down to a specific time, when they might, for example, delete files or download new instructions from a public web server. We propose a novel virtual-machine-based analysis technique to ... expand
SESSION: Races and memory debugging I
AVIO: detecting atomicity violations via access interleaving invariants
Shan Lu, Joseph Tucek, Feng Qin, Yuanyuan Zhou
Pages: 37 - 48
doi>10.1145/1168857.1168864
Full text: Pdf

Concurrency bugs are among the most difficult to test and diagnose of all software bugs. The multicore technology trend worsens this problem. Most previous concurrency bug detection work focuses on one bug subclass, data races, and neglects many other ... expand
A regulated transitive reduction (RTR) for longer memory race recording
Min Xu, Mark D. Hill, Rastislav Bodik
Pages: 49 - 60
doi>10.1145/1168857.1168865
Full text: Pdf

Multithreaded deterministic replay has important applications in cyclic debugging, fault tolerance and intrusion analysis. Memory race recording is a key technology for multithreaded deterministic replay. In this paper, we considerably ... expand
Bell: bit-encoding online memory leak detection
Michael D. Bond, Kathryn S. McKinley
Pages: 61 - 72
doi>10.1145/1168857.1168866
Full text: Pdf

Memory leaks compromise availability and security by crippling performance and crashing programs. Leaks are difficult to diagnose because they have no immediate symptoms. Online leak detection tools benefit from storing and reporting per-object sites ... expand
SESSION: Hardware reliability and fault tolerance
Ultra low-cost defect protection for microprocessor pipelines
Smitha Shyam, Kypros Constantinides, Sujay Phadke, Valeria Bertacco, Todd Austin
Pages: 73 - 82
doi>10.1145/1168857.1168868
Full text: Pdf

The sustained push toward smaller and smaller technology sizes has reached a point where device reliability has moved to the forefront of concerns for next-generation designs. Silicon failure mechanisms, such as transistor wearout and manufacturing defects, ... expand
Understanding prediction-based partial redundant threading for low-overhead, high-coverage fault tolerance
Vimal K. Reddy, Eric Rotenberg, Sailashri Parthasarathy
Pages: 83 - 94
doi>10.1145/1168857.1168869
Full text: Pdf

Redundant threading architectures duplicate all instructions to detect and possibly recover from transient faults. Several lighter weight Partial Redundant Threading (PRT) architectures have been proposed recently. (i) Opportunistic Fault Tolerance duplicates ... expand
SlicK: slice-based locality exploitation for efficient redundant multithreading
Angshuman Parashar, Anand Sivasubramaniam, Sudhanva Gurumurthi
Pages: 95 - 105
doi>10.1145/1168857.1168870
Full text: Pdf

Transient faults are expected to be a major design consideration in future microprocessors. Recent proposals for transient fault detection in processor cores have revolved around the idea of redundant threading, which involves redundant execution of a ... expand
SESSION: Energy efficiency
Mercury and freon: temperature emulation and management for server systems
Taliver Heath, Ana Paula Centeno, Pradeep George, Luiz Ramos, Yogesh Jaluria, Ricardo Bianchini
Pages: 106 - 116
doi>10.1145/1168857.1168872
Full text: Pdf

Power densities have been increasing rapidly at all levels of server systems. To counter the high temperatures resulting from these densities, systems researchers have recently started work on software-based thermal management. Unfortunately, research ... expand
PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor
Taeho Kgil, Shaun D'Souza, Ali Saidi, Nathan Binkert, Ronald Dreslinski, Trevor Mudge, Steven Reinhardt, Krisztian Flautner
Pages: 117 - 128
doi>10.1145/1168857.1168873
Full text: Pdf

In this paper, we show how 3D stacking technology can be used to implement a simple, low-power, high-performance chip multiprocessor suitable for throughput processing. Our proposed architecture, PicoServer, employs 3D technology to bond one die containing ... expand
SESSION: Scheduling and spatial programming
A spatial path scheduling algorithm for EDGE architectures
Katherine E. Coons, Xia Chen, Doug Burger, Kathryn S. McKinley, Sundeep K. Kushwaha
Pages: 129 - 140
doi>10.1145/1168857.1168875
Full text: Pdf

Growing on-chip wire delays are motivating architectural features that expose on-chip communication to the compiler. EDGE architectures are one example of communication-exposed microarchitectures in which the compiler forms dataflow graphs that specify ... expand
Instruction scheduling for a tiled dataflow architecture
Martha Mercaldi, Steven Swanson, Andrew Petersen, Andrew Putnam, Andrew Schwerin, Mark Oskin, Susan J. Eggers
Pages: 141 - 150
doi>10.1145/1168857.1168876
Full text: Pdf

This paper explores hierarchical instruction scheduling for a tiled processor. Our results show that at the top level of the hierarchy, a simple profile-driven algorithm effectively minimizes operand latency. After this schedule has been partitioned ... expand
Exploiting coarse-grained task, data, and pipeline parallelism in stream programs
Michael I. Gordon, William Thies, Saman Amarasinghe
Pages: 151 - 162
doi>10.1145/1168857.1168877
Full text: Pdf

As multicore architectures enter the mainstream, there is a pressing demand for high-level programming models that can effectively map to them. Stream programming offers an attractive way to expose coarse-grained parallelism, as streaming applications ... expand
Tartan: evaluating spatial computation for whole program execution
Mahim Mishra, Timothy J. Callahan, Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein, Mihai Budiu
Pages: 163 - 174
doi>10.1145/1168857.1168878
Full text: Pdf

Spatial Computing (SC) has been shown to be an energy-efficient model for implementing program kernels. In this paper we explore the feasibility of using SC for more than small kernels. To this end, we evaluate the performance and energy efficiency of ... expand
SESSION: Estimation and prediction of power and performance
A performance counter architecture for computing accurate CPI components
Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, James E. Smith
Pages: 175 - 184
doi>10.1145/1168857.1168880
Full text: Pdf

A common way of representing processor performance is to use Cycles per Instruction (CPI) 'stacks' which break performance into a baseline CPI plus a number of individual miss event CPI components. CPI stacks can be very helpful in gaining insight into ... expand
Accurate and efficient regression modeling for microarchitectural performance and power prediction
Benjamin C. Lee, David M. Brooks
Pages: 185 - 194
doi>10.1145/1168857.1168881
Full text: Pdf

We propose regression modeling as an efficient approach for accurately predicting performance and power for various applications executing on any microprocessor configuration in a large microarchitectural design space. This paper addresses fundamental ... expand
Efficiently exploring architectural design spaces via predictive modeling
Engin İpek, Sally A. McKee, Rich Caruana, Bronis R. de Supinski, Martin Schulz
Pages: 195 - 206
doi>10.1145/1168857.1168882
Full text: Pdf

Architects use cycle-by-cycle simulation to evaluate design choices and understand tradeoffs and interactions among design parameters. Efficiently exploring exponential-size design spaces with many interacting parameters remains an open problem: the ... expand
SESSION: Races and memory debugging II
Comprehensively and efficiently protecting the heap
Mazen Kharbutli, Xiaowei Jiang, Yan Solihin, Guru Venkataramani, Milos Prvulovic
Pages: 207 - 218
doi>10.1145/1168857.1168884
Full text: Pdf

The goal of this paper is to propose a scheme that provides comprehensive security protection for the heap. Heap vulnerabilities are increasingly being exploited for attacks on computer programs. In most implementations, the heap management library keeps ... expand
HeapMD: identifying heap-based bugs using anomaly detection
Trishul M. Chilimbi, Vinod Ganapathy
Pages: 219 - 228
doi>10.1145/1168857.1168885
Full text: Pdf

We present the design, implementation, and evaluation of HeapMD, a dynamic analysis tool that finds heap-based bugs using anomaly detection. HeapMD is based upon the observation that, in spite of the evolving nature of the heap, several of its properties ... expand
Recording shared memory dependencies using strata
Satish Narayanasamy, Cristiano Pereira, Brad Calder
Pages: 229 - 240
doi>10.1145/1168857.1168886
Full text: Pdf

Significant time is spent by companies trying to reproduce and fix bugs. BugNet and FDR are recent architecture proposals that provide architecture support for deterministic replay debugging. They focus on continuously recording information about the ... expand
SESSION: Emerging technologies
A defect tolerant self-organizing nanoscale SIMD architecture
Jaidev P. Patwardhan, Vijeta Johri, Chris Dwyer, Alvin R. Lebeck
Pages: 241 - 251
doi>10.1145/1168857.1168888
Full text: Pdf

The continual decrease in transistor size (through either scaled CMOS or emerging nano-technologies) promises to usher in an era of tera to peta-scale integration. However, this decrease in size is also likely to increase defect densities, contributing ... expand
A program transformation and architecture support for quantum uncomputation
Ethan Schuchman, T. N. Vijaykumar
Pages: 252 - 263
doi>10.1145/1168857.1168889
Full text: Pdf

Quantum computing's power comes from new algorithms that exploit quantum mechanical phenomena for computation. Quantum algorithms are different from their classical counterparts in that quantum algorithms rely on algorithmic structures that are simply ... expand
Introspective 3D chips
Shashidhar Mysore, Banit Agrawal, Navin Srivastava, Sheng-Chih Lin, Kaustav Banerjee, Tim Sherwood
Pages: 264 - 273
doi>10.1145/1168857.1168890
Full text: Pdf

While the number of transistors on a chip increases exponentially over time, the productivity that can be realized from these systems has not kept pace. To deal with the complexity of modern systems, software developers are increasingly dependent on ... expand
SESSION: Memory and locality issues
Stealth prefetching
Jason F. Cantin, Mikko H. Lipasti, James E. Smith
Pages: 274 - 282
doi>10.1145/1168857.1168892
Full text: Pdf

Prefetching in shared-memory multiprocessor systems is an increasingly difficult problem. As system designs grow to incorporate larger numbers of faster processors, memory latency and interconnect traffic increase. While aggressive prefetching techniques ... expand
Computation spreading: employing hardware migration to specialize CMP cores on-the-fly
Koushik Chakraborty, Philip M. Wells, Gurindar S. Sohi
Pages: 283 - 292
doi>10.1145/1168857.1168893
Full text: Pdf

In canonical parallel processing, the operating system (OS) assigns a processing core to a single thread from a multithreaded server application. Since different threads from the same application often carry out similar computation, albeit at different ... expand
Software-based instruction caching for embedded processors
Jason E. Miller, Anant Agarwal
Pages: 293 - 302
doi>10.1145/1168857.1168894
Full text: Pdf

While hardware instruction caches are present in virtually all general-purpose and high-performance microprocessors today, many embedded processors use SRAM or scratchpad memories instead. These are simple array memory structures that are directly addressed ... expand
SESSION: Embedded and special-purpose systems
Mapping esterel onto a multi-threaded embedded processor
Xin Li, Marian Boldt, Reinhard von Hanxleden
Pages: 303 - 314
doi>10.1145/1168857.1168896
Full text: Pdf

The synchronous language Esterel is well-suited for programming control-dominated reactive systems at the system level. It provides non-traditional control structures, in particular concurrency and various forms of preemption, which allow to concisely ... expand
Integrated network interfaces for high-bandwidth TCP/IP
Nathan L. Binkert, Ali G. Saidi, Steven K. Reinhardt
Pages: 315 - 324
doi>10.1145/1168857.1168897
Full text: Pdf

This paper proposes new network interface controller (NIC) designs that take advantage of integration with the host CPU to provide increased flexibility for operating system kernel-based performance optimization. We believe that this approach is more ... expand
Accelerator: using data parallelism to program GPUs for general-purpose uses
David Tarditi, Sidd Puri, Jose Oglesby
Pages: 325 - 335
doi>10.1145/1168857.1168898
Full text: Pdf

GPUs are difficult to program for general-purpose uses. Programmers can either learn graphics APIs and convert their applications to use graphics pipeline operations or they can use stream programming abstractions of GPUs. We describe Accelerator, a ... expand
SESSION: Transactional memory
Hybrid transactional memory
Peter Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir, Daniel Nussbaum
Pages: 336 - 346
doi>10.1145/1168857.1168900
Full text: Pdf

Transactional memory (TM) promises to substantially reduce the difficulty of writing correct, efficient, and scalable concurrent programs. But "bounded" and "best-effort" hardware TM proposals impose unreasonable constraints on programmers, while more ... expand
Unbounded page-based transactional memory
Weihaw Chuang, Satish Narayanasamy, Ganesh Venkatesh, Jack Sampson, Michael Van Biesbrouck, Gilles Pokam, Brad Calder, Osvaldo Colavin
Pages: 347 - 358
doi>10.1145/1168857.1168901
Full text: Pdf

Exploiting thread level parallelism is paramount in the multicore era. Transactions enable programmers to expose such parallelism by greatly simplifying the multi-threaded programming model. Virtualized transactions (unbounded in space and time) are ... expand
Supporting nested transactional memory in logTM
Michelle J. Moravan, Jayaram Bobba, Kevin E. Moore, Luke Yen, Mark D. Hill, Ben Liblit, Michael M. Swift, David A. Wood
Pages: 359 - 370
doi>10.1145/1168857.1168902
Full text: Pdf

Nested transactional memory (TM) facilitates software composition by letting one module invoke another without either knowing whether the other uses transactions. Closed nested transactions extend isolation of an inner transaction until the top-level ... expand
Tradeoffs in transactional memory virtualization
JaeWoong Chung, Chi Cao Minh, Austen McDonald, Travis Skare, Hassan Chafi, Brian D. Carlstrom, Christos Kozyrakis, Kunle Olukotun
Pages: 371 - 381
doi>10.1145/1168857.1168903
Full text: Pdf

For transactional memory (TM) to achieve widespread acceptance, transactions should not be limited to the physical resources of any specific hardware implementation. TM systems should guarantee correct execution even when transactions exceed scheduling ... expand
SESSION: Compilation
A new idiom recognition framework for exploiting hardware-assist instructions
Motohiro Kawahito, Hideaki Komatsu, Takao Moriyama, Hiroshi Inoue, Toshio Nakatani
Pages: 382 - 393
doi>10.1145/1168857.1168905
Full text: Pdf

Modern processors support hardware-assist instructions (such as TRT and TROT instructions on IBM zSeries) to accelerate certain functions such as delimiter search and character conversion. Such special instructions have often been used in high performance ... expand
Automatic generation of peephole superoptimizers
Sorav Bansal, Alex Aiken
Pages: 394 - 403
doi>10.1145/1168857.1168906
Full text: Pdf

Peephole optimizers are typically constructed using human-written pattern matching rules, an approach that requires expertise and time, as well as being less than systematic at exploiting all opportunities for optimization. We explore fully automatic ... expand
Combinatorial sketching for finite programs
Armando Solar-Lezama, Liviu Tancau, Rastislav Bodik, Sanjit Seshia, Vijay Saraswat
Pages: 404 - 415
doi>10.1145/1168857.1168907
Full text: Pdf

Sketching is a software synthesis approach where the programmer develops a partial implementation - a sketch - and a separate specification of the desired functionality. The synthesizer then completes the sketch to behave like the specification. The ... expand
A probabilistic pointer analysis for speculative optimizations
Jeff Da Silva, J. Gregory Steffan
Pages: 416 - 425
doi>10.1145/1168857.1168908
Full text: Pdf

Pointer analysis is a critical compiler analysis used to disambiguate the indirect memory references that result from the use of pointers and pointer-based data structures. A conventional pointer analysis deduces for every pair of pointers, at any program ... expand

2008


Toward molecular programming with DNA
Erik Winfree
Pages: 1-1
doi>10.1145/1346281.1346282
Full text: Flv  Mp3 Audio Only

Biological organisms are beautiful examples of programming. The program and data are stored in biological molecules such as DNA, RNA, and proteins; the algorithms are carried out by molecular and biochemical processes; and the end result is the creation ... expand
SESSION: Virtualization
Overshadow: a virtualization-based approach to retrofitting protection in commodity operating systems
Xiaoxin Chen, Tal Garfinkel, E. Christopher Lewis, Pratap Subrahmanyam, Carl A. Waldspurger, Dan Boneh, Jeffrey Dwoskin, Dan R.K. Ports
Pages: 2-13
doi>10.1145/1346281.1346284
Full text: PDF
Other formats:  Avi  Flv  Mp3 Audio Only

Commodity operating systems entrusted with securing sensitive data are remarkably large and complex, and consequently, frequently prone to compromise. To address this limitation, we introduce a virtual-machine-based system called Overshadow that protects ... expand
How low can you go?: recommendations for hardware-supported minimal TCB code execution
Jonathan M. McCune, Bryan Parno, Adrian Perrig, Michael K. Reiter, Arvind Seshadri
Pages: 14-25
doi>10.1145/1346281.1346285
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

We explore the extent to which newly available CPU-based security technology can reduce the Trusted Computing Base (TCB) for security-sensitive applications. We find that although this new technology represents a step in the right direction, significant ... expand
Accelerating two-dimensional page walks for virtualized systems
Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, Srilatha Manne
Pages: 26-35
doi>10.1145/1346281.1346286
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

Nested paging is a hardware solution for alleviating the software memory management overhead imposed by system virtualization. Nested paging complements existing page walk hardware to form a two-dimensional (2D) page walk, which reduces the need for ... expand
SESSION: Power
Efficiency trends and limits from comprehensive microarchitectural adaptivity
Benjamin C. Lee, David Brooks
Pages: 36-47
doi>10.1145/1346281.1346288
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

Increasing demand for power-efficient, high-performance computing requires tuning applications and/or the underlying hardware to improve the mapping between workload heterogeneity and computational resources. To assess the potential benefits of hardware ... expand
No "power" struggles: coordinated multi-level power management for the data center
Ramya Raghavendra, Parthasarathy Ranganathan, Vanish Talwar, Zhikui Wang, Xiaoyun Zhu
Pages: 48-59
doi>10.1145/1346281.1346289
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

Power delivery, electricity consumption, and heat management are becoming key challenges in data center environments. Several past solutions have individually evaluated different techniques to address separate aspects of this problem, in hardware and ... expand
Exploiting access semantics and program behavior to reduce snoop power in chip multiprocessors
Chinnakrishnan S. Ballapuram, Ahmad Sharif, Hsien-Hsin S. Lee
Pages: 60-69
doi>10.1145/1346281.1346290
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

Integrating more processor cores on-die has become the unanimous trend in the microprocessor industry. Most current research thrusts use chip multiprocessors (CMPs) as the baseline to analyze problems in various domains. One of the main design ... expand
PICSEL: measuring user-perceived performance to control dynamic frequency scaling
Arindam Mallik, Jack Cosgrove, Robert P. Dick, Gokhan Memik, Peter Dinda
Pages: 70-79
doi>10.1145/1346281.1346291
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

The ultimate goal of a computer system is to satisfy its users. The success of architectural or system-level optimizations depends largely on having accurate metrics for user satisfaction. We propose to derive such metrics from information that is "close ... expand
SESSION: Programming
Improving the performance of object-oriented languages with dynamic predication of indirect jumps
Jose A. Joao, Onur Mutlu, Hyesoon Kim, Rishi Agarwal, Yale N. Patt
Pages: 80-90
doi>10.1145/1346281.1346293
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

Indirect jump instructions are used to implement increasingly-common programming constructs such as virtual function calls, switch-case statements, jump tables, and interface calls. The performance impact of indirect jumps is likely to increase because ... expand
The mapping collector: virtual memory support for generational, parallel, and concurrent compaction
Michal Wegiel, Chandra Krintz
Pages: 91-102
doi>10.1145/1346281.1346294
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

Parallel and concurrent garbage collectors are increasingly employed by managed runtime environments (MREs) to maintain scalability, as multi-core architectures and multi-threaded applications become pervasive. Moreover, state-of-the-art MREs commonly ... expand
Hardbound: architectural support for spatial safety of the C programming language
Joe Devietti, Colin Blundell, Milo M. K. Martin, Steve Zdancewic
Pages: 103-114
doi>10.1145/1346281.1346295
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

The C programming language is at least as well known for its absence of spatial memory safety guarantees (i.e., lack of bounds checking) as it is for its high performance. C's unchecked pointer arithmetic and array indexing allow simple programming mistakes ... expand
Archipelago: trading address space for reliability and security
Vitaliy B. Lvin, Gene Novark, Emery D. Berger, Benjamin G. Zorn
Pages: 115-124
doi>10.1145/1346281.1346296
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

Memory errors are a notorious source of security vulnerabilities that can lead to service interruptions, information leakage and unauthorized access. Because such errors are also difficult to debug, the absence of timely patches can leave users vulnerable ... expand
SESSION: Microarchitecture
Accurate branch prediction for short threads
Bumyong Choi, Leo Porter, Dean M. Tullsen
Pages: 125-134
doi>10.1145/1346281.1346298
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

Multi-core processors, with low communication costs and high availability of execution cores, will increase the use of execution and compilation models that use short threads to expose parallelism. Current branch predictors seek to incorporate large ... expand
Adaptive set pinning: managing shared caches in chip multiprocessors
Shekhar Srikantaiah, Mahmut Kandemir, Mary Jane Irwin
Pages: 135-144
doi>10.1145/1346281.1346299
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

As part of the trend towards Chip Multiprocessors (CMPs) for the next leap in computing performance, many architectures have explored sharing the last level of cache among different processors for better performance-cost ratio and improved resource allocation. ... expand
SoftSig: software-exposed hardware signatures for code analysis and optimization
James Tuck, Wonsun Ahn, Luis Ceze, Josep Torrellas
Pages: 145-156
doi>10.1145/1346281.1346300
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

Many code analysis techniques for optimization, debugging, or parallelization need to perform runtime disambiguation of sets of addresses. Such operations can be supported efficiently and with low complexity with hardware signatures. To enable flexible ... expand
Predictor virtualization
Ioana Burcea, Stephen Somogyi, Andreas Moshovos, Babak Falsafi
Pages: 157-167
doi>10.1145/1346281.1346301
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

Many hardware optimizations rely on collecting information about program behavior at runtime. This information is stored in lookup tables. To be accurate and effective, these optimizations usually require large dedicated on-chip tables. Although technology ... expand
SESSION: Performance
The design and implementation of microdrivers
Vinod Ganapathy, Matthew J. Renzelmann, Arini Balakrishnan, Michael M. Swift, Somesh Jha
Pages: 168-178
doi>10.1145/1346281.1346303
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

Device drivers commonly execute in the kernel to achieve high performance and easy access to kernel services. However, this comes at the price of decreased reliability and increased programming difficulty. Driver programmers are unable to use user-mode ... expand
Tapping into the fountain of CPUs: on operating system support for programmable devices
Yaron Weinsberg, Danny Dolev, Tal Anker, Muli Ben-Yehuda, Pete Wyckoff
Pages: 179-188
doi>10.1145/1346281.1346304
Full text: PDF
Other formats:  Flv  Mp3 Audio Only

The constant race for faster and more powerful CPUs is drawing to a close. No longer is it feasible to significantly increase the speed of the CPU without paying a crushing penalty in power consumption and production costs. Instead of increasing single ... expand
SESSION: OS
Hardware counter driven on-the-fly request signatures
Kai Shen, Ming Zhong, Sandhya Dwarkadas, Chuanpeng Li, Christopher Stewart, Xiao Zhang
Pages: 189-200
doi>10.1145/1346281.1346306
Full text: PDF

Today's processors provide a rich source of statistical information on application execution through hardware counters. In this paper, we explore the utilization of these statistics as request signatures in server applications for identifying requests ... expand
Dispersing proprietary applications as benchmarks through code mutation
Luk Van Ertvelde, Lieven Eeckhout
Pages: 201-210
doi>10.1145/1346281.1346307
Full text: PDF

Industry vendors hesitate to disseminate proprietary applications to academia and third party vendors. By consequence, the benchmarking process is typically driven by standardized, open-source benchmarks which may be very different from and likely not ... expand
Understanding and visualizing full systems with data flow tomography
Shashidhar Mysore, Bita Mazloom, Banit Agrawal, Timothy Sherwood
Pages: 211-221
doi>10.1145/1346281.1346308
Full text: PDF

It is not uncommon for modern systems to be composed of a variety of interacting services, running across multiple machines in such a way that most developers do not really understand the whole system. As abstraction is layered atop abstraction, developers ... expand
SESSION: Compiler
Communication optimizations for global multi-threaded instruction scheduling
Guilherme Ottoni, David I. August
Pages: 222-232
doi>10.1145/1346281.1346310
Full text: PDF

The recent shift in the industry towards chip multiprocessor (CMP) designs has brought the need for multi-threaded applications to mainstream computing. As observed in several limit studies, most of the parallelization opportunities require looking for ... expand
Optimistic parallelism benefits from data partitioning
Milind Kulkarni, Keshav Pingali, Ganesh Ramanarayanan, Bruce Walter, Kavita Bala, L. Paul Chew
Pages: 233-243
doi>10.1145/1346281.1346311
Full text: PDF

Recent studies of irregular applications such as finite-element mesh generators and data-clustering codes have shown that these applications have a generalized data parallelism arising from the use of iterative algorithms that perform computations on ... expand
Xoc, an extension-oriented compiler for systems programming
Russ Cox, Tom Bergan, Austin T. Clements, Frans Kaashoek, Eddie Kohler
Pages: 244-254
doi>10.1145/1346281.1346312
Full text: PDF

Today's system programmers go to great lengths to extend the languages in which they program. For instance, system-specific compilers find errors in Linux and other systems, and add support for specialized control flow to Qt and event-based programs. ... expand
SESSION: Fault tolerance
Adapting to intermittent faults in multicore systems
Philip M. Wells, Koushik Chakraborty, Gurindar S. Sohi
Pages: 255-264
doi>10.1145/1346281.1346314
Full text: PDF

Future multicore processors will be more susceptible to a variety of hardware failures. In particular, intermittent faults, caused in part by manufacturing, thermal, and voltage variations, can cause bursts of frequent faults that last from several ... expand
Understanding the propagation of hard errors to software and implications for resilient system design
Man-Lap Li, Pradeep Ramachandran, Swarup Kumar Sahoo, Sarita V. Adve, Vikram S. Adve, Yuanyuan Zhou
Pages: 265-276
doi>10.1145/1346281.1346315
Full text: PDF

With continued CMOS scaling, future shipped hardware will be increasingly vulnerable to in-the-field faults. To be broadly deployable, the hardware reliability solution must incur low overheads, precluding use of expensive redundancy. We explore a cooperative ... expand
SESSION: Parallelism
Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs
M. Aater Suleman, Moinuddin K. Qureshi, Yale N. Patt
Pages: 277-286
doi>10.1145/1346281.1346317
Full text: PDF

Extracting high-performance from the emerging Chip Multiprocessors (CMPs) requires that the application be divided into multiple threads. Each thread executes on a separate core thereby increasing concurrency and improving performance. As the number ... expand
Merge: a programming model for heterogeneous multi-core systems
Michael D. Linderman, Jamison D. Collins, Hong Wang, Teresa H. Meng
Pages: 287-296
doi>10.1145/1346281.1346318
Full text: PDF

In this paper we propose the Merge framework, a general purpose programming model for heterogeneous multi-core systems. The Merge framework replaces current ad hoc approaches to parallel programming on heterogeneous platforms with a rigorous, library-based ... expand
Streamware: programming general-purpose multicore processors using streams
Jayanth Gummaraju, Joel Coburn, Yoshio Turner, Mendel Rosenblum
Pages: 297-307
doi>10.1145/1346281.1346319
Full text: PDF

Recently, the number of cores on general-purpose processors has been increasing rapidly. Using conventional programming models, it is challenging to effectively exploit these cores for maximal performance. An interesting alternative candidate for programming ... expand
SESSION: Security & bugs
Parallelizing security checks on commodity hardware
Edmund B. Nightingale, Daniel Peek, Peter M. Chen, Jason Flinn
Pages: 308-318
doi>10.1145/1346281.1346321
Full text: PDF

Speck (Speculative Parallel Check) is a system that accelerates powerful security checks on commodity hardware by executing them in parallel on multiple cores. Speck provides an infrastructure that allows sequential invocations of a particular ... expand
Better bug reporting with better privacy
Miguel Castro, Manuel Costa, Jean-Philippe Martin
Pages: 319-328
doi>10.1145/1346281.1346322
Full text: PDF

Software vendors collect bug reports from customers to improve the quality of their software. These reports should include the inputs that make the software fail, to enable vendors to reproduce the bug. However, vendors rarely include these inputs in ... expand
Learning from mistakes: a comprehensive study on real world concurrency bug characteristics
Shan Lu, Soyeon Park, Eunsoo Seo, Yuanyuan Zhou
Pages: 329-339
doi>10.1145/1346281.1346323
Full text: PDF

The reality of multi-core hardware has made concurrent programs pervasive. Unfortunately, writing correct concurrent programs is difficult. Addressing this challenge requires advances in multiple directions, including concurrency bug detection, concurrent ... expand

2009

	An evaluation of the TRIPS computer system
Mark Gebhart, Bertrand A. Maher, Katherine E. Coons, Jeff Diamond, Paul Gratz, Mario Marino, Nitya Ranganathan, Behnam Robatmili, Aaron Smith, James Burrill, Stephen W. Keckler, Doug Burger, Kathryn S. McKinley
Pages: 1-12
doi>10.1145/1508244.1508246
Full text: Pdf

The TRIPS system employs a new instruction set architecture (ISA) called Explicit Data Graph Execution (EDGE) that renegotiates the boundary between hardware and software to expose and exploit concurrency. EDGE ISAs use a block-atomic execution model ... expand
Architectural implications of nanoscale integrated sensing and computing
Constantin Pistol, Wutichai Chongchitmate, Christopher Dwyer, Alvin R. Lebeck
Pages: 13-24
doi>10.1145/1508244.1508247
Full text: Pdf

This paper explores the architectural implications of integrating computation and molecular probes to form nanoscale sensor processors (nSP). We show how nSPs may enable new computing domains and automate tasks that currently require expert scientific ... expand
SESSION: Reliable systems I
CTrigger: exposing atomicity violation bugs from their hiding places
Soyeon Park, Shan Lu, Yuanyuan Zhou
Pages: 25-36
doi>10.1145/1508244.1508249
Full text: Pdf

Multicore hardware is making concurrent programs pervasive. Unfortunately, concurrent programs are prone to bugs. Among different types of concurrency bugs, atomicity violation bugs are common and important. Existing techniques to detect atomicity violation ... expand
ASSURE: automatic software self-healing using rescue points
Stelios Sidiroglou, Oren Laadan, Carlos Perez, Nicolas Viennot, Jason Nieh, Angelos D. Keromytis
Pages: 37-48
doi>10.1145/1508244.1508250
Full text: Pdf

Software failures in server applications are a significant problem for preserving system availability. We present ASSURE, a system that introduces rescue points that recover software from unknown faults while maintaining both system integrity and availability, ... expand
Recovery domains: an organizing principle for recoverable operating systems
Andrew Lenharth, Vikram S. Adve, Samuel T. King
Pages: 49-60
doi>10.1145/1508244.1508251
Full text: Pdf

We describe a strategy for enabling existing commodity operating systems to recover from unexpected run-time errors in nearly any part of the kernel, including core kernel components. Our approach is dynamic and request-oriented; it isolates the effects ... expand
Anomaly-based bug prediction, isolation, and validation: an automated approach for software debugging
Martin Dimitrov, Huiyang Zhou
Pages: 61-72
doi>10.1145/1508244.1508252
Full text: Pdf

Software defects, commonly known as bugs, present a serious challenge for system reliability and dependability. Once a program failure is observed, the debugging activities to locate the defects are typically nontrivial and time consuming. In this paper, ... expand
SESSION: Deterministic multiprocessing
Capo: a software-hardware interface for practical deterministic multiprocessor replay
Pablo Montesinos, Matthew Hicks, Samuel T. King, Josep Torrellas
Pages: 73-84
doi>10.1145/1508244.1508254
Full text: Pdf

While deterministic replay of parallel programs is a powerful technique, current proposals have shortcomings. Specifically, software-based replay systems have high overheads on multiprocessors, while hardware-based proposals focus only on basic hardware-level ... expand
DMP: deterministic shared memory multiprocessing
Joseph Devietti, Brandon Lucia, Luis Ceze, Mark Oskin
Pages: 85-96
doi>10.1145/1508244.1508255
Full text: Pdf

Current shared memory multicore and multiprocessor systems are nondeterministic. Each time these systems execute a multithreaded application, even if supplied with the same input, they can produce a different output. This frustrates debugging and limits ... expand
Kendo: efficient deterministic multithreading in software
Marek Olszewski, Jason Ansel, Saman Amarasinghe
Pages: 97-108
doi>10.1145/1508244.1508256
Full text: Pdf

Although chip-multiprocessors have become the industry standard, developing parallel applications that target them remains a daunting task. Non-determinism, inherent in threaded applications, causes significant challenges for parallel programmers by ... expand
SESSION: Prediction and accounting
Complete information flow tracking from the gates up
Mohit Tiwari, Hassan M.G. Wassel, Bita Mazloom, Shashidhar Mysore, Frederic T. Chong, Timothy Sherwood
Pages: 109-120
doi>10.1145/1508244.1508258
Full text: Pdf

For many mission-critical tasks, tight guarantees on the flow of information are desirable, for example, when handling important cryptographic keys or sensitive financial data. We present a novel architecture capable of tracking all information flow ... expand
RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations
David K. Tam, Reza Azimi, Livio B. Soares, Michael Stumm
Pages: 121-132
doi>10.1145/1508244.1508259
Full text: Pdf

Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been assumed ... expand
Per-thread cycle accounting in SMT processors
Stijn Eyerman, Lieven Eeckhout
Pages: 133-144
doi>10.1145/1508244.1508260
Full text: Pdf

This paper proposes a cycle accounting architecture for Simultaneous Multithreading (SMT) processors that estimates the execution times for each of the threads had they been executed alone, while they are running simultaneously on the SMT processor. ... expand
SESSION: Transactional memories
Maximum benefit from a minimal HTM
Owen S. Hofmann, Christopher J. Rossbach, Emmett Witchel
Pages: 145-156
doi>10.1145/1508244.1508262
Full text: Pdf

A minimal, bounded hardware transactional memory implementation significantly improves synchronization performance when used in an operating system kernel. We add HTM to Linux 2.4, a kernel with a simple, coarse-grained synchronization structure. The ... expand
Early experience with a commercial hardware transactional memory implementation
Dave Dice, Yossi Lev, Mark Moir, Daniel Nussbaum
Pages: 157-168
doi>10.1145/1508244.1508263
Full text: Pdf

We report on our experience with the hardware transactional memory (HTM) feature of two pre-production revisions of a new commercial multicore processor. Our experience includes a number of promising results using HTM to improve performance in a variety ... expand
SESSION: Reliable systems II
Mixed-mode multicore reliability
Philip M. Wells, Koushik Chakraborty, Gurindar S. Sohi
Pages: 169-180
doi>10.1145/1508244.1508265
Full text: Pdf

Future processors are expected to observe increasing rates of hardware faults. Using Dual-Modular Redundancy (DMR), two cores of a multicore can be loosely coupled to redundantly execute a single software thread, providing very high coverage from many ... expand
ISOLATOR: dynamically ensuring isolation in concurrent programs
Sriram Rajamani, G. Ramalingam, Venkatesh Prasad Ranganath, Kapil Vaswani
Pages: 181-192
doi>10.1145/1508244.1508266
Full text: Pdf

In this paper, we focus on concurrent programs that use locks to achieve isolation of data accessed by critical sections of code. We present ISOLATOR, an algorithm that guarantees isolation for well-behaved threads of a program that obey a locking discipline ... expand
Efficient online validation with delta execution
Joseph Tucek, Weiwei Xiong, Yuanyuan Zhou
Pages: 193-204
doi>10.1145/1508244.1508267
Full text: Pdf

Software systems are constantly changing. Patches to fix bugs and patches to add features are all too common. Every change risks breaking a previously working system. Hence administrators loathe change, and are willing to delay even critical security ... expand
SESSION: Power and storage in enterprise systems
PowerNap: eliminating server idle power
David Meisner, Brian T. Gold, Thomas F. Wenisch
Pages: 205-216
doi>10.1145/1508244.1508269
Full text: Pdf

Data center power consumption is growing to unprecedented levels: the EPA estimates U.S. data centers will consume 100 billion kilowatt hours annually by 2011. Much of this energy is wasted in idle systems: in typical deployments, server utilization ... expand
Gordon: using flash memory to build fast, power-efficient clusters for data-intensive applications
Adrian M. Caulfield, Laura M. Grupp, Steven Swanson
Pages: 217-228
doi>10.1145/1508244.1508270
Full text: Pdf

As our society becomes more information-driven, we have begun to amass data at an astounding and accelerating rate. At the same time, power concerns have made it difficult to bring the necessary processing power to bear on querying, processing, and understanding ... expand
DFTL: a flash translation layer employing demand-based selective caching of page-level address mappings
Aayush Gupta, Youngjae Kim, Bhuvan Urgaonkar
Pages: 229-240
doi>10.1145/1508244.1508271
Full text: Pdf

Recent technological advances in the development of flash-memory based devices have consolidated their leadership position as the preferred storage media in the embedded systems market and opened new vistas for deployment in enterprise-scale storage ... expand
SESSION: Potpourri
Commutativity analysis for software parallelization: letting program transformations see the big picture
Farhana Aleen, Nathan Clark
Pages: 241-252
doi>10.1145/1508244.1508273
Full text: Pdf

Extracting performance from many-core architectures requires software engineers to create multi-threaded applications, which significantly complicates the already daunting task of software development. One solution to this problem is automatic compile-time ... expand
Accelerating critical section execution with asymmetric multi-core architectures
M. Aater Suleman, Onur Mutlu, Moinuddin K. Qureshi, Yale N. Patt
Pages: 253-264
doi>10.1145/1508244.1508274
Full text: Pdf

To improve the performance of a single application on Chip Multiprocessors (CMPs), the application must be split into threads which execute concurrently on multiple cores. In multi-threaded applications, critical sections are used to ensure that only ... expand
Producing wrong data without doing anything obviously wrong!
Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, Peter F. Sweeney
Pages: 265-276
doi>10.1145/1508244.1508275
Full text: Pdf

This paper presents a surprising result: changing a seemingly innocuous aspect of an experimental setup can cause a systems researcher to draw wrong conclusions from an experiment. What appears to be an innocuous aspect in the experimental setup may ... expand
SESSION: Managed systems
Leak pruning
Michael D. Bond, Kathryn S. McKinley
Pages: 277-288
doi>10.1145/1508244.1508277
Full text: Pdf

Managed languages improve programmer productivity with type safety and garbage collection, which eliminate memory errors such as dangling pointers, double frees, and buffer overflows. However, because garbage collection uses reachability to over-approximate ... expand
Dynamic prediction of collection yield for managed runtimes
Michal Wegiel, Chandra Krintz
Pages: 289-300
doi>10.1145/1508244.1508278
Full text: Pdf

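The growth in complexity of modern systems makes it increasingly difficult to extract high-performance. The software stacks for such systems typically consist of multiple layers and include managed runtime environments (MREs). In this paper, we investigate ... expand

	Technology for developing regions: Moore's law is not enough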
Eric A. Brewer
Pages: 1-2
doi>10.1145/1736020.1736021
Full text: PDF

The historic focus of development has rightfully been on macroeconomics and good governance, but technology has an increasingly large role to play. In this talk, I review several novel technologies that we have deployed in India and Africa, and discuss ... expand
SESSION: Novel architectures
Dynamically replicated memory: building reliable systems from nanoscale resistive memories
Engin Ipek, Jeremy Condit, Edmund B. Nightingale, Doug Burger, Thomas Moscibroda
Pages: 3-14
doi>10.1145/1736020.1736023
Full text: PDF

DRAM is facing severe scalability challenges in sub-45nm technology nodes due to precise charge placement and sensing hurdles in deep-submicron geometries. Resistive memories, such as phase-change memory (PCM), already scale well beyond DRAM and ... expand
A power-efficient all-optical on-chip interconnect using wavelength-based oblivious routing
Nevin Kirman, José F. Martínez
Pages: 15-28
doi>10.1145/1736020.1736024
Full text: PDF

We present an all-optical approach to constructing data networks on chip that combines the following key features: (1) Wavelength-based routing, where the route followed by a packet depends solely on the wavelength of its carrier signal, and not on information ... expand
SESSION: Compilers and run-time systems
A real system evaluation of hardware atomicity for software speculation
Naveen Neelakantam, David R. Ditzel, Craig Zilles
Pages: 29-38
doi>10.1145/1736020.1736026
Full text: PDF

In this paper we evaluate the atomic region compiler abstraction by incorporating it into a commercial system. We find that atomic regions are simple and intuitive to integrate into an x86 binary-translation system. Furthermore, doing so trivially enables ... expand
Dynamic filtering: multi-purpose architecture support for language runtime systems
Tim Harris, Saša Tomic, Adrián Cristal, Osman Unsal
Pages: 39-52
doi>10.1145/1736020.1736027
Full text: PDF

This paper introduces a new abstraction to accelerate the read-barriers and write-barriers used by language runtime systems. We exploit the fact that, dynamically, many barrier executions perform checks but no real work -- e.g., in generational garbage ... expand
SESSION: Parallel programming 1
CoreDet: a compiler and runtime system for deterministic multithreaded execution
Tom Bergan, Owen Anderson, Joseph Devietti, Luis Ceze, Dan Grossman
Pages: 53-64
doi>10.1145/1736020.1736029
Full text: PDF

The behavior of a multithreaded program does not depend only on its inputs. Scheduling, memory reordering, timing, and low-level hardware effects all introduce nondeterminism in the execution of multithreaded programs. This severely complicates many ... expand
Speculative parallelization using software multi-threaded transactions
Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, David I. August
Pages: 65-76
doi>10.1145/1736020.1736030
Full text: PDF

With the right techniques, multicore architectures may be able to continue the exponential performance trend that elevated the performance of applications of all types for decades. While many scientific programs can be parallelized without speculative ... expand
Respec: efficient online multiprocessor replay via speculation and external determinism
Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, Jason Flinn
Pages: 77-90
doi>10.1145/1736020.1736031
Full text: PDF

Deterministic replay systems record and reproduce the execution of a hardware or software system. While it is well known how to replay uniprocessor systems, replaying shared memory multiprocessor systems at low overhead on commodity hardware is still ... expand
SESSION: Scheduling in parallel systems
Probabilistic job symbiosis modeling for SMT processor scheduling
Stijn Eyerman, Lieven Eeckhout
Pages: 91-102
doi>10.1145/1736020.1736033
Full text: PDF

Symbiotic job scheduling boosts simultaneous multithreading (SMT) processor performance by co-scheduling jobs that have 'compatible' demands on the processor's shared resources. Existing approaches however require a sampling phase, evaluate a limited ... expand
Request behavior variations
Kai Shen
Pages: 103-116
doi>10.1145/1736020.1736034
Full text: PDF

A large number of user requests execute (often concurrently) within a server system. A single request may exhibit fluctuating hardware characteristics (such as instruction completion rate and on-chip resource usage) over the course of its execution, ... expand
Decoupling contention management from scheduling
F. Ryan Johnson, Radu Stoica, Anastasia Ailamaki, Todd C. Mowry
Pages: 117-128
doi>10.1145/1736020.1736035
Full text: PDF

Many parallel applications exhibit unpredictable communication between threads, leading to contention for shared objects. The choice of contention management strategy impacts strongly the performance and scalability of these applications: spinning provides ... expand
Addressing shared resource contention in multicore processors via scheduling
Sergey Zhuravlev, Sergey Blagodurov, Alexandra Fedorova
Pages: 129-142
doi>10.1145/1736020.1736036
Full text: PDF

Contention for shared resources on multicore processors remains an unsolved problem in existing systems despite significant research efforts dedicated to this problem in the past. Previous solutions focused primarily on hardware techniques and software ... expand
SESSION: Software reliability
SherLog: error diagnosis by connecting clues from run-time logs
Ding Yuan, Haohui Mai, Weiwei Xiong, Lin Tan, Yuanyuan Zhou, Shankar Pasupathy
Pages: 143-154
doi>10.1145/1736020.1736038
Full text: PDF

Computer systems often fail due to many factors such as software bugs or administrator errors. Diagnosing such production run failures is an important but challenging task since it is difficult to reproduce them in house due to various reasons: (1) unavailability ... expand
Analyzing multicore dumps to facilitate concurrency bug reproduction
Dasarath Weeratunge, Xiangyu Zhang, Suresh Jagannathan
Pages: 155-166
doi>10.1145/1736020.1736039
Full text: PDF

Debugging concurrent programs is difficult. This is primarily because the inherent non-determinism that arises because of scheduler interleavings makes it hard to easily reproduce bugs that may manifest only under certain interleavings. The problem is ... expand
A randomized scheduler with probabilistic guarantees of finding bugs
Sebastian Burckhardt, Pravesh Kothari, Madanlal Musuvathi, Santosh Nagarakatte
Pages: 167-178
doi>10.1145/1736020.1736040
Full text: PDF

This paper presents a randomized scheduler for finding concurrency bugs. Like current stress-testing methods, it repeatedly runs a given test program with supplied inputs. However, it improves on stress-testing by finding buggy schedules more effectively ... expand
ConMem: detecting severe concurrency bugs through an effect-oriented approach
Wei Zhang, Chong Sun, Shan Lu
Pages: 179-192
doi>10.1145/1736020.1736041
Full text: PDF

Multicore technology is making concurrent programs increasingly pervasive. Unfortunately, it is difficult to deliver reliable concurrent programs, because of the huge and non-deterministic interleaving space. In reality, without the resources to thoroughly ... expand
SESSION: Hardware power and energy
Characterizing processor thermal behavior
Francisco Javier Mesa-Martinez, Ehsan K. Ardestani, Jose Renau
Pages: 193-204
doi>10.1145/1736020.1736043
Full text: PDF

Temperature is a dominant factor in the performance, reliability, and leakage power consumption of modern processors. As a result, increasing numbers of researchers evaluate thermal characteristics in their proposals. In this paper, we measure a real ... expand
Conservation cores: reducing the energy of mature computations
Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, Michael Bedford Taylor
Pages: 205-218
doi>10.1145/1736020.1736044
Full text: PDF

Growing transistor counts, limited power budgets, and the breakdown of voltage scaling are currently conspiring to create a utilization wall that limits the fraction of a chip that can run at full speed at one time. In this regime, specialized, ... expand
Micro-pages: increasing DRAM efficiency with locality-aware data placement
Kshitij Sudan, Niladrish Chatterjee, David Nellans, Manu Awasthi, Rajeev Balasubramonian, Al Davis
Pages: 219-230
doi>10.1145/1736020.1736045
Full text: PDF

Power consumption and DRAM latencies are serious concerns in modern chip-multiprocessor (CMP or multi-core) based compute systems. The management of the DRAM row buffer can significantly impact both power consumption and latency. Modern DRAM systems ... expand
SESSION: Data centers
Power routing: dynamic power provisioning in the data center
Steven Pelley, David Meisner, Pooya Zandevakili, Thomas F. Wenisch, Jack Underwood
Pages: 231-242
doi>10.1145/1736020.1736047
Full text: PDF

Data center power infrastructure incurs massive capital costs, which typically exceed energy costs over the life of the facility. To squeeze maximum value from the infrastructure, researchers have proposed over-subscribing power circuits, relying on ... expand
Joint optimization of idle and cooling power in data centers while maintaining response time
Faraz Ahmad, T. N. Vijaykumar
Pages: 243-256
doi>10.1145/1736020.1736048
Full text: PDF

Server power and cooling power amount to a significant fraction of modern data centers' recurring costs. While data centers provision enough servers to guarantee response times under the maximum loading, data centers operate under much less loading most ... expand
SESSION: Hardware monitoring
Butterfly analysis: adapting dataflow analysis to dynamic parallel monitoring
Michelle L. Goodstein, Evangelos Vlachos, Shimin Chen, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry
Pages: 257-270
doi>10.1145/1736020.1736050
Full text: PDF

Online program monitoring is an effective technique for detecting bugs and security attacks in running applications. Extending these tools to monitor parallel programs is challenging because the tools must account for inter-thread dependences and relaxed ... expand
ParaLog: enabling and accelerating online parallel monitoring of multithreaded applications
Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Babak Falsafi, Phillip B. Gibbons, Todd C. Mowry
Pages: 271-284
doi>10.1145/1736020.1736051
Full text: PDF

Instruction-grain lifeguards monitor the events of a running application at the level of individual instructions in order to identify and help mitigate application bugs and security exploits. Because such lifeguards impose a 10-100X slowdown on ... expand
SESSION: Parallel programming 2
MacroSS: macro-SIMDization of streaming applications
Amir H. Hormati, Yoonseo Choi, Mark Woh, Manjunath Kudlur, Rodric Rabbah, Trevor Mudge, Scott Mahlke
Pages: 285-296
doi>10.1145/1736020.1736053
Full text: PDF

SIMD (Single Instruction, Multiple Data) engines are an essential part of the processors in various computing markets, from servers to the embedded domain. Although SIMD-enabled architectures have the capability of boosting the performance of many application ... expand
COMPASS: a programmable data prefetcher using idle GPU shaders
Dong Hyuk Woo, Hsien-Hsin S. Lee
Pages: 297-310
doi>10.1145/1736020.1736054
Full text: PDF

A traditional fixed-function graphics accelerator has evolved into a programmable general-purpose graphics processing unit over the last few years. These powerful computing cores are mainly used for accelerating graphics applications or enabling low-cost ... expand

ifficult to extract high-performance. The software stacks for such systems typically consist of multiple layers and include managed runtime environments (MREs). In this paper, we investigate ...
TwinDrivers: semi-automatic derivation of fast and safe hypervisor network drivers from guest OS drivers
Aravind Menon, Simon Schubert, Willy Zwaenepoel
Pages: 301-312
doi>10.1145/1508244.1508279
Full text: Pdf

In a virtualized environment, device drivers are often run inside a virtual machine (VM) rather than in the hypervisor, for reasons of safety and reduction in software engineering effort. Unfortunately, this approach results in poor performance for I/O-intensive ...
SESSION: Architectures
Phantom-BTB: a virtualized branch target buffer design
Ioana Burcea, Andreas Moshovos
Pages: 313-324
doi>10.1145/1508244.1508281
Full text: Pdf

Modern processors use branch target buffers (BTBs) to predict the target address of branches such that they can fetch ahead in the instruction stream increasing concurrency and performance. Ideally, BTBs would be sufficiently large to capture the entire ...
StreamRay: a stream filtering architecture for coherent ray tracing
Karthik Ramani, Christiaan P. Gribble, Al Davis
Pages: 325-336
doi>10.1145/1508244.1508282
Full text: Pdf

The wide availability of commodity graphics processors has made real-time graphics an intrinsic component of the human/computer interface. These graphics cores accelerate the z-buffer algorithm and provide a highly interactive experience at a relatively ...
Architectural support for SWAR text processing with parallel bit streams: the inductive doubling principle
Robert D. Cameron, Dan Lin
Pages: 337-348
doi>10.1145/1508244.1508283
Full text: Pdf

Parallel bit stream algorithms exploit the SWAR (SIMD within a register) capabilities of commodity processors in high-performance text processing applications such as UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression matching. ...

2010	Technology for developing regions: Moore's law is not enough
Eric A. Brewer
Pages: 1-2
doi>10.1145/1736020.1736021
Full text: PDF

The historic focus of development has rightfully been on macroeconomics and good governance, but technology has an increasingly large role to play. In this talk, I review several novel technologies that we have deployed in India and Africa, and discuss ...
SESSION: Novel architectures
Dynamically replicated memory: building reliable systems from nanoscale resistive memories
Engin Ipek, Jeremy Condit, Edmund B. Nightingale, Doug Burger, Thomas Moscibroda
Pages: 3-14
doi>10.1145/1736020.1736023
Full text: PDF

DRAM is facing severe scalability challenges in sub-45nm technology nodes due to precise charge placement and sensing hurdles in deep-submicron geometries. Resistive memories, such as phase-change memory (PCM), already scale well beyond DRAM and ...
A power-efficient all-optical on-chip interconnect using wavelength-based oblivious routing
Nevin Kirman, José F. Martínez
Pages: 15-28
doi>10.1145/1736020.1736024
Full text: PDF

We present an all-optical approach to constructing data networks on chip that combines the following key features: (1) Wavelength-based routing, where the route followed by a packet depends solely on the wavelength of its carrier signal, and not on information ...
SESSION: Compilers and run-time systems
A real system evaluation of hardware atomicity for software speculation
Naveen Neelakantam, David R. Ditzel, Craig Zilles
Pages: 29-38
doi>10.1145/1736020.1736026
Full text: PDF

In this paper we evaluate the atomic region compiler abstraction by incorporating it into a commercial system. We find that atomic regions are simple and intuitive to integrate into an x86 binary-translation system. Furthermore, doing so trivially enables ...
Dynamic filtering: multi-purpose architecture support for language runtime systems
Tim Harris, Saša Tomic, Adrián Cristal, Osman Unsal
Pages: 39-52
doi>10.1145/1736020.1736027
Full text: PDF

This paper introduces a new abstraction to accelerate the read-barriers and write-barriers used by language runtime systems. We exploit the fact that, dynamically, many barrier executions perform checks but no real work -- e.g., in generational garbage ...
SESSION: Parallel programming 1
CoreDet: a compiler and runtime system for deterministic multithreaded execution
Tom Bergan, Owen Anderson, Joseph Devietti, Luis Ceze, Dan Grossman
Pages: 53-64
doi>10.1145/1736020.1736029
Full text: PDF

The behavior of a multithreaded program does not depend only on its inputs. Scheduling, memory reordering, timing, and low-level hardware effects all introduce nondeterminism in the execution of multithreaded programs. This severely complicates many ...
Speculative parallelization using software multi-threaded transactions
Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, David I. August
Pages: 65-76
doi>10.1145/1736020.1736030
Full text: PDF

With the right techniques, multicore architectures may be able to continue the exponential performance trend that elevated the performance of applications of all types for decades. While many scientific programs can be parallelized without speculative ...
Respec: efficient online multiprocessor replay via speculation and external determinism
Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, Jason Flinn
Pages: 77-90
doi>10.1145/1736020.1736031
Full text: PDF

Deterministic replay systems record and reproduce the execution of a hardware or software system. While it is well known how to replay uniprocessor systems, replaying shared memory multiprocessor systems at low overhead on commodity hardware is still ...
SESSION: Scheduling in parallel systems
Probabilistic job symbiosis modeling for SMT processor scheduling
Stijn Eyerman, Lieven Eeckhout
Pages: 91-102
doi>10.1145/1736020.1736033
Full text: PDF

Symbiotic job scheduling boosts simultaneous multithreading (SMT) processor performance by co-scheduling jobs that have 'compatible' demands on the processor's shared resources. Existing approaches however require a sampling phase, evaluate a limited ...
Request behavior variations
Kai Shen
Pages: 103-116
doi>10.1145/1736020.1736034
Full text: PDF

A large number of user requests execute (often concurrently) within a server system. A single request may exhibit fluctuating hardware characteristics (such as instruction completion rate and on-chip resource usage) over the course of its execution, ...
Decoupling contention management from scheduling
F. Ryan Johnson, Radu Stoica, Anastasia Ailamaki, Todd C. Mowry
Pages: 117-128
doi>10.1145/1736020.1736035
Full text: PDF

Many parallel applications exhibit unpredictable communication between threads, leading to contention for shared objects. The choice of contention management strategy impacts strongly the performance and scalability of these applications: spinning provides ...
Addressing shared resource contention in multicore processors via scheduling
Sergey Zhuravlev, Sergey Blagodurov, Alexandra Fedorova
Pages: 129-142
doi>10.1145/1736020.1736036
Full text: PDF

Contention for shared resources on multicore processors remains an unsolved problem in existing systems despite significant research efforts dedicated to this problem in the past. Previous solutions focused primarily on hardware techniques and software ...
SESSION: Software reliability
SherLog: error diagnosis by connecting clues from run-time logs
Ding Yuan, Haohui Mai, Weiwei Xiong, Lin Tan, Yuanyuan Zhou, Shankar Pasupathy
Pages: 143-154
doi>10.1145/1736020.1736038
Full text: PDF

Computer systems often fail due to many factors such as software bugs or administrator errors. Diagnosing such production run failures is an important but challenging task since it is difficult to reproduce them in house due to various reasons: (1) unavailability ...
Analyzing multicore dumps to facilitate concurrency bug reproduction
Dasarath Weeratunge, Xiangyu Zhang, Suresh Jagannathan
Pages: 155-166
doi>10.1145/1736020.1736039
Full text: PDF

Debugging concurrent programs is difficult. This is primarily because the inherent non-determinism that arises because of scheduler interleavings makes it hard to easily reproduce bugs that may manifest only under certain interleavings. The problem is ...
A randomized scheduler with probabilistic guarantees of finding bugs
Sebastian Burckhardt, Pravesh Kothari, Madanlal Musuvathi, Santosh Nagarakatte
Pages: 167-178
doi>10.1145/1736020.1736040
Full text: PDF

This paper presents a randomized scheduler for finding concurrency bugs. Like current stress-testing methods, it repeatedly runs a given test program with supplied inputs. However, it improves on stress-testing by finding buggy schedules more effectively ...
ConMem: detecting severe concurrency bugs through an effect-oriented approach
Wei Zhang, Chong Sun, Shan Lu
Pages: 179-192
doi>10.1145/1736020.1736041
Full text: PDF

Multicore technology is making concurrent programs increasingly pervasive. Unfortunately, it is difficult to deliver reliable concurrent programs, because of the huge and non-deterministic interleaving space. In reality, without the resources to thoroughly ...
SESSION: Hardware power and energy
Characterizing processor thermal behavior
Francisco Javier Mesa-Martinez, Ehsan K. Ardestani, Jose Renau
Pages: 193-204
doi>10.1145/1736020.1736043
Full text: PDF

Temperature is a dominant factor in the performance, reliability, and leakage power consumption of modern processors. As a result, increasing numbers of researchers evaluate thermal characteristics in their proposals. In this paper, we measure a real ...
Conservation cores: reducing the energy of mature computations
Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, Michael Bedford Taylor
Pages: 205-218
doi>10.1145/1736020.1736044
Full text: PDF

Growing transistor counts, limited power budgets, and the breakdown of voltage scaling are currently conspiring to create a utilization wall that limits the fraction of a chip that can run at full speed at one time. In this regime, specialized, ...
Micro-pages: increasing DRAM efficiency with locality-aware data placement
Kshitij Sudan, Niladrish Chatterjee, David Nellans, Manu Awasthi, Rajeev Balasubramonian, Al Davis
Pages: 219-230
doi>10.1145/1736020.1736045
Full text: PDF

Power consumption and DRAM latencies are serious concerns in modern chip-multiprocessor (CMP or multi-core) based compute systems. The management of the DRAM row buffer can significantly impact both power consumption and latency. Modern DRAM systems ...
SESSION: Data centers
Power routing: dynamic power provisioning in the data center
Steven Pelley, David Meisner, Pooya Zandevakili, Thomas F. Wenisch, Jack Underwood
Pages: 231-242
doi>10.1145/1736020.1736047
Full text: PDF

Data center power infrastructure incurs massive capital costs, which typically exceed energy costs over the life of the facility. To squeeze maximum value from the infrastructure, researchers have proposed over-subscribing power circuits, relying on ...
Joint optimization of idle and cooling power in data centers while maintaining response time
Faraz Ahmad, T. N. Vijaykumar
Pages: 243-256
doi>10.1145/1736020.1736048
Full text: PDF

Server power and cooling power amount to a significant fraction of modern data centers' recurring costs. While data centers provision enough servers to guarantee response times under the maximum loading, data centers operate under much less loading most ...
SESSION: Hardware monitoring
Butterfly analysis: adapting dataflow analysis to dynamic parallel monitoring
Michelle L. Goodstein, Evangelos Vlachos, Shimin Chen, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry
Pages: 257-270
doi>10.1145/1736020.1736050
Full text: PDF

Online program monitoring is an effective technique for detecting bugs and security attacks in running applications. Extending these tools to monitor parallel programs is challenging because the tools must account for inter-thread dependences and relaxed ...
ParaLog: enabling and accelerating online parallel monitoring of multithreaded applications
Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Babak Falsafi, Phillip B. Gibbons, Todd C. Mowry
Pages: 271-284
doi>10.1145/1736020.1736051
Full text: PDF

Instruction-grain lifeguards monitor the events of a running application at the level of individual instructions in order to identify and help mitigate application bugs and security exploits. Because such lifeguards impose a 10-100X slowdown on ...
SESSION: Parallel programming 2
MacroSS: macro-SIMDization of streaming applications
Amir H. Hormati, Yoonseo Choi, Mark Woh, Manjunath Kudlur, Rodric Rabbah, Trevor Mudge, Scott Mahlke
Pages: 285-296
doi>10.1145/1736020.1736053
Full text: PDF

SIMD (Single Instruction, Multiple Data) engines are an essential part of the processors in various computing markets, from servers to the embedded domain. Although SIMD-enabled architectures have the capability of boosting the performance of many application ...
COMPASS: a programmable data prefetcher using idle GPU shaders
Dong Hyuk Woo, Hsien-Hsin S. Lee
Pages: 297-310
doi>10.1145/1736020.1736054
Full text: PDF

A traditional fixed-function graphics accelerator has evolved into a programmable general-purpose graphics processing unit over the last few years. These powerful computing cores are mainly used for accelerating graphics applications or enabling low-cost ...
Flexible architectural support for fine-grain scheduling
Daniel Sanchez, Richard M. Yoo, Christos Kozyrakis
Pages: 311-322
doi>10.1145/1736020.1736055
Full text: PDF

To make efficient use of CMPs with tens to hundreds of cores, it is often necessary to exploit fine-grain parallelism. However, managing tasks of a few thousand instructions is particularly challenging, as the runtime must ensure load balance without ...
SESSION: Parallel memory systems
Specifying and dynamically verifying address translation-aware memory consistency
Bogdan F. Romanescu, Alvin R. Lebeck, Daniel J. Sorin
Pages: 323-334
doi>10.1145/1736020.1736057
Full text: PDF

Computer systems with virtual memory are susceptible to design bugs and runtime faults in their address translation (AT) systems. Detecting bugs and faults requires a clear specification of correct behavior. To address this need, we develop a framework ...
Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems
Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, Yale N. Patt
Pages: 335-346
doi>10.1145/1736020.1736058
Full text: PDF

Cores in a chip-multiprocessor (CMP) system share multiple hardware resources in the memory subsystem. If resource sharing is unfair, some applications can be delayed significantly while others are unfairly prioritized. Previous research proposed separate ...
An asymmetric distributed shared memory model for heterogeneous parallel systems
Isaac Gelado, John E. Stone, Javier Cabezas, Sanjay Patel, Nacho Navarro, Wen-mei W. Hwu
Pages: 347-358
doi>10.1145/1736020.1736059
Full text: PDF

Heterogeneous computing combines general purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to ...
Inter-core cooperative TLB for chip multiprocessors
Abhishek Bhattacharjee, Margaret Martonosi
Pages: 359-370
doi>10.1145/1736020.1736060
Full text: PDF

Translation Lookaside Buffers (TLBs) are commonly employed in modern processor designs and have considerable impact on overall system performance. A number of past works have studied TLB designs to lower access times and miss rates, specifically for ...
SESSION: Security and hardware reliability
Orthrus: efficient software integrity protection on multi-cores
Ruirui Huang, Daniel Y. Deng, G. Edward Suh
Pages: 371-384
doi>10.1145/1736020.1736062
Full text: PDF

This paper proposes an efficient hardware/software system that significantly enhances software security through diversified replication on multi-cores. Recent studies show that a large class of software attacks can be detected by running multiple versions ...
Shoestring: probabilistic soft error reliability on the cheap
Shuguang Feng, Shantanu Gupta, Amin Ansari, Scott Mahlke
Pages: 385-396
doi>10.1145/1736020.1736063
Full text: PDF

Aggressive technology scaling provides designers with an ever increasing budget of cheaper and faster transistors. Unfortunately, this trend is accompanied by a decline in individual device reliability as transistors become increasingly susceptible to ...
Virtualized and flexible ECC for main memory
Doe Hyun Yoon, Mattan Erez
Pages: 397-408
doi>10.1145/1736020.1736064
Full text: PDF

We present a general scheme for virtualizing main memory error-correction mechanisms, which map redundant information needed to correct errors into the memory namespace itself. We rely on this basic idea, which increases flexibility to increase error ...

2011	The cloud will change everything
James R. Larus
Pages: 1-2
doi>10.1145/1950365.1950367
Full text: PDF

Cloud computing is fast on its way to becoming a meaningless, oversold marketing slogan. In the midst of this hype, it is easy to overlook the fundamental change that is occurring. Computation, which used to be confined to the machine beside your desk, ...
SESSION: Better logging support for software debugging
Michael Swift
Improving software diagnosability via log enhancement
Ding Yuan, Jing Zheng, Soyeon Park, Yuanyuan Zhou, Stefan Savage
Pages: 3-14
doi>10.1145/1950365.1950369
Full text: PDF

Diagnosing software failures in the field is notoriously difficult, in part due to the fundamental complexity of trouble-shooting any complex software system, but further exacerbated by the paucity of information that is typically available in the production ...
DoublePlay: parallelizing sequential logging and replay
Kaushik Veeraraghavan, Dongyoon Lee, Benjamin Wester, Jessica Ouyang, Peter M. Chen, Jason Flinn, Satish Narayanasamy
Pages: 15-26
doi>10.1145/1950365.1950370
Full text: PDF

Deterministic replay systems record and reproduce the execution of a hardware or software system. In contrast to replaying execution on uniprocessors, deterministic replay on multiprocessors is very challenging to implement efficiently because of the ...
SESSION: Understanding and improving transactional memory
Michael Swift
Hardware acceleration of transactional memory on commodity systems
Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan G. Bronson, Christos Kozyrakis, Kunle Olukotun
Pages: 27-38
doi>10.1145/1950365.1950372
Full text: PDF

The adoption of transactional memory is hindered by the high overhead of software transactional memory and the intrusive design changes required by previously proposed TM hardware. We propose that hardware to accelerate software transactional memory ...
Hybrid NOrec: a case study in the effectiveness of best effort hardware transactional memory
Luke Dalessandro, François Carouge, Sean White, Yossi Lev, Mark Moir, Michael L. Scott, Michael F. Spear
Pages: 39-52
doi>10.1145/1950365.1950373
Full text: PDF

Transactional memory (TM) is a promising synchronization mechanism for the next generation of multicore processors. Best-effort Hardware Transactional Memory (HTM) designs, such as Sun's prototype Rock processor and AMD's proposed Advanced Synchronization ...
SESSION: Innovations in memory ordering models for parallel machines
James Laudon
Efficient processor support for DRFx, a memory model with exceptions
Abhayendra Singh, Daniel Marino, Satish Narayanasamy, Todd Millstein, Madan Musuvathi
Pages: 53-66
doi>10.1145/1950365.1950375
Full text: PDF

A longstanding challenge of shared-memory concurrency is to provide a memory model that allows for efficient implementation while providing strong and simple guarantees to programmers. The C++0x and Java memory models admit a wide variety of compiler ...
RCDC: a relaxed consistency deterministic computer
Joseph Devietti, Jacob Nelson, Tom Bergan, Luis Ceze, Dan Grossman
Pages: 67-78
doi>10.1145/1950365.1950376
Full text: PDF

Providing deterministic execution significantly simplifies the debugging, testing, replication, and deployment of multithreaded programs. Recent work has developed deterministic multiprocessor architectures as well as compiler and runtime systems that ...
Specifying and checking semantic atomicity for multithreaded programs
Jacob Burnim, George Necula, Koushik Sen
Pages: 79-90
doi>10.1145/1950365.1950377
Full text: PDF

In practice, it is quite difficult to write correct multithreaded programs due to the potential for unintended and nondeterministic interference between parallel threads. A fundamental correctness property for such programs is atomicity -- a block of ...
SESSION: Programming for persistent memory
Thomas F. Wenisch
Mnemosyne: lightweight persistent memory
Haris Volos, Andres Jaan Tack, Michael M. Swift
Pages: 91-104
doi>10.1145/1950365.1950379
Full text: PDF

New storage-class memory (SCM) technologies, such as phase-change memory, STT-RAM, and memristors, promise user-level access to non-volatile storage through regular memory instructions. These memory devices enable fast user-mode access to persistence, ...
NV-Heaps: making persistent objects fast and safe with next-generation, non-volatile memories
Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, Steven Swanson
Pages: 105-118
doi>10.1145/1950365.1950380
Full text: PDF

Persistent, user-defined objects present an attractive abstraction for working with non-volatile program state. However, the slow speed of persistent storage (i.e., disk) has restricted their design and limited their performance. Fast, byte-addressable, ...
SESSION: Enhancing device driver reliability
Yuanyuan Zhou
A declarative language approach to device configuration
Adrian Schüpbach, Andrew Baumann, Timothy Roscoe, Simon Peter
Pages: 119-132
doi>10.1145/1950365.1950382
Full text: PDF

C remains the language of choice for hardware programming (device drivers, bus configuration, etc.): it is fast, allows low-level access, and is trusted by OS developers. However, the algorithms required to configure and reconfigure hardware devices ...
Improved device driver reliability through hardware verification reuse
Leonid Ryzhyk, John Keys, Balachandra Mirla, Arun Raghunath, Mona Vij, Gernot Heiser
Pages: 133-144
doi>10.1145/1950365.1950383
Full text: PDF

Faulty device drivers are a major source of operating system failures. We argue that the underlying cause of many driver faults is the separation of two highly-related tasks: device verification and driver development. These two tasks have a lot in common, ...
SESSION: Novel computing platforms
Luis Ceze
A case for neuromorphic ISAs
Atif Hashmi, Andrew Nere, James Jamal Thomas, Mikko Lipasti
Pages: 145-158
doi>10.1145/1950365.1950385
Full text: PDF

The desire to create novel computing systems, paired with recent advances in neuroscientific understanding of the brain, has led researchers to develop neuromorphic architectures that emulate the brain. To date, such models are developed, trained, and ...
Mementos: system support for long-running computation on RFID-scale devices
Benjamin Ransford, Jacob Sorber, Kevin Fu
Pages: 159-170
doi>10.1145/1950365.1950386
Full text: PDF

Transiently powered computing devices such as RFID tags, kinetic energy harvesters, and smart cards typically rely on programs that complete a task under tight time constraints before energy starvation leads to complete loss of volatile memory. Mementos ...
Pocket cloudlets
Emmanouil Koukoumidis, Dimitrios Lymberopoulos, Karin Strauss, Jie Liu, Doug Burger
Pages: 171-184
doi>10.1145/1950365.1950387
Full text: PDF

Cloud services accessed through mobile devices suffer from high network access latencies and are constrained by energy budgets dictated by the devices' batteries. Radio and battery technologies will improve over time, but are still expected to be the ...
SESSION: Saving power and energy
Jim Larus
Blink: managing server clusters on intermittent power
Navin Sharma, Sean Barker, David Irwin, Prashant Shenoy
Pages: 185-198
doi>10.1145/1950365.1950389
Full text: PDF

Reducing the energy footprint of data centers continues to receive significant attention due to both its financial and environmental impact. There are numerous methods that limit the impact of both factors, such as expanding the use of renewable energy ...
Dynamic knobs for responsive power-aware computing
Henry Hoffmann, Stelios Sidiroglou, Michael Carbin, Sasa Misailovic, Anant Agarwal, Martin Rinard
Pages: 199-212
doi>10.1145/1950365.1950390
Full text: PDF

We present PowerDial, a system for dynamically adapting application behavior to execute successfully in the face of load and power fluctuations. PowerDial transforms static configuration parameters into dynamic knobs that the PowerDial control system ...
Flikker: saving DRAM refresh-power through critical data partitioning
Song Liu, Karthik Pattabiraman, Thomas Moscibroda, Benjamin G. Zorn
Pages: 213-224
doi>10.1145/1950365.1950391
Full text: PDF

Energy has become a first-class design constraint in computer systems. Memory is a significant contributor to total system power. This paper introduces Flikker, an application-level technique to reduce refresh power in DRAM memories. Flikker enables ...
MemScale: active low-power modes for main memory
Qingyuan Deng, David Meisner, Luiz Ramos, Thomas F. Wenisch, Ricardo Bianchini
Pages: 225-238
doi>10.1145/1950365.1950392
Full text: PDF

Main memory is responsible for a large and increasing fraction of the energy consumed by servers. Prior work has focused on exploiting DRAM low-power states to conserve energy. However, these states require entire DRAM ranks to be idled, which is difficult ...
SESSION: Recognizing software and concurrency bugs
Emery Berger
2ndStrike: toward manifesting hidden concurrency typestate bugs
Qi Gao, Wenbin Zhang, Zhezhe Chen, Mai Zheng, Feng Qin
Pages: 239-250
doi>10.1145/1950365.1950394
Full text: PDF

Concurrency bugs are becoming increasingly prevalent in the multi-core era. Recently, much research has focused on data races and atomicity violation bugs, which are related to low-level memory accesses. However, a large number of concurrency typestate ...
ConSeq: detecting concurrency bugs through sequential errors
Wei Zhang, Junghee Lim, Ramya Olichandran, Joel Scherpelz, Guoliang Jin, Shan Lu, Thomas Reps
Pages: 251-264
doi>10.1145/1950365.1950395
Full text: PDF

Concurrency bugs are caused by non-deterministic interleavings between shared memory accesses. Their effects propagate through data and control dependences until they cause software to crash, hang, produce incorrect output, etc. The lifecycle of a bug ...
S2E: a platform for in-vivo multi-path analysis of software systems
Vitaly Chipounov, Volodymyr Kuznetsov, George Candea
Pages: 265-278
doi>10.1145/1950365.1950396
Full text: PDF

This paper presents S2E, a platform for analyzing the properties and behavior of software systems. We demonstrate S2E's use in developing practical tools for comprehensive performance profiling, reverse engineering of proprietary software, and bug finding ...
SESSION: Rethinking and protecting operating systems
Orran Krieger
Ensuring operating system kernel integrity with OSck
Owen S. Hofmann, Alan M. Dunn, Sangman Kim, Indrajit Roy, Emmett Witchel
Pages: 279-290
doi>10.1145/1950365.1950398
Full text: PDF

Kernel rootkits that modify operating system state to avoid detection are a dangerous threat to system security. This paper presents OSck, a system that discovers kernel rootkits by detecting malicious modifications to operating system data. OSck integrates ...
Rethinking the library OS from the top down
Donald E. Porter, Silas Boyd-Wickizer, Jon Howell, Reuben Olinsky, Galen C. Hunt
Pages: 291-304
doi>10.1145/1950365.1950399
Full text: PDF

This paper revisits an old approach to operating system construction, the library OS, in a new context. The idea of the library OS is that the personality of the OS on which an application depends runs in the address space of the application. A small, ...
SESSION: Learning from the past: drawing conclusions from extensive measurement studies
Orran Krieger
Faults in Linux: ten years later
Nicolas Palix, Gaël Thomas, Suman Saha, Christophe Calvès, Julia Lawall, Gilles Muller
Pages: 305-318
doi>10.1145/1950365.1950401
Full text: PDF

In 2001, Chou et al. published a study of faults found by applying a static analyzer to Linux versions 1.0 through 2.4.1. A major result of their work was that the drivers directory contained up to 7 times more of certain kinds of faults than other directories. ...
Looking back on the language and hardware revolutions: measured power, performance, and scaling
Hadi Esmaeilzadeh, Ting Cao, Yang Xi, Stephen M. Blackburn, Kathryn S. McKinley
Pages: 319-332
doi>10.1145/1950365.1950402
Full text: PDF

This paper reports and analyzes measured chip power and performance on five process technology generations executing 61 diverse benchmarks with a rigorous methodology. We measure representative Intel IA32 processors with technologies ranging from 130nm ...
SESSION: New compiler optimizations
Scott Mahlke
Synthesizing concurrent schedulers for irregular algorithms
Donald Nguyen, Keshav Pingali
Pages: 333-344
doi>10.1145/1950365.1950404
Full text: PDF

Scheduling is the assignment of tasks or activities to processors for execution, and it is an important concern in parallel programming. Most prior work on scheduling has focused either on static scheduling of applications in which the dependence graph ...
Exploring circuit timing-aware language and compilation
Giang Hoang, Robby Bruce Findler, Russ Joseph
Pages: 345-356
doi>10.1145/1950365.1950405
Full text: PDF

By adjusting the design of the ISA and enabling circuit timing-sensitive optimizations in a compiler, we can more effectively exploit timing speculation. While there has been growing interest in systems that leverage circuit-level timing speculation ...
Orchestration by approximation: mapping stream programs onto multicore architectures
Sardar M. Farhad, Yousun Ko, Bernd Burgstaller, Bernhard Scholz
Pages: 357-368
doi>10.1145/1950365.1950406
Full text: PDF

We present a novel 2-approximation algorithm for deploying stream graphs on multicore computers and a stream graph transformation that eliminates bottlenecks. The key technical insight is a data rate transfer model that enables the computation of a "closed ...
SESSION: Exploiting parallelism on GPUs
Kunle Olukotun
On-the-fly elimination of dynamic irregularities for GPU computing
Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, Xipeng Shen
Pages: 369-380
doi>10.1145/1950365.1950408
Full text: PDF

The power-efficient massively parallel Graphics Processing Units (GPUs) have become increasingly influential for general-purpose computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control ...
Sponge: portable stream programming on graphics engines
Amir H. Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, Scott Mahlke
Pages: 381-392
doi>10.1145/1950365.1950409
Full text: PDF

Graphics processing units (GPUs) provide a low cost platform for accelerating high performance computations. The introduction of new programming languages, such as CUDA and OpenCL, makes GPU programming attractive to a wide variety of programmers. However, ...
SESSION: Novel performance improvements
Kunle Olukotun
Inter-core prefetching for multicore processors using migrating helper threads
Md Kamruzzaman, Steven Swanson, Dean M. Tullsen
Pages: 393-404
doi>10.1145/1950365.1950411
Full text: PDF

Multicore processors have become ubiquitous in today's systems, but exploiting the parallelism they offer remains difficult, especially for legacy applications and applications with large serial components. The challenge, then, is to develop techniques ...
Improving the performance of trace-based systems by false loop filtering
Hiroshige Hayashizaki, Peng Wu, Hiroshi Inoue, Mauricio J. Serrano, Toshio Nakatani
Pages: 405-418
doi>10.1145/1950365.1950412
Full text: PDF

Trace-based compilation is a promising technique for language compilers and binary translators. It offers the potential to expand the compilation scopes that have traditionally been limited by method boundaries. Detecting repeating cyclic execution paths ...