| Cited by |
Paper title |
Year |
| 1501 |
McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. |
2009 |
| 941 |
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. |
2006 |
| 603 |
An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget. |
2006 |
| 552 |
Die Stacking (3D) Microarchitecture. |
2006 |
| 470 |
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. |
2009 |
| 465 |
Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling. |
2009 |
| 449 |
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. |
2007 |
| 420 |
Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. |
2007 |
| 370 |
LIFT: A Low-Overhead Practical Information Flow Tracking System for Detecting Security Attacks. |
2006 |
| 365 |
Flattened Butterfly Topology for On-Chip Networks. |
2007 |
| 358 |
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. |
2006 |
| 358 |
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. |
2007 |
| 335 |
Characterizing flash memory: anomalies, observations, and applications. |
2009 |
| 334 |
Fair Queuing Memory Systems. |
2006 |
| 334 |
Flip-N-Write: a simple deterministic technique to improve PRAM write performance, energy and endurance. |
2009 |
| 321 |
Neural Acceleration for General-Purpose Approximate Programs. |
2012 |
| 302 |
Leveraging Optical Technology in Future Bus-based Chip Multiprocessors. |
2006 |
| 299 |
ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip Routers. |
2006 |
| 296 |
Into the wild: studying real user activity patterns to guide power optimizations for mobile architectures. |
2009 |
| 294 |
Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. |
2010 |
| 285 |
Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations. |
2011 |
| 282 |
Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management. |
2006 |
| 278 |
Automatic Thread Extraction with Decoupled Software Pipelining. |
2005 |
| 253 |
Improving GPU performance via large warps and two-level warp scheduling. |
2011 |
| 247 |
Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding. |
2007 |
| 241 |
Argus: Low-Cost, Comprehensive Error Detection in Simple Cores. |
2007 |
| 237 |
ASR: Adaptive Selective Replication for CMP Caches. |
2006 |
| 211 |
A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance. |
2005 |
| 211 |
Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach. |
2008 |
| 210 |
Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? |
2010 |
| 210 |
Moneta: A High-Performance Storage Array Architecture for Next-Generation, Non-volatile Memories. |
2010 |
| 209 |
Architectural Support for Software Transactional Memory. |
2006 |
| 206 |
Cache-Conscious Wavefront Scheduling. |
2012 |
| 203 |
Mini-rank: Adaptive DRAM architecture for improving memory power efficiency. |
2008 |
| 197 |
A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs. |
2007 |
| 192 |
Composable Lightweight Processors. |
2007 |
| 185 |
Facelift: Hiding and slowing down aging in multicores. |
2008 |
| 184 |
Penelope: The NBTI-Aware Processor. |
2007 |
| 180 |
Distributed Microarchitectural Protocols in the TRIPS Prototype Processor. |
2006 |
| 179 |
Revisiting the Sequential Programming Model for Multi-Core. |
2007 |
| 172 |
Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs. |
2007 |
| 170 |
Multi retention level STT-RAM cache designs with a dynamic refresh scheme. |
2011 |
| 169 |
DaDianNao: A Machine-Learning Supercomputer. |
2014 |
| 166 |
Reducing memory interference in multicore systems via application-aware memory channel partitioning. |
2011 |
| 165 |
Application-aware prioritization mechanisms for on-chip networks. |
2009 |
| 159 |
Reunion: Complexity-Effective Multicore Redundancy. |
2006 |
| 157 |
FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators. |
2007 |
| 156 |
Stream Programming on General-Purpose Processors. |
2005 |
| 153 |
Prefetch-Aware DRAM Controllers. |
2008 |
| 148 |
Low-cost router microarchitecture for on-chip networks. |
2009 |
| 148 |
SCARAB: a single cycle adaptive routing and bufferless network. |
2009 |
| 147 |
Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. |
2008 |
| 146 |
The ZCache: Decoupling Ways and Associativity. |
2010 |
| 146 |
Pack&Cap: adaptive DVFS and thread packing under power caps. |
2011 |
| 143 |
Understanding the Energy Consumption of Dynamic Random Access Memories. |
2010 |
| 139 |
A tagless coherence directory. |
2009 |
| 137 |
In-Network Cache Coherence. |
2006 |
| 137 |
Copy or Discard execution model for speculative parallelization on multicores. |
2008 |
| 136 |
Implementing Signatures for Transactional Memory. |
2007 |
| 136 |
Approximate storage in solid-state memories. |
2013 |
| 135 |
SAFER: Stuck-At-Fault Error Recovery for Memories. |
2010 |
| 134 |
Improving cache lifetime reliability at ultra-low voltages. |
2009 |
| 130 |
A Predictive Performance Model for Superscalar Processors. |
2006 |
| 129 |
A novel cache architecture with enhanced performance and security. |
2008 |
| 129 |
Coordinated control of multiple prefetchers in multi-core systems. |
2009 |
| 129 |
SHiP: signature-based hit predictor for high performance caching. |
2011 |
| 128 |
Mitigating the Impact of Process Variations on Processor Register Files and Execution Units. |
2006 |
| 127 |
A Framework for Providing Quality of Service in Chip Multi-Processors. |
2007 |
| 125 |
Transactional Memory Architecture and Implementation for IBM System Z. |
2012 |
| 123 |
Efficiently enabling conventional block sizes for very large die-stacked DRAM caches. |
2011 |
| 123 |
Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design. |
2012 |
| 120 |
Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches. |
2009 |
| 118 |
EazyHTM: eager-lazy hardware transactional memory. |
2009 |
| 118 |
Characterizing and mitigating the impact of process variations on phase change based memory systems. |
2009 |
| 115 |
A Mechanism for Online Diagnosis of Hard Faults in Microprocessors. |
2005 |
| 115 |
Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems. |
2009 |
| 113 |
Reducing the harmful effects of last-level cache polluters with an OS-level, software-only pollute buffer. |
2008 |
| 113 |
Token flow control. |
2008 |
| 113 |
Preemptive virtual clock: a flexible, efficient, and cost-effective QOS scheme for networks-on-chip. |
2009 |
| 112 |
Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing. |
2007 |
| 112 |
QsCores: trading dark silicon for scalable energy efficiency with quasi-specific cores. |
2011 |
| 111 |
CoScale: Coordinating CPU and Memory System DVFS in Server Systems. |
2012 |
| 111 |
SAGE: self-tuning approximation for graphics engines. |
2013 |
| 110 |
Yield-Aware Cache Architectures. |
2006 |
| 108 |
Process Variation Tolerant 3T1D-Based Cache Architectures. |
2007 |
| 107 |
Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation. |
2007 |
| 107 |
Quality programmable vector processors for approximate computing. |
2013 |
| 106 |
Sampling Dead Block Prediction for Last-Level Caches. |
2010 |
| 105 |
Light speed arbitration and flow control for nanophotonic interconnects. |
2009 |
| 104 |
Dependence-aware transactional memory for increased concurrency. |
2008 |
| 104 |
Parallel application memory scheduling. |
2011 |
| 102 |
Self-calibrating Online Wearout Detection. |
2007 |
| 99 |
Leveraging 3D Technology for Improved Reliability. |
2007 |
| 99 |
Complexity effective memory access scheduling for many-core accelerator architectures. |
2009 |
| 99 |
Composite Cores: Pushing Heterogeneity Into a Core. |
2012 |
| 97 |
Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor. |
2005 |
| 96 |
From SODA to scotch: The evolution of a wireless baseband processor. |
2008 |
| 96 |
Task Superscalar: An Out-of-Order Task Pipeline. |
2010 |
| 96 |
Active management of timing guardband to save energy in POWER7. |
2011 |
| 94 |
A case for dynamic frequency tuning in on-chip networks. |
2009 |
| 94 |
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications. |
2010 |
| 92 |
Adaptive Caches: Effective Shaping of Cache Behavior to Workloads. |
2006 |
| 92 |
Using Address Independent Seed Encryption and Bonsai Merkle Trees to Make Secure Processors OS- and Performance-Friendly. |
2007 |
| 91 |
The TM3270 Media-Processor. |
2005 |
| 91 |
EVAL: Utilizing processors with variation-induced timing errors. |
2008 |
| 90 |
Memory Prefetching Using Adaptive Stream Detection. |
2006 |
| 90 |
Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence. |
2008 |
| 90 |
Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies. |
2010 |
| 89 |
Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications. |
2009 |
| 88 |
SD3: A Scalable Approach to Dynamic Data-Dependence Profiling. |
2010 |
| 87 |
Minimalist open-page: a DRAM page-mode scheduling policy for the many-core era. |
2011 |
| 86 |
Bundled execution of recurring traces for energy-efficient general purpose processing. |
2011 |
| 85 |
Efficient unicast and multicast support for CMPs. |
2008 |
| 84 |
The BubbleWrap many-core: popping cores for sequential acceleration. |
2009 |
| 83 |
Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory. |
2010 |
| 82 |
Throughput-Effective On-Chip Networks for Manycore Accelerators. |
2010 |
| 81 |
Finding concurrency bugs with context-aware communication graphs. |
2009 |
| 80 |
The StageNet fabric for constructing resilient multicore systems. |
2008 |
| 80 |
Meet the walkers: accelerating index traversals for in-memory databases. |
2013 |
| 78 |
mSWAT: low-cost hardware fault detection and diagnosis for multicore systems. |
2009 |
| 78 |
PACMan: prefetch-aware cache management for high performance caching. |
2011 |
| 78 |
Kiln: closing the performance gap between systems with and without persistence support. |
2013 |
| 77 |
A Quantum Logic Array Microarchitecture: Scalable Quantum Data Movement and Computation. |
2005 |
| 77 |
Low Vccmin fault-tolerant cache with highly predictable performance. |
2009 |
| 77 |
ZerehCache: armoring cache architectures in high defect density technologies. |
2009 |
| 76 |
Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers. |
2006 |
| 75 |
Microarchitectural Design Space Exploration Using an Architecture-Centric Approach. |
2007 |
| 75 |
Pay-As-You-Go: low-overhead hard-error correction for phase change memories. |
2011 |
| 74 |
Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy. |
2009 |
| 73 |
Adaptive line placement with the set balancing cache. |
2009 |
| 73 |
Improving Cache Management Policies Using Dynamic Reuse Distances. |
2012 |
| 72 |
KnightShift: Scaling the Energy Proportionality Wall through Server-Level Heterogeneity. |
2012 |
| 70 |
NoRD: Node-Router Decoupling for Effective Power-gating of On-Chip Routers. |
2012 |
| 70 |
Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor. |
2012 |
| 69 |
Scalable Store-Load Forwarding via Store Queue Index Prediction. |
2005 |
| 69 |
Improving memory bank-level parallelism in the presence of prefetching. |
2009 |
| 69 |
Divergence-aware warp scheduling. |
2013 |
| 68 |
RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization. |
2013 |
| 67 |
ASF: AMD64 Extension for Lock-Free Data Structures and Transactional Memory. |
2010 |
| 67 |
SIMD re-convergence at thread frontiers. |
2011 |
| 66 |
NoSQ: Store-Load Communication without a Store Queue. |
2006 |
| 66 |
Power reduction of CMP communication networks via RF-interconnects. |
2008 |
| 66 |
Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation. |
2012 |
| 64 |
The Cell Processor Architecture. |
2005 |
| 63 |
Tribeca: design for PVT variations with local recovery and fine-grained adaptation. |
2009 |
| 63 |
A Dynamically Adaptable Hardware Transactional Memory. |
2010 |
| 62 |
A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy. |
2007 |
| 61 |
Shader Performance Analysis on a Modern GPU Architecture. |
2005 |
| 61 |
Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures. |
2007 |
| 61 |
Preventing PCM banks from seizing too much power. |
2011 |
| 60 |
Molecular Caches: A caching structure for dynamic creation of application-specific Heterogeneous cache regions. |
2006 |
| 60 |
CPR: Composable performance regression for scalable multiprocessor models. |
2008 |
| 60 |
Notary: Hardware techniques to enhance signatures. |
2008 |
| 60 |
ESKIMO: Energy savings using Semantic Knowledge of Inconsequential Memory Occupancy for DRAM subsystem. |
2009 |
| 60 |
Hardware transactional memory for GPU architectures. |
2011 |
| 60 |
Predicting Performance Impact of DVFS for Realistic Memory Systems. |
2012 |
| 59 |
Coherence Ordering for Ring-based Chip Multiprocessors. |
2006 |
| 59 |
Scavenger: A New Last Level Cache Architecture with Global Block Priority. |
2007 |
| 59 |
Power to the people: Leveraging human physiological traits to control microprocessor frequency. |
2008 |
| 59 |
MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP. |
2012 |
| 59 |
Heterogeneous system coherence for integrated CPU-GPU systems. |
2013 |
| 58 |
Fire-and-Forget: Load/Store Scheduling with No Store Queue at All. |
2006 |
| 58 |
Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication. |
2011 |
| 58 |
Architectural support for secure virtualization under a vulnerable hypervisor. |
2011 |
| 57 |
Pinot: Speculative Multi-threading Processor Architecture Exploiting Parallelism over a Wide Range of Granularities. |
2005 |
| 57 |
Fairness and Throughput in Switch on Event Multithreading. |
2006 |
| 56 |
SHARP control: controlled shared cache management in chip multiprocessors. |
2009 |
| 55 |
Thermal Management of On-Chip Caches Through Power Density Minimization. |
2005 |
| 55 |
Proactive transaction scheduling for contention management. |
2009 |
| 54 |
Portable compiler optimisation across embedded programs and microarchitectures using machine learning. |
2009 |
| 54 |
NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures. |
2012 |
| 54 |
CoLT: Coalesced Large-Reach TLBs. |
2012 |
| 53 |
A locality-aware memory hierarchy for energy-efficient GPU architectures. |
2013 |
| 52 |
Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution. |
2005 |
| 52 |
Improving Region Selection in Dynamic Optimization Systems. |
2005 |
| 52 |
Emulating Optimal Replacement with a Shepherd Cache. |
2007 |
| 52 |
Online design bug detection: RTL analysis, flexible mechanisms, and evaluation. |
2008 |
| 52 |
ReMAP: A Reconfigurable Heterogeneous Multicore Architecture. |
2010 |
| 52 |
Spatiotemporal Coherence Tracking. |
2012 |
| 51 |
Reconfigurable energy efficient near threshold cache architectures. |
2008 |
| 51 |
Adaptive Cache Management for Energy-Efficient GPU Computing. |
2014 |
| 50 |
Scalable Cache Miss Handling for High Memory-Level Parallelism. |
2006 |
| 49 |
Address-Indexed Memory Disambiguation and Store-to-Load Forwarding. |
2005 |
| 49 |
Tradeoffs in designing accelerator architectures for visual computing. |
2008 |
| 49 |
Toward a multicore architecture for real-time ray-tracing. |
2008 |
| 49 |
Execution leases: a hardware-supported mechanism for enforcing strong non-interference. |
2009 |
| 49 |
In-network coherence filtering: snoopy coherence without broadcasts. |
2009 |
| 49 |
BulkCompiler: high-performance sequential consistency through cooperative compiler and hardware support. |
2009 |
| 49 |
Combating Aging with the Colt Duty Cycle Equalizer. |
2010 |
| 49 |
A Predictive Model for Dynamic Microarchitectural Adaptivity Control. |
2010 |
| 49 |
Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache. |
2014 |
| 48 |
ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing. |
2005 |
| 48 |
CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs. |
2006 |
| 48 |
Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-Guided Thread Scheduling. |
2010 |
| 48 |
A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch. |
2012 |
| 47 |
Dynamic Standby Prediction for Leakage Tolerant Microprocessor Functional Units. |
2006 |
| 47 |
NBTI tolerant microarchitecture design in the presence of process variation. |
2008 |
| 47 |
Flexible and Efficient Instruction-Grained Run-Time Monitoring Using On-Chip Reconfigurable Fabric. |
2010 |
| 47 |
A compile-time managed multi-level register file hierarchy. |
2011 |
| 47 |
Linearly compressed pages: a low-complexity, low-latency main memory compression framework. |
2013 |
| 45 |
Continuous Path and Edge Profiling. |
2005 |
| 45 |
Token tenure: PATCHing token counting using directory-based cache coherence. |
2008 |
| 45 |
Memory Latency Reduction via Thread Throttling. |
2010 |
| 45 |
Fractal Coherence: Scalably Verifiable Cache Coherence. |
2010 |
| 45 |
A new case for the TAGE branch predictor. |
2011 |
| 45 |
FPB: Fine-grained Power Budgeting to Improve Write Throughput of Multi-level Cell Phase Change Memory. |
2012 |
| 45 |
Decoupled compressed cache: exploiting spatial locality for energy-optimized compressed caching. |
2013 |
| 44 |
Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware. |
2006 |
| 44 |
Temporal instruction fetch streaming. |
2008 |
| 44 |
Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy. |
2012 |
| 44 |
SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers. |
2014 |
| 43 |
Reducing peak power with a table-driven adaptive processor core. |
2009 |
| 42 |
"Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense. |
2005 |
| 42 |
Offline symbolic analysis for multi-processor execution replay. |
2009 |
| 42 |
Efficient Selection of Vector Instructions Using Dynamic Programming. |
2010 |
| 42 |
A resistive TCAM accelerator for data-intensive computing. |
2011 |
| 42 |
Rethinking DRAM Power Modes for Energy Proportionality. |
2012 |
| 42 |
FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems. |
2014 |
| 41 |
Adaptive data compression for high-performance low-power on-chip networks. |
2008 |
| 41 |
Low-power, high-performance analog neural branch prediction. |
2008 |
| 41 |
An hybrid eDRAM/SRAM macrocell to implement first-level data caches. |
2009 |
| 40 |
A Criticality Analysis of Clustering in Superscalar Processors. |
2005 |
| 40 |
Optimizing shared cache behavior of chip multiprocessors. |
2009 |
| 40 |
Multiple clock and voltage domains for chip multi processors. |
2009 |
| 40 |
Probabilistic Distance-Based Arbitration: Providing Equality of Service for Many-Core CMPs. |
2010 |
| 40 |
AtomTracker: A Comprehensive Approach to Atomic Region Inference and Violation Detection. |
2010 |
| 40 |
Dataflow execution of sequential imperative programs on multicore architectures. |
2011 |
| 40 |
Proactive instruction fetch. |
2011 |
| 40 |
Managing GPU Concurrency in Heterogeneous Architectures. |
2014 |
| 40 |
Load Value Approximation. |
2014 |
| 39 |
NOC-Out: Microarchitecting a Scale-Out Processor. |
2012 |
| 38 |
Cherry-MP: Correctly Integrating Checkpointed Early Resource Recycling in Chip Multiprocessors. |
2005 |
| 38 |
Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors. |
2007 |
| 38 |
Adaptive Flow Control for Robust Performance and Energy. |
2010 |
| 38 |
Register Cache System Not for Latency Reduction Purpose. |
2010 |
| 38 |
Linearizing irregular memory accesses for improved correlated prefetching. |
2013 |
| 37 |
Automatic Parallelization in a Binary Rewriter. |
2010 |
| 37 |
Parichute: Generalized Turbocode-Based Error Correction for Near-Threshold Caches. |
2010 |
| 37 |
Large-reach memory management unit caches. |
2013 |
| 37 |
Multi-grain coherence directories. |
2013 |
| 37 |
Iso-X: A Flexible Architecture for Hardware-Managed Isolated Execution. |
2014 |
| 37 |
CAMEO: A Two-Level Memory Organization with Capacity of Main Memory and Flexibility of Hardware-Managed Cache. |
2014 |
| 36 |
Dataflow Predication. |
2006 |
| 36 |
Support for High-Frequency Streaming in CMPs. |
2006 |
| 36 |
Impact of Cache Coherence Protocols on the Processing of Network Traffic. |
2007 |
| 36 |
Architecting a chunk-based memory race recorder in modern CMPs. |
2009 |
| 36 |
The application slowdown model: quantifying and controlling the impact of inter-application interference at shared caches and main memory. |
2015 |
| 35 |
Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns. |
2005 |
| 34 |
uComplexity: Estimating Processor Design Effort. |
2005 |
| 34 |
Store Memory-Level Parallelism Optimizations for Commercial Applications. |
2005 |
| 34 |
Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications. |
2007 |
| 34 |
Informed Microarchitecture Design Space Exploration Using Workload Dynamics. |
2007 |
| 34 |
Guaranteeing Hits to Improve the Efficiency of a Small Instruction Cache. |
2007 |
| 34 |
Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs. |
2008 |
| 34 |
Encore: low-cost, fine-grained transient fault recovery. |
2011 |
| 33 |
DDT: design and evaluation of a dynamic program analysis for optimizing data structure usage. |
2009 |
| 33 |
PPEP: Online Performance, Power, and Energy Prediction Framework and DVFS Space Exploration. |
2014 |
| 32 |
Characterizing the resource-sharing levels in the UltraSPARC T2 processor. |
2009 |
| 32 |
Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors. |
2010 |
| 32 |
Packet chaining: efficient single-cycle allocation for on-chip networks. |
2011 |
| 32 |
Accurate Fine-Grained Processor Power Proxies. |
2012 |
| 32 |
Warped gates: gating aware scheduling and power gating for GPGPUs. |
2013 |
| 31 |
LOFT: A High Performance Network-on-Chip Providing Quality-of-Service Support. |
2010 |
| 31 |
System-level integrated server architectures for scale-out datacenters. |
2011 |
| 31 |
Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization. |
2012 |
| 31 |
A Practical Methodology for Measuring the Side-Channel Signal Available to the Attacker for Instruction-Level Events. |
2014 |
| 31 |
Random Fill Cache Architecture. |
2014 |
| 30 |
Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System. |
2005 |
| 30 |
PathExpander: Architectural Support for Increasing the Path Coverage of Dynamic Bug Detection. |
2006 |
| 30 |
Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access. |
2012 |
| 30 |
Quantifying the relationship between the power delivery network and architectural policies in a 3D-stacked memory device. |
2013 |
| 30 |
Transparent Hardware Management of Stacked DRAM as Part of Memory. |
2014 |
| 30 |
PORPLE: An Extensible Optimizer for Portable Data Placement on GPU. |
2014 |
| 29 |
How to Fake 1000 Registers. |
2005 |
| 29 |
Microarchitecture soft error vulnerability characterization and mitigation under 3D integration technology. |
2008 |
| 29 |
Scalable Speculative Parallelization on Commodity Clusters. |
2010 |
| 29 |
Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically. |
2012 |
| 29 |
Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. |
2014 |
| 28 |
Authentication Control Point and Its Implications For Secure Processor Design. |
2006 |
| 28 |
The Art of Deception: Adaptive Precision Reduction for Area Efficient Physics Acceleration. |
2007 |
| 28 |
A small cache of large ranges: Hardware methods for efficiently searching, storing, and updating big dataflow tags. |
2008 |
| 28 |
A performance-correctness explicitly-decoupled architecture. |
2008 |
| 28 |
Light64: lightweight hardware support for data race detection during systematic testing of parallel programs. |
2009 |
| 27 |
Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths. |
2006 |
| 27 |
Tolerating Concurrency Bugs Using Transactions as Lifeguards. |
2010 |
| 27 |
Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks. |
2010 |
| 27 |
Accelerating microprocessor silicon validation by exposing ISA diversity. |
2011 |
| 27 |
CoreRacer: a practical memory race recorder for multicore x86 TSO processors. |
2011 |
| 27 |
Formally enhanced runtime verification to ensure NoC functional correctness. |
2011 |
| 27 |
Residue cache: a low-energy low-area L2 cache architecture via compression and partial hits. |
2011 |
| 27 |
Designing a Programmable Wire-Speed Regular-Expression Matching Accelerator. |
2012 |
| 27 |
Warped-DMR: Light-weight Error Detection for GPGPU. |
2012 |
| 27 |
Protean Code: Achieving Near-Free Online Code Transformations for Warehouse Scale Computers. |
2014 |
| 26 |
Merging Head and Tail Duplication for Convergent Hyperblock Formation. |
2006 |
| 26 |
Shapeshifter: Dynamically changing pipeline width and speed to address process variations. |
2008 |
| 26 |
Control flow obfuscation with information flow tracking. |
2009 |
| 26 |
Ordering decoupled metadata accesses in multiprocessors. |
2009 |
| 26 |
STEM: Spatiotemporal Management of Capacity for Intra-core Last Level Caches. |
2010 |
| 26 |
Insertion and promotion for tree-based PseudoLRU last-level caches. |
2013 |
| 26 |
Trace based phase prediction for tightly-coupled heterogeneous cores. |
2013 |
| 26 |
Locality-Aware Mapping of Nested Parallel Patterns on GPUs. |
2014 |
| 26 |
CC-Hunter: Uncovering Covert Timing Channels on Shared Processor Hardware. |
2014 |
| 26 |
Gather-scatter DRAM: in-DRAM address translation to improve the spatial locality of non-unit strided accesses. |
2015 |
| 25 |
Exploiting Vector Parallelism in Software Pipelined Loops. |
2005 |
| 25 |
Variation-tolerant non-uniform 3D cache management in die stacked multicore processor. |
2009 |
| 25 |
Dynamic Reconfiguration of 3D Photonic Networks-on-Chip for Maximizing Performance and Improving Fault Tolerance. |
2012 |
| 25 |
AUDIT: Stress Testing the Automatic Way. |
2012 |
| 25 |
The reuse cache: downsizing the shared last-level cache. |
2013 |
| 25 |
Bi-Modal DRAM Cache: Improving Hit Rate, Hit Latency and Bandwidth. |
2014 |
| 24 |
Effective Optimistic-Checker Tandem Core Design through Architectural Pruning. |
2007 |
| 24 |
AVF Stressmark: Towards an Automated Methodology for Bounding the Worst-Case Vulnerability to Soft Errors. |
2010 |
| 24 |
Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors. |
2010 |
| 24 |
FeatherWeight: low-cost optical arbitration with QoS support. |
2011 |
| 24 |
Enabling datacenter servers to scale out economically and sustainably. |
2013 |
| 24 |
uDIREC: unified diagnosis and reconfiguration for frugal bypass of NoC faults. |
2013 |
| 24 |
TLC: a tag-less cache for reducing dynamic first level cache energy. |
2013 |
| 23 |
ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment. |
2010 |
| 22 |
Adaptive and Speculative Slack Simulations of CMPs on CMPs. |
2010 |
| 22 |
Hardware Support for Relaxed Concurrency Control in Transactional Memory. |
2010 |
| 22 |
Idempotent processor architecture. |
2011 |
| 22 |
Identifying and predicting timing-critical instructions to boost timing speculation. |
2011 |
| 21 |
Balancing Resource Utilization to Mitigate Power Density in Processor Pipelines. |
2005 |
| 21 |
A microarchitecture-based framework for pre- and post-silicon power delivery analysis. |
2009 |
| 21 |
Addressing End-to-End Memory Access Latency in NoC-Based Multicores. |
2012 |
| 20 |
A Floorplan-Aware Dynamic Inductive Noise Controller for Reliable Processor Design. |
2006 |
| 20 |
Global Multi-Threaded Instruction Scheduling. |
2007 |
| 20 |
Erasing Core Boundaries for Robust and Configurable Performance. |
2010 |
| 20 |
RDIP: return-address-stack directed instruction prefetching. |
2013 |
| 20 |
Crank it up or dial it down: coordinated multiprocessor frequency and folding control. |
2013 |
| 20 |
Skewed Compressed Caches. |
2014 |
| 20 |
ThyNVM: enabling software-transparent crash consistency in persistent memory systems. |
2015 |
| 19 |
Doppelgänger: a cache for approximate computing. |
2015 |
| 19 |
Verification of chip multiprocessor memory systems using a relaxed scoreboard. |
2008 |
| 19 |
Implementing high availability memory with a duplication cache. |
2008 |
| 19 |
Evaluating the effects of cache redundancy on profit. |
2008 |
| 19 |
Architectural Support for Fair Reader-Writer Locking. |
2010 |
| 19 |
The NoX router. |
2011 |
| 19 |
Vector Extensions for Decision Support DBMS Acceleration. |
2012 |
| 19 |
Systematic Energy Characterization of CMP/SMT Processor Systems via Automated Micro-Benchmarks. |
2012 |
| 19 |
Aegis: partitioning data block for efficient recovery of stuck-at-faults in phase change memory. |
2013 |
| 19 |
Use it or lose it: wear-out and lifetime in future chip multiprocessors. |
2013 |
| 19 |
Calculating Architectural Vulnerability Factors for Spatial Multi-Bit Transient Faults. |
2014 |
| 19 |
Enabling Realistic Fine-Grain Voltage Scaling with Reconfigurable Power Distribution Networks. |
2014 |
| 19 |
Futility Scaling: High-Associativity Cache Partitioning. |
2014 |
| 19 |
Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution. |
2014 |
| 18 |
DMDC: Delayed Memory Dependence Checking through Age-Based Filtering. |
2006 |
| 18 |
Virtually Pipelined Network Memory. |
2006 |
| 18 |
Strategies for mapping dataflow blocks to distributed hardware. |
2008 |
| 18 |
Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels. |
2010 |
| 18 |
Resilient microring resonator based photonic networks. |
2011 |
| 18 |
Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists. |
2014 |
| 18 |
Neural acceleration for GPU throughput processors. |
2015 |
| 18 |
Jump over ASLR: Attacking branch predictors to bypass ASLR. |
2016 |
| 17 |
Manager-client pairing: a framework for implementing coherence hierarchies. |
2011 |
| 16 |
Serialization-Aware Mini-Graphs: Performance with Fewer Resources. |
2006 |
| 16 |
Time Interpolation: So Many Metrics, So Few Registers. |
2007 |
| 16 |
PipeCheck: Specifying and Verifying Microarchitectural Enforcement of Memory Consistency Models. |
2014 |
| 16 |
Harnessing Soft Computations for Low-Budget Fault Tolerance. |
2014 |
| 16 |
Large pages and lightweight memory management in virtualized environments: can you have it both ways? |
2015 |
| 15 |
Testudo: Heavyweight security analysis via statistical sampling. |
2008 |
| 15 |
SHARK: Architectural support for autonomic protection against stealth by rootkit exploits. |
2008 |
| 15 |
A systematic methodology to develop resilient cache coherence protocols. |
2011 |
| 15 |
A data layout optimization framework for NUCA-based multicores. |
2011 |
| 15 |
Inferred Models for Dynamic and Sparse Hardware-Software Spaces. |
2012 |
| 15 |
Libra: Tailoring SIMD Execution Using Heterogeneous Hardware and Dynamic Configurability. |
2012 |
| 15 |
BuMP: Bulk Memory Access Prediction and Streaming. |
2014 |
| 14 |
CCICheck: using µhb graphs to verify the coherence-consistency interface. |
2015 |
| 14 |
Reducing Instruction Fetch Cost by Packing Instructions into Register Windows. |
2005 |
| 14 |
Optimal versus Heuristic Global Code Scheduling. |
2007 |
| 14 |
InstantCheck: Checking the Determinism of Parallel Programs Using On-the-Fly Incremental Hashing. |
2010 |
| 14 |
Virtual Snooping: Filtering Snoops in Virtualized Multi-cores. |
2010 |
| 14 |
SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads. |
2012 |
| 14 |
SHIFT: shared history instruction fetch for lean-core server processors. |
2013 |
| 14 |
Voltage Noise in Multi-Core Processors: Empirical Characterization and Optimization Opportunities. |
2014 |
| 14 |
PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research. |
2014 |
| 13 |
Using a configurable processor generator for computer architecture prototyping. |
2009 |
| 13 |
POWER7 multi-core processor design. |
2009 |
| 13 |
Energy efficient GPU transactional memory via space-time optimizations. |
2013 |
| 13 |
Imbalanced cache partitioning for balanced data-parallel programs. |
2013 |
| 13 |
Citadel: Efficiently Protecting Stacked Memory from Large Granularity Failures. |
2014 |
| 13 |
Multi-GPU System Design with Memory Networks. |
2014 |
| 13 |
Arbitrary Modulus Indexing. |
2014 |
| 13 |
Efficient persist barriers for multicores. |
2015 |
| 12 |
A register-file approach for row buffer caches in die-stacked DRAMs. |
2011 |
| 12 |
NoC Architectures for Silicon Interposer Systems: Why Pay for more Wires when you Can Get them (from your interposer) for Free? |
2014 |
| 12 |
Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures. |
2014 |
| 12 |
Architectural Specialization for Inter-Iteration Loop Dependence Patterns. |
2014 |
| 12 |
Free launch: optimizing GPU dynamic kernel launches through thread reuse. |
2015 |
| 12 |
Enabling interposer-based disintegration of multi-core processors. |
2015 |
| 11 |
Using Branch Correlation to Identify Infeasible Paths for Anomaly Detection. |
2006 |
| 11 |
Memory Protection through Dynamic Access Control. |
2006 |
| 11 |
Complementing user-level coarse-grain parallelism with implicit speculative parallelism. |
2011 |
| 11 |
The Performance Vulnerability of Architectural and Non-architectural Arrays to Permanent Faults. |
2012 |
| 11 |
DESC: energy-efficient data exchange using synchronized counters. |
2013 |
| 11 |
Efficient multiprogramming for multicores with SCAF. |
2013 |
| 11 |
B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors. |
2014 |
| 11 |
Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance. |
2015 |
| 11 |
Efficiently prefetching complex address patterns. |
2015 |
| 10 |
The Future Evolution of High-Performance Microprocessors. |
2005 |
| 10 |
Efficient Use of Invisible Registers in Thumb Code. |
2005 |
| 10 |
Tree register allocation. |
2009 |
| 10 |
MLP-aware dynamic instruction window resizing for adaptively exploiting both ILP and MLP. |
2013 |
| 10 |
Micro-Sliced Virtual Processors to Hide the Effect of Discontinuous CPU Availability for Consolidated Systems. |
2014 |
| 10 |
Avoiding information leakage in the memory controller with fixed service policies. |
2015 |
| 10 |
A scalable architecture for ordered parallelism. |
2015 |
| 10 |
A cloud-scale acceleration architecture. |
2016 |
| 9 |
A distributed processor state management architecture for large-window processors. |
2008 |
| 9 |
ATDetector: improving the accuracy of a commercial data race detector by identifying address transfer. |
2011 |
| 9 |
Predicting Coherence Communication by Tracking Synchronization Points at Run Time. |
2012 |
| 9 |
Efficient management of last-level caches in graphics processors for 3D scene rendering workloads. |
2013 |
| 9 |
Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency. |
2013 |
| 9 |
Hi-Rise: A High-Radix Switch for 3D Integration with Single-Cycle Arbitration. |
2014 |
| 9 |
RpStacks: Fast and Accurate Processor Design Space Exploration Using Representative Stall-Event Stacks. |
2014 |
| 9 |
Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems. |
2015 |
| 9 |
Fast support for unstructured data processing: the unified automata processor. |
2015 |
| 9 |
Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. |
2015 |
| 8 |
Wavelength stealing: an opportunistic approach to channel sharing in multi-chip photonic interconnects. |
2013 |
| 8 |
Dodec: Random-Link, Low-Radix On-Chip Networks. |
2014 |
| 8 |
Using ECC Feedback to Guide Voltage Speculation in Low-Voltage Processors. |
2014 |
| 8 |
GPU register file virtualization. |
2015 |
| 8 |
Neuromorphic accelerators: a comparison between neuroscience and machine-learning approaches. |
2015 |
| 8 |
Coherence domain restriction on large scale systems. |
2015 |
| 8 |
Efficient GPU synchronization without scopes: saying no to complex consistency models. |
2015 |
| 8 |
Rubik: fast analytical power management for latency-critical systems. |
2015 |
| 8 |
Delegated persist ordering. |
2016 |
| 7 |
Control-Flow Decoupling. |
2012 |
| 7 |
A Front-End Execution Architecture for High Energy Efficiency. |
2014 |
| 7 |
Short-Circuiting Memory Traffic in Handheld Platforms. |
2014 |
| 7 |
Execution Drafting: Energy Efficiency through Computation Deduplication. |
2014 |
| 7 |
Improving DRAM latency with dynamic asymmetric subarray. |
2015 |
| 7 |
The inner most loop iteration counter: a new dimension in branch history. |
2015 |
| 7 |
TimeTrader: exploiting latency tail to save datacenter energy for online search. |
2015 |
| 7 |
Fork path: improving efficiency of ORAM by removing redundant memory accesses. |
2015 |
| 7 |
IMP: indirect memory prefetcher. |
2015 |
| 7 |
Stripes: Bit-serial deep neural network computing. |
2016 |
| 6 |
Incremental Commit Groups for Non-Atomic Trace Processing. |
2005 |
| 6 |
Architecture-aware automatic computation offload for native applications. |
2015 |
| 6 |
Border control: sandboxing accelerators. |
2015 |
| 6 |
Microarchitectural implications of event-driven server-side web applications. |
2015 |
| 6 |
Efficient warp execution in presence of divergence with collaborative context collection. |
2015 |
| 6 |
Characterizing, modeling, and improving the QoE of mobile devices with low battery level. |
2015 |
| 5 |
Data-Dependency Graph Transformations for Superblock Scheduling. |
2006 |
| 5 |
TransCom: transforming stream communication for load balance and efficiency in networks-on-chip. |
2011 |
| 5 |
Kernel Partitioning of Streaming Applications: A Statistical Approach to an NP-complete Problem. |
2012 |
| 5 |
Compiler Support for Optimizing Memory Bank-Level Parallelism. |
2014 |
| 5 |
Wormhole: Wisely Predicting Multidimensional Branches. |
2014 |
| 5 |
Loop-Aware Memory Prefetching Using Code Block Working Sets. |
2014 |
| 5 |
The CRISP performance model for dynamic voltage and frequency scaling in a GPGPU. |
2015 |
| 5 |
An integrated concurrency and core-ISA architectural envelope definition, and test oracle, for IBM POWER multiprocessors. |
2015 |
| 5 |
Prediction-guided performance-energy trade-off for interactive applications. |
2015 |
| 5 |
Continuous runahead: Transparent hardware acceleration for memory intensive workloads. |
2016 |
| 5 |
Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency. |
2016 |
| 4 |
Why design must change: rethinking digital design. |
2009 |
| 4 |
GPUMech: GPU Performance Modeling Technique Based on Interval Analysis. |
2014 |
| 4 |
Safe limits on voltage reduction efficiency in GPUs: a direct measurement approach. |
2015 |
| 4 |
A fast and accurate analytical technique to compute the AVF of sequential bits in a processor. |
2015 |
| 4 |
Efficiently enforcing strong memory ordering in GPUs. |
2015 |
| 4 |
Authenticache: harnessing cache ECC for system authentication. |
2015 |
| 4 |
Execution time prediction for energy-efficient hardware accelerators. |
2015 |
| 4 |
Low-cost soft error resilience with unified data verification and fine-grained recovery for acoustic sensor based detection. |
2016 |
| 4 |
Graphicionado: A high-performance and energy-efficient accelerator for graph analytics. |
2016 |
| 4 |
Co-designing accelerators and SoC interfaces using gem5-Aladdin. |
2016 |
| 4 |
Improving bank-level parallelism for irregular applications. |
2016 |
| 4 |
KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism. |
2016 |
| 3 |
SMARQ: Software-Managed Alias Register Queue for Dynamic Optimizations. |
2012 |
| 3 |
Virtually-aged sampling DMR: unifying circuit failure prediction and circuit failure detection. |
2013 |
| 3 |
DeSC: decoupled supply-compute communication management for heterogeneous architectures. |
2015 |
| 3 |
HyComp: a hybrid cache compression method for selection of data-type-specific compression methods. |
2015 |
| 3 |
Locking down insecure indirection with hardware-based control-data isolation. |
2015 |
| 3 |
Modeling the implications of DRAM failures and protection techniques on datacenter TCO. |
2015 |
| 3 |
More is less: improving the energy efficiency of data movement via opportunistic use of sparse codes. |
2015 |
| 3 |
Fused-layer CNN accelerators. |
2016 |
| 3 |
Towards efficient server architecture for virtualized network function deployment: Implications and implementations. |
2016 |
| 3 |
Racer: TSO consistency via race detection. |
2016 |
| 2 |
Architectures and algorithms for millisecond-scale molecular dynamics simulations of proteins. |
2008 |
| 2 |
CRAM: coded registers for amplified multiporting. |
2011 |
| 2 |
Allocating rotating registers by scheduling. |
2013 |
| 2 |
Implicit-storing and redundant-encoding-of-attribute information in error-correction-codes. |
2013 |
| 2 |
Specializing Compiler Optimizations through Programmable Composition for Dense Matrix Computations. |
2014 |
| 2 |
Continuous, Low Overhead, Run-Time Validation of Program Executions. |
2014 |
| 2 |
Bias-Free Branch Predictor. |
2014 |
| 2 |
Bungee jumps: accelerating indirect branches through HW/SW co-design. |
2015 |
| 2 |
Adaptive guardband scheduling to improve system-level efficiency of the POWER7+. |
2015 |
| 2 |
MORC: a manycore-oriented compressed cache. |
2015 |
| 2 |
CLEAN-ECC: high reliability ECC for adaptive granularity memory system. |
2015 |
| 2 |
DynaMOS: dynamic schedule migration for heterogeneous cores. |
2015 |
| 2 |
Self-contained, accurate precomputation prefetching. |
2015 |
| 2 |
Confluence: unified instruction supply for scale-out servers. |
2015 |
| 2 |
Filtered runahead execution with a runahead buffer. |
2015 |
| 2 |
SABRes: Atomic object reads for in-memory rack-scale computing. |
2016 |
| 2 |
Cambricon-X: An accelerator for sparse neural networks. |
2016 |
| 2 |
Efficient kernel synthesis for performance portable programming. |
2016 |
| 2 |
Chainsaw: Von-Neumann accelerators to leverage fused instruction chains. |
2016 |
| 2 |
Bridging the I/O performance gap for big data workloads: A new NVDIMM-based approach. |
2016 |
| 2 |
Spectral profiling: Observer-effect-free profiling by monitoring EM emanations. |
2016 |
| 2 |
From high-level deep neural models to FPGAs. |
2016 |
| 1 |
Microarchitecture in the system-level integration era. |
2008 |
| 1 |
BulkCommit: scalable and fast commit of atomic blocks in a lazy multiprocessor environment. |
2013 |
| 1 |
COMP: Compiler Optimizations for Manycore Processors. |
2014 |
| 1 |
SAWS: synchronization aware GPGPU warp scheduling for multiple independent warp schedulers. |
2015 |
| 1 |
vCache: architectural support for transparent and isolated virtual LLCs in virtualized environments. |
2015 |
| 1 |
WarpPool: sharing requests with inter-warp coalescing for throughput processors. |
2015 |
| 1 |
Enabling portable energy efficiency with memory accelerated library. |
2015 |
| 1 |
DCS: a fast and scalable device-centric server architecture. |
2015 |
| 1 |
Long term parking (LTP): criticality-aware resource allocation in OOO processors. |
2015 |
| 1 |
A unified memory network architecture for in-memory computing in commodity servers. |
2016 |
| 1 |
Path confidence based lookahead prefetching. |
2016 |
| 1 |
Ti-states: Processor power management in the temperature inversion region. |
2016 |
| 1 |
Cache-emulated register file: An integrated on-chip memory architecture for high performance GPGPUs. |
2016 |
| 1 |
Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems. |
2016 |
| 1 |
An ultra low-power hardware accelerator for automatic speech recognition. |
2016 |
| 1 |
HARE: Hardware accelerator for regular expressions. |
2016 |
| 1 |
Evaluating programmable architectures for imaging and vision applications. |
2016 |
| 1 |
Lazy release consistency for GPUs. |
2016 |
| 1 |
Continuous shape shifting: Enabling loop co-optimization via near-free dynamic code rewriting. |
2016 |
| 1 |
Quantifying and improving the efficiency of hardware-based mobile malware detectors. |
2016 |
| 1 |
vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. |
2016 |
| 1 |
Perceptron learning for reuse prediction. |
2016 |
| 1 |
NEUTRAMS: Neural network transformation and co-design under neuromorphic hardware constraints. |
2016 |
| 1 |
C3D: Mitigating the NUMA bottleneck via coherent DRAM caches. |
2016 |
| 1 |
OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures. |
2016 |
| 0 |
Message from the General Chairs. |
2005 |
| 0 |
Message from the Program Co-Chairs. |
2005 |
| 0 |
Control flow coalescing on a hybrid dataflow/von Neumann GPGPU. |
2015 |
| 0 |
Ultra-low power render-based collision detection for CPU/GPU systems. |
2015 |
| 0 |
Snatch: Opportunistically reassigning power allocation between processor and memory in 3D stacks. |
2016 |
| 0 |
pTask: A smart prefetching scheme for OS intensive applications. |
2016 |
| 0 |
MIMD synchronization on SIMT architectures. |
2016 |
| 0 |
Redefining QoS and customizing the power management policy to satisfy individual mobile users. |
2016 |
| 0 |
Contention-based congestion management in large-scale networks. |
2016 |
| 0 |
PoisonIvy: Safe speculation for secure memory. |
2016 |
| 0 |
The Bunker Cache for spatio-value approximation. |
2016 |
| 0 |
Register sharing for equality prediction. |
2016 |
| 0 |
CrystalBall: Statically analyzing runtime behavior via deep sequence learning. |
2016 |
| 0 |
ReplayConfusion: Detecting cache-based covert channel attacks using record and replay. |
2016 |
| 0 |
Dynamic error mitigation in NoCs using intelligent prediction techniques. |
2016 |
| 0 |
Zorua: A holistic approach to resource virtualization in GPUs. |
2016 |
| 0 |
A patch memory system for image processing and computer vision. |
2016 |
| 0 |
Improving energy efficiency of DRAM by exploiting half page row access. |
2016 |
| 0 |
Efficient data supply for hardware accelerators with prefetching and access/execute decoupling. |
2016 |
| 0 |
The microarchitecture of a real-time robot motion planning accelerator. |
2016 |
| 0 |
CANDY: Enabling coherent DRAM caches for multi-node systems. |
2016 |
| 0 |
GRAPE: Minimizing energy for GPU applications with performance requirements. |
2016 |
| 0 |
Exploiting semantic commutativity in hardware speculation. |
2016 |
| 0 |
Dictionary sharing: An efficient cache compression scheme for compressed caches. |
2016 |
| 0 |
Data-centric execution of speculative parallel programs. |
2016 |
| 0 |
NeSC: Self-virtualizing nested storage controller. |
2016 |
| 0 |
Reducing data movement energy via online data clustering and encoding. |
2016 |
| 0 |
Keynotes: Internet of Things: History and hype, technology and policy. |
2016 |
| 0 |
Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation. |
2016 |