| Cited by |
Paper title |
Year |
| 1235 |
Amdahl’s Law in the multicore era. |
2008 |
| 1022 |
Evaluating MapReduce for Multi-core and Multiprocessor Systems. |
2007 |
| 770 |
LogTM: log-based transactional memory. |
2006 |
| 617 |
System level analysis of fast, per-core DVFS using on-chip switching regulators. |
2008 |
| 616 |
Unbounded Transactional Memory. |
2005 |
| 589 |
Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture. |
2005 |
| 499 |
Power Efficient Processor Architecture and The Cell Processor. |
2005 |
| 395 |
The Soft Error Problem: An Architectural Perspective. |
2005 |
| 386 |
Graphite: A distributed parallel simulator for multicores. |
2010 |
| 373 |
LogTM-SE: Decoupling Hardware Transactional Memory from Caches. |
2007 |
| 348 |
A novel architecture of the 3D stacked MRAM L2 cache for CMPs. |
2009 |
| 336 |
Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. |
2008 |
| 318 |
ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers. |
2010 |
| 315 |
Regional congestion awareness for load balance in networks-on-chip. |
2008 |
| 266 |
Dynamic power-performance adaptation of parallel computation on chip multiprocessors. |
2006 |
| 262 |
Relaxing non-volatility for fast and energy-efficient STT-RAM caches. |
2011 |
| 243 |
Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. |
2007 |
| 231 |
CMP network-on-chip overlaid with multi-band RF-interconnect. |
2008 |
| 229 |
Improving read performance of Phase Change Memories via Write Cancellation and Write Pausing. |
2010 |
| 229 |
A quantitative performance analysis model for GPU architectures. |
2011 |
| 213 |
An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth. |
2010 |
| 211 |
Cluster-level feedback power control for performance optimization. |
2008 |
| 206 |
BigDataBench: A big data benchmark suite from internet services. |
2014 |
| 198 |
Chip Multithreading: Opportunities and Challenges. |
2005 |
| 191 |
High performance network virtualization with SR-IOV. |
2010 |
| 190 |
Concurrent Direct Network Access for Virtual Machine Monitors. |
2007 |
| 183 |
Performance, Energy, and Thermal Considerations for SMT and CMP Architectures. |
2005 |
| 183 |
Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs. |
2009 |
| 177 |
SafeMem: Exploiting ECC-Memory for Detecting Memory Leaks and Memory Corruption During Production Runs. |
2005 |
| 177 |
Dynamically Specialized Datapaths for energy efficient computing. |
2011 |
| 176 |
BulletProof: a defect-tolerant CMP switch architecture. |
2006 |
| 170 |
Express Cube Topologies for on-Chip Interconnects. |
2009 |
| 169 |
CMP design space exploration subject to physical constraints. |
2006 |
| 166 |
Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing. |
2011 |
| 164 |
Construction and use of linear regression models for processor performance analysis. |
2006 |
| 159 |
Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors. |
2007 |
| 154 |
Thread block compaction for efficient SIMT control flow. |
2011 |
| 147 |
A Scalable, Non-blocking Approach to Transactional Memory. |
2007 |
| 144 |
Application-Level Correctness and its Impact on Fault Tolerance. |
2007 |
| 143 |
Last level cache (LLC) performance of data mining workloads on a CMP - a case study of parallel bioinformatics workloads. |
2006 |
| 143 |
FlexiShare: Channel sharing for an energy-efficient nanophotonic crossbar. |
2010 |
| 141 |
An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors. |
2007 |
| 140 |
I-CASH: Intelligently Coupled Array of SSD and HDD. |
2011 |
| 138 |
FlexiTaint: A programmable accelerator for dynamic taint propagation. |
2008 |
| 137 |
HARD: Hardware-Assisted Lockset-based Race Detection. |
2007 |
| 136 |
A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks. |
2005 |
| 136 |
FREE-p: Protecting non-volatile memory against both hard and soft errors. |
2011 |
| 132 |
A comprehensive approach to DRAM power management. |
2008 |
| 131 |
Operating system support for overlapping-ISA heterogeneous multi-core architectures. |
2010 |
| 131 |
CHIPPER: A low-complexity bufferless deflection router. |
2011 |
| 130 |
Technology comparison for large last-level caches (L3Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM. |
2013 |
| 127 |
Retention-aware placement in DRAM (RAPID): software methods for quasi-non-volatile DRAM. |
2006 |
| 125 |
The common case transactional behavior of multithreaded programs. |
2006 |
| 125 |
A Burst Scheduling Access Reordering Mechanism. |
2007 |
| 123 |
Adaptive Spill-Receive for robust high-performance caching in CMPs. |
2009 |
| 120 |
Application performance modeling in a virtualized environment. |
2010 |
| 119 |
Variation-aware dynamic voltage/frequency scaling. |
2009 |
| 117 |
Transition Phase Classification and Prediction. |
2005 |
| 117 |
Computational sprinting. |
2012 |
| 115 |
Scalable architectural support for trusted software. |
2010 |
| 113 |
Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications. |
2007 |
| 112 |
Phase characterization for power: evaluating control-flow-based and event-counter-based techniques. |
2006 |
| 112 |
A Hybrid solid-state storage architecture for the performance, energy consumption, and lifetime improvement. |
2010 |
| 111 |
Characterizing and Comparing Prevailing Simulation Techniques. |
2005 |
| 110 |
Cuckoo directory: A scalable directory for many-core systems. |
2011 |
| 110 |
Beyond block I/O: Rethinking traditional storage primitives. |
2011 |
| 109 |
Improving write operations in MLC phase change memory. |
2012 |
| 107 |
C-Oracle: Predictive thermal management for data centers. |
2008 |
| 107 |
Power struggles: Revisiting the RISC vs. CISC debate on contemporary ARM and x86 architectures. |
2013 |
| 106 |
Elastic-buffer flow control for on-chip networks. |
2009 |
| 105 |
Perturbation-based Fault Screening. |
2007 |
| 104 |
Improving Multiple-CMP Systems Using Token Coherence. |
2005 |
| 103 |
Uncovering hidden loop level parallelism in sequential applications. |
2008 |
| 103 |
Designing a processor from the ground up to allow voltage/reliability tradeoffs. |
2010 |
| 103 |
Tiered-latency DRAM: A low latency and low cost DRAM architecture. |
2013 |
| 102 |
MemTracker: Efficient and Programmable Support for Memory Access Monitoring and Debugging. |
2007 |
| 102 |
Balancing DRAM locality and parallelism in shared memory CMP systems. |
2012 |
| 97 |
Illustrative Design Space Studies with Microarchitectural Regression Models. |
2007 |
| 97 |
Compute Caches. |
2017 |
| 95 |
Eliminating microarchitectural dependency from Architectural Vulnerability. |
2009 |
| 95 |
Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches. |
2009 |
| 94 |
Prediction router: Yet another low latency on-chip router architecture. |
2009 |
| 92 |
The case for GPGPU spatial multitasking. |
2012 |
| 91 |
A Performance Comparison of DRAM Memory System Optimizations for SMT Processors. |
2005 |
| 91 |
CORD: cost-effective (and nearly overhead-free) order-recording and data race detection. |
2006 |
| 91 |
Interval simulation: Raising the level of abstraction in architectural simulation. |
2010 |
| 90 |
Checkpointed Early Load Retirement. |
2005 |
| 90 |
CHOP: Adaptive filter-based DRAM caching for CMP server platforms. |
2010 |
| 89 |
PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches. |
2009 |
| 89 |
Accurate microarchitecture-level fault modeling for studying hardware faults. |
2009 |
| 88 |
Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. |
2009 |
| 88 |
SCD: A scalable coherence directory with flexible sharer set encoding. |
2012 |
| 88 |
TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture. |
2012 |
| 86 |
DMA-aware memory energy management. |
2006 |
| 86 |
Exploiting parallelism and structure to accelerate the simulation of chip multi-processors. |
2006 |
| 86 |
ReViveI/O: efficient handling of I/O in highly-available rollback-recovery servers. |
2006 |
| 86 |
A low-radix and low-diameter 3D interconnection network design. |
2009 |
| 86 |
CAMP: A technique to estimate per-structure power at run-time using a few simple parameters. |
2009 |
| 84 |
A Unified Compressed Memory Hierarchy. |
2005 |
| 84 |
Performance and power optimization through data compression in Network-on-Chip architectures. |
2008 |
| 84 |
Blueshift: Designing processors for timing speculation from the ground up. |
2009 |
| 83 |
Towards scalable, energy-efficient, bus-based on-chip networks. |
2010 |
| 81 |
Trends in High-Performance Processors. |
2005 |
| 81 |
Understanding how off-chip memory bandwidth partitioning in Chip Multiprocessors affects system performance. |
2010 |
| 80 |
MISE: Providing performance predictability and improving fairness in shared main memory systems. |
2013 |
| 80 |
Cache coherence for GPU architectures. |
2013 |
| 79 |
Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling. |
2007 |
| 79 |
CPU-assisted GPGPU on fused CPU-GPU architectures. |
2012 |
| 79 |
High-performance and energy-efficient mobile web browsing on big/little systems. |
2013 |
| 78 |
Voltage and Frequency Control With Adaptive Reaction Time in Multiple-Clock-Domain Processors. |
2005 |
| 77 |
Understanding the performance-temperature interactions in disk I/O of server workloads. |
2006 |
| 77 |
High performance file I/O for the Blue Gene/L supercomputer. |
2006 |
| 76 |
A first-order fine-grained multithreaded throughput model. |
2009 |
| 76 |
SolarCore: Solar energy driven multi-core architecture power management. |
2011 |
| 76 |
ESESC: A fast multicore simulator using Time-Based Sampling. |
2013 |
| 75 |
Distributing the Frontend for Temperature Reduction. |
2005 |
| 75 |
HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing. |
2011 |
| 74 |
DeCoR: A Delayed Commit and Rollback mechanism for handling inductive noise in processors. |
2008 |
| 73 |
Bridging the computation gap between programmable processors and hardwired accelerators. |
2009 |
| 73 |
Calvin: Deterministic or not? Free will to choose. |
2011 |
| 73 |
MRPB: Memory request prioritization for massively parallel processors. |
2014 |
| 72 |
Shared last-level TLBs for chip multiprocessors. |
2011 |
| 72 |
Runnemede: An architecture for Ubiquitous High-Performance Computing. |
2013 |
| 72 |
Accelerating write by exploiting PCM asymmetries. |
2013 |
| 68 |
CloudCache: Expanding and shrinking private caches. |
2011 |
| 68 |
Improving DRAM performance by parallelizing refreshes with accesses. |
2014 |
| 67 |
SENSS: Security Enhancement to Symmetric Shared Memory Multiprocessors. |
2005 |
| 66 |
Voltage emergency prediction: Using signatures to reduce operating margins. |
2009 |
| 65 |
A Memory-Level Parallelism Aware Fetch Policy for SMT Processors. |
2007 |
| 65 |
An OS-based alternative to full hardware coherence on tiled CMPs. |
2008 |
| 65 |
Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy. |
2009 |
| 65 |
In-Network Snoop Ordering (INSO): Snoopy coherence on unordered interconnects. |
2009 |
| 65 |
Warped register file: A power efficient register file for GPGPUs. |
2013 |
| 64 |
On the Limits of Leakage Power Reduction in Caches. |
2005 |
| 64 |
Interactions Between Compression and Prefetching in Chip Multiprocessors. |
2007 |
| 64 |
Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines. |
2007 |
| 64 |
Automated microprocessor stressmark generation. |
2008 |
| 63 |
An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing. |
2007 |
| 63 |
Hardware-software integrated approaches to defend against software cache-based side channel attacks. |
2009 |
| 62 |
Addressing system-level trimming issues in on-chip nanophotonic networks. |
2011 |
| 62 |
A case for guarded power gating for multi-core processors. |
2011 |
| 61 |
Stretching the Limits of Clock-Gating Efficiency in Server-Class Processors. |
2005 |
| 61 |
A Small, Fast and Low-Power Register File by Bit-Partitioning. |
2005 |
| 61 |
Mercury: A fast and energy-efficient multi-level cell based Phase Change Memory system. |
2011 |
| 60 |
Practical and secure PCM systems by online detection of malicious write streams. |
2011 |
| 60 |
Whole packet forwarding: Efficient design of fully adaptive routing algorithms for networks-on-chip. |
2012 |
| 59 |
Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications. |
2005 |
| 59 |
Worth their watts? - an empirical study of datacenter servers. |
2010 |
| 59 |
Reducing GPU offload latency via fine-grained CPU-GPU synchronization. |
2013 |
| 58 |
Efficient scrub mechanisms for error-prone emerging memories. |
2012 |
| 57 |
Modeling and Managing Thermal Profiles of Rack-mounted Servers with ThermoStat. |
2007 |
| 56 |
MRR: Enabling fully adaptive multicast routing for CMP interconnection networks. |
2009 |
| 56 |
Versatile prediction and fast estimation of Architectural Vulnerability Factor from processor performance metrics. |
2009 |
| 56 |
Programming the cloud. |
2011 |
| 56 |
Improving GPGPU resource utilization through alternative thread block scheduling. |
2014 |
| 55 |
Application-to-core mapping policies to reduce memory system interference in multi-core systems. |
2013 |
| 54 |
Enterprise IT Trends and Implications for Architecture Research. |
2005 |
| 54 |
Archipelago: A polymorphic cache design for enabling robust near-threshold operation. |
2011 |
| 54 |
Booster: Reactive core acceleration for mitigating the effects of process variation and application imbalance in low-voltage chips. |
2012 |
| 53 |
Microarchitectural Wire Management for Performance and Power in Partitioned Architectures. |
2005 |
| 53 |
An Efficient Programmable 10 Gigabit Ethernet Network Interface Card. |
2005 |
| 53 |
Navigating heterogeneous processors with market mechanisms. |
2013 |
| 52 |
iCFP: Tolerating all-level cache misses in in-order processors. |
2009 |
| 52 |
Overcoming the challenges of crossbar resistive memory architectures. |
2015 |
| 51 |
Architecture support for guest-transparent VM protection from untrusted hypervisor and physical attacks. |
2013 |
| 50 |
A case for Refresh Pausing in DRAM memory systems. |
2013 |
| 50 |
Breaking the on-chip latency barrier using SMART. |
2013 |
| 50 |
Adaptive-latency DRAM: Optimizing DRAM timing for the common-case. |
2015 |
| 49 |
InfoShield: a security architecture for protecting information usage in memory. |
2006 |
| 49 |
Design and implementation of the Blue Gene/P snoop filter. |
2008 |
| 49 |
Design and implementation of software-managed caches for multicores with local memory. |
2009 |
| 49 |
Energy-efficient interconnect via Router Parking. |
2013 |
| 49 |
Power-performance co-optimization of throughput core architecture using resistive memory. |
2013 |
| 49 |
Architecture exploration for ambient energy harvesting nonvolatile processors. |
2015 |
| 48 |
NUcache: An efficient multicore cache organization based on Next-Use distance. |
2011 |
| 48 |
QuickIA: Exploring heterogeneous architectures on real prototypes. |
2012 |
| 48 |
SNNAP: Approximate computing on programmable SoCs via neural acceleration. |
2015 |
| 47 |
Scatter-Add in Data Parallel Architectures. |
2005 |
| 47 |
Thread-safe dynamic binary translation using transactional memory. |
2008 |
| 47 |
Dacota: Post-silicon validation of the memory subsystem in multi-core designs. |
2009 |
| 47 |
Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs. |
2012 |
| 47 |
Optimizing virtual machine scheduling in NUMA multicore systems. |
2013 |
| 47 |
i2WAP: Improving non-volatile cache lifetime by reducing inter- and intra-set write variations. |
2013 |
| 46 |
Adaptive placement and migration policy for an STT-RAM-based hybrid cache. |
2014 |
| 45 |
An approach for implementing efficient superscalar CISC processors. |
2006 |
| 45 |
A new server I/O architecture for high speed networks. |
2011 |
| 44 |
Atomic Coherence: Leveraging nanophotonics to build race-free cache coherence protocols. |
2011 |
| 44 |
Fast thread migration via cache working set prediction. |
2011 |
| 44 |
Dynamically heterogeneous cores through 3D resource pooling. |
2012 |
| 44 |
Enabling distributed generation powered sustainable high-performance data center. |
2013 |
| 44 |
A detailed GPU cache model based on reuse distance theory. |
2014 |
| 43 |
Heat Stroke: Power-Density-Based Denial of Service in SMT. |
2005 |
| 43 |
Coset coding to extend the lifetime of memory. |
2013 |
| 43 |
NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules. |
2015 |
| 42 |
EXCES: External caching in energy saving storage systems. |
2008 |
| 42 |
Simple virtual channel allocation for high throughput and high frequency on-chip routers. |
2010 |
| 42 |
Design, integration and implementation of the DySER hardware accelerator into OpenSPARC. |
2012 |
| 41 |
Colorama: Architectural Support for Data-Centric Synchronization. |
2007 |
| 41 |
MorphCache: A Reconfigurable Adaptive Multi-level Cache hierarchy. |
2011 |
| 41 |
Quasi-nonvolatile SSD: Trading flash memory nonvolatility to improve storage system performance for enterprise applications. |
2012 |
| 41 |
Staged Reads: Mitigating the impact of DRAM writes on DRAM reads. |
2012 |
| 41 |
EnergySmart: Toward energy-efficient manycores for Near-Threshold Computing. |
2013 |
| 41 |
DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture. |
2014 |
| 41 |
Suppressing the Oblivious RAM timing channel while making information leakage and program efficiency trade-offs. |
2014 |
| 41 |
Data retention in MLC NAND flash memory: Characterization, optimization, and recovery. |
2015 |
| 40 |
Accelerating and Adapting Precomputation Threads for Efficient Prefetching. |
2007 |
| 40 |
Improving cache performance using read-write partitioning. |
2014 |
| 40 |
MemZip: Exploring unconventional benefits from memory compression. |
2014 |
| 40 |
Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories. |
2015 |
| 39 |
Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping. |
2007 |
| 39 |
ESP-NUCA: A low-cost adaptive Non-Uniform Cache Architecture. |
2010 |
| 39 |
Dynamic parallelization of JavaScript applications using an ultra-lightweight speculation mechanism. |
2011 |
| 39 |
AgileRegulator: A hybrid voltage regulator scheme redeeming dark silicon for power efficiency in a multicore architecture. |
2012 |
| 39 |
The dual-path execution model for efficient GPU control flow. |
2013 |
| 38 |
A Domain-Specific On-Chip Network Design for Large Scale Cache Systems. |
2007 |
| 38 |
Runtime validation of memory ordering using constraint graph checking. |
2008 |
| 37 |
Efficient complex operators for irregular codes. |
2011 |
| 37 |
System-level implications of disaggregated memory. |
2012 |
| 37 |
Timing channel protection for a shared memory controller. |
2014 |
| 37 |
Supporting x86-64 address translation for 100s of GPU lanes. |
2014 |
| 36 |
Error Detection via Online Checking of Cache Coherence with Token Coherence Signatures. |
2007 |
| 36 |
An intelligent IT infrastructure for the future. |
2009 |
| 36 |
Optimizing Google’s warehouse scale computers: The NUMA experience. |
2013 |
| 36 |
Sonic Millip3De: A massively parallel 3D-stacked accelerator for 3D ultrasound. |
2013 |
| 35 |
A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures. |
2007 |
| 35 |
GPGPU performance and power estimation using machine learning. |
2015 |
| 34 |
Exploring the Design Space of Power-Aware Opto-Electronic Networked Systems. |
2005 |
| 34 |
Supporting highly-decoupled thread-level redundancy for parallel programs. |
2008 |
| 34 |
Reconciling specialization and flexibility through compound circuits. |
2009 |
| 34 |
Fast complete memory consistency verification. |
2009 |
| 34 |
Abstraction and microarchitecture scaling in early-stage power modeling. |
2011 |
| 34 |
Disintegrated control for energy-efficient and heterogeneous memory systems. |
2013 |
| 34 |
QuickRelease: A throughput-oriented approach to release consistency on GPUs. |
2014 |
| 34 |
Event-based scheduling for energy-efficient QoS (eQoS) in mobile Web applications. |
2015 |
| 33 |
A decoupled KILO-instruction processor. |
2006 |
| 33 |
Characterization of Direct Cache Access on multi-core systems and 10GbE. |
2009 |
| 33 |
Power-efficient computing for compute-intensive GPGPU applications. |
2013 |
| 33 |
Increasing TLB reach by exploiting clustering in page translations. |
2014 |
| 32 |
A bandwidth-aware memory-subsystem resource management using non-invasive resource profilers for large CMP systems. |
2010 |
| 32 |
Bloom Filter Guided Transaction Scheduling. |
2011 |
| 32 |
Refrint: Intelligent refresh to minimize power in on-chip multiprocessor cache hierarchies. |
2013 |
| 31 |
Reducing resource redundancy for concurrent error detection techniques in high performance microprocessors. |
2006 |
| 31 |
Fundamental performance constraints in horizontal fusion of in-order cores. |
2008 |
| 30 |
UNified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all. |
2010 |
| 30 |
Hardware/software techniques for DRAM thermal management. |
2011 |
| 30 |
Achieving uniform performance and maximizing throughput in the presence of heterogeneity. |
2011 |
| 30 |
Efficient data streaming with on-chip accelerators: Opportunities and challenges. |
2011 |
| 30 |
Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting. |
2015 |
| 29 |
DMA cache: Using on-chip storage to architecturally separate I/O data from CPU data for improving I/O performance. |
2010 |
| 29 |
SCRAP: Architecture for signature-based protection from Code Reuse Attacks. |
2013 |
| 29 |
Mosaic: Exploiting the spatial locality of process variation to reduce refresh energy in on-chip eDRAM modules. |
2014 |
| 29 |
Improving system throughput and fairness simultaneously in shared memory CMP systems via Dynamic Bank Partitioning. |
2014 |
| 29 |
Sandbox Prefetching: Safe run-time evaluation of aggressive prefetchers. |
2014 |
| 29 |
Warp-level divergence in GPUs: Characterization, impact, and mitigation. |
2014 |
| 28 |
Multithreaded Value Prediction. |
2005 |
| 28 |
Increasing the cache efficiency by eliminating noise. |
2006 |
| 28 |
Practical off-chip meta-data for temporal memory streaming. |
2009 |
| 28 |
Explaining cache SER anomaly using DUE AVF measurement. |
2010 |
| 28 |
MORSE: Multi-objective reconfigurable self-optimizing memory scheduler. |
2012 |
| 28 |
Modeling performance variation due to cache sharing. |
2013 |
| 28 |
Scaling towards kilo-core processors with asymmetric high-radix topologies. |
2013 |
| 28 |
Dynamic management of TurboMode in modern multi-core chips. |
2014 |
| 28 |
TSO-CC: Consistency directed cache coherence for TSO. |
2014 |
| 27 |
Software Directed Issue Queue Power Reduction. |
2005 |
| 27 |
Efficient instruction schedulers for SMT processors. |
2006 |
| 27 |
Single-level integrity and confidentiality protection for distributed shared memory multiprocessors. |
2008 |
| 27 |
ACCESS: Smart scheduling for asymmetric cache CMPs. |
2011 |
| 27 |
Low-voltage on-chip cache architecture using heterogeneous cell sizes for high-performance processors. |
2011 |
| 26 |
Low-Overhead Interactive Debugging via Dynamic Instrumentation with DISE. |
2005 |
| 26 |
Accurate Energy Dissipation and Thermal Modeling for Nanometer-Scale Buses. |
2005 |
| 26 |
High-throughput pairwise point interactions in Anton, a specialized machine for molecular dynamics simulation. |
2008 |
| 26 |
Power-Efficient DRAM Speculation. |
2008 |
| 26 |
Power shifting in Thrifty Interconnection Network. |
2011 |
| 26 |
JETC: Joint energy thermal and cooling management for memory and CPU subsystems in servers. |
2012 |
| 26 |
Statistical performance comparisons of computers. |
2012 |
| 26 |
Exploiting thermal energy storage to reduce data center capital and operating expenses. |
2014 |
| 26 |
Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. |
2015 |
| 26 |
Exploiting compressed block size as an indicator of future reuse. |
2015 |
| 26 |
Coordinated static and dynamic cache bypassing for GPUs. |
2015 |
| 26 |
Bamboo ECC: Strong, safe, and flexible codes for reliable computer memory. |
2015 |
| 25 |
Optical Interconnect Opportunities for Future Server Memory Systems. |
2007 |
| 25 |
Data-triggered threads: Eliminating redundant computation. |
2011 |
| 25 |
MP3: Minimizing performance penalty for power-gating of Clos network-on-chip. |
2014 |
| 25 |
Mascar: Speeding up GPU warps by reducing memory pitstops. |
2015 |
| 25 |
CATalyst: Defeating last-level cache side channel attacks in cloud computing. |
2016 |
| 24 |
Exploiting Postdominance for Speculative Parallelization. |
2007 |
| 24 |
Address-branch correlation: A novel locality for long-latency hard-to-predict branches. |
2008 |
| 24 |
BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution. |
2010 |
| 24 |
π-TM: Pessimistic invalidation for scalable lazy hardware transactional memory. |
2012 |
| 24 |
Layout-conscious random topologies for HPC off-chip interconnects. |
2013 |
| 24 |
ECM: Effective Capacity Maximizer for high-performance compressed caching. |
2013 |
| 24 |
NUAT: A non-uniform access time memory controller. |
2014 |
| 24 |
Quantifying sources of error in McPAT and potential impacts on architectural studies. |
2015 |
| 24 |
Power punch: Towards non-blocking power-gating of NoC routers. |
2015 |
| 23 |
Completely verifying memory consistency of test program executions. |
2006 |
| 23 |
Incorporating flexibility in Anton, a specialized machine for molecular dynamics simulation. |
2008 |
| 23 |
Architectural Contesting. |
2009 |
| 23 |
QORE: A fault tolerant network-on-chip architecture with power-efficient quad-function channel (QFC) buffers. |
2014 |
| 23 |
ChargeCache: Reducing DRAM latency by exploiting row access locality. |
2016 |
| 22 |
Value Based BTB Indexing for indirect jump prediction. |
2010 |
| 22 |
Storage free confidence estimation for the TAGE branch predictor. |
2011 |
| 22 |
Power balanced pipelines. |
2012 |
| 22 |
Network congestion avoidance through Speculative Reservation. |
2012 |
| 22 |
Mobile CPU’s rise to power: Quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction. |
2016 |
| 21 |
Software-hardware cooperative memory disambiguation. |
2006 |
| 21 |
Improving Branch Prediction and Predicated Execution in Out-of-Order Processors. |
2007 |
| 21 |
Runahead Threads to improve SMT performance. |
2008 |
| 21 |
Decoupled dynamic cache segmentation. |
2012 |
| 21 |
Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers. |
2015 |
| 20 |
Store vectors for scalable memory dependence prediction and scheduling. |
2006 |
| 20 |
Roughness of microarchitectural design topologies and its implications for optimization. |
2008 |
| 20 |
Offline symbolic analysis to infer Total Store Order. |
2011 |
| 20 |
MACAU: A Markov model for reliability evaluations of caches under Single-bit and Multi-bit Upsets. |
2012 |
| 20 |
Cost effective data center servers. |
2013 |
| 20 |
XChange: A market-based approach to scalable dynamic multi-resource allocation in multicore architectures. |
2015 |
| 19 |
Feedback mechanisms for improving probabilistic memory prefetching. |
2009 |
| 19 |
Soft error vulnerability aware process variation mitigation. |
2009 |
| 19 |
IADVS: On-demand performance for interactive applications. |
2010 |
| 19 |
HAQu: Hardware-accelerated queueing for fine-grained threading on a chip multiprocessor. |
2011 |
| 19 |
Reducing the cost of persistence for nonvolatile heaps in end user devices. |
2014 |
| 19 |
Concurrent and consistent virtual machine introspection with hardware transactional memory. |
2014 |
| 19 |
CREAM: A Concurrent-Refresh-Aware DRAM Memory architecture. |
2014 |
| 19 |
Stash directory: A scalable directory for many-core coherence. |
2014 |
| 19 |
Priority-based cache allocation in throughput processors. |
2015 |
| 19 |
Prediction-based superpage-friendly TLB designs. |
2015 |
| 19 |
Unlocking bandwidth for GPUs in CC-NUMA systems. |
2015 |
| 19 |
Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM. |
2016 |
| 18 |
Prediction of CPU idle-busy activity pattern. |
2008 |
| 18 |
Checked Load: Architectural support for JavaScript type-checking on mobile processors. |
2011 |
| 18 |
WEST: Cloning data cache behavior using Stochastic Traces. |
2012 |
| 18 |
Supporting efficient collective communication in NoCs. |
2012 |
| 18 |
Pacman: Tolerating asymmetric data races with unintrusive hardware. |
2012 |
| 18 |
Improving multi-core performance using mixed-cell cache architecture. |
2013 |
| 18 |
Worm-Bubble Flow Control. |
2013 |
| 18 |
Sprinkler: Maximizing resource utilization in many-chip solid state disks. |
2014 |
| 18 |
PVCoherence: Designing flat coherence protocols for scalable verification. |
2014 |
| 18 |
Supporting superpages in non-contiguous physical memory. |
2015 |
| 18 |
BRAINIAC: Bringing reliable accuracy into neurally-implemented approximate computing. |
2015 |
| 17 |
Probabilistic counter updates for predictor hysteresis and stratification. |
2006 |
| 17 |
LiteTM: Reducing transactional state overhead. |
2010 |
| 17 |
Locality-aware data replication in the Last-Level Cache. |
2014 |
| 17 |
Spare register aware prefetching for graph algorithms on GPUs. |
2014 |
| 17 |
Implications of high energy proportional servers on cluster-wide energy proportionality. |
2014 |
| 17 |
Practical data value speculation for future high-end processors. |
2014 |
| 17 |
Talus: A simple way to remove cliffs in cache performance. |
2015 |
| 17 |
Hierarchical private/shared classification: The key to simple and efficient coherence for clustered cache hierarchies. |
2015 |
| 16 |
Criticality-based optimizations for efficient load processing. |
2009 |
| 16 |
SIF: Overcoming the limitations of SIMD devices via implicit permutation. |
2010 |
| 16 |
StimulusCache: Boosting performance of chip multiprocessors with excess cache. |
2010 |
| 16 |
Delay-Hiding energy management mechanisms for DRAM. |
2010 |
| 16 |
Network within a network approach to create a scalable high-radix router microarchitecture. |
2012 |
| 16 |
Tag tables. |
2015 |
| 15 |
Parabix: Boosting the efficiency of text processing on commodity processors. |
2012 |
| 15 |
Cache restoration for highly partitioned virtualized systems. |
2012 |
| 15 |
Exploring high-performance and energy proportional interface for phase change memory systems. |
2013 |
| 15 |
Tangle: Route-oriented dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks. |
2014 |
| 15 |
A scalable multi-path microarchitecture for efficient GPU control flow. |
2014 |
| 15 |
CAFO: Cost aware flip optimization for asymmetric memories. |
2015 |
| 15 |
Malware-aware processors: A framework for efficient online malware detection. |
2015 |
| 14 |
Implications of Device Timing Variability on Full Chip Timing. |
2007 |
| 14 |
PEEP: Exploiting predictability of memory dependences in SMT processors. |
2008 |
| 14 |
Adaptive Reliability Chipkill Correct (ARCC). |
2013 |
| 14 |
Precision-aware soft error protection for GPUs. |
2014 |
| 14 |
Revolver: Processor architecture for power efficient loop execution. |
2014 |
| 14 |
Understanding contention-based channels and using them for defense. |
2015 |
| 13 |
Exascale computing: The challenges and opportunities in the next decade. |
2010 |
| 13 |
Accelerating business analytics applications. |
2012 |
| 13 |
Undersubscribed threading on clustered cache architectures. |
2014 |
| 13 |
Domain knowledge based energy management in handhelds. |
2015 |
| 13 |
Paying to save: Reducing cost of colocation data center via rewards. |
2015 |
| 13 |
Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning. |
2016 |
| 12 |
Chip-multiprocessing and beyond. |
2006 |
| 12 |
PaCo: Probability-based path confidence prediction. |
2008 |
| 12 |
Adaptive Set-Granular Cooperative Caching. |
2012 |
| 12 |
TS-Router: On maximizing the Quality-of-Allocation in the On-Chip Network. |
2013 |
| 12 |
Dynamically detecting and tolerating IF-Condition Data Races. |
2014 |
| 12 |
DraMon: Predicting memory bandwidth usage of multi-threaded programs with high accuracy and low overhead. |
2014 |
| 12 |
Up by their bootstraps: Online learning in Artificial Neural Networks for CMP uncore power management. |
2014 |
| 12 |
Scaling distributed cache hierarchies through computation and data co-scheduling. |
2015 |
| 11 |
Tapping ZettaRAM™ for Low-Power Memory Systems. |
2005 |
| 11 |
Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions. |
2005 |
| 11 |
Skinflint DRAM system: Minimizing DRAM chip writes for low power. |
2013 |
| 11 |
Macho: A failure model-oriented adaptive cache architecture to enable near-threshold voltage scaling. |
2013 |
| 11 |
Accordion: Toward soft Near-Threshold Voltage Computing. |
2014 |
| 11 |
3D stacking of high-performance processors. |
2014 |
| 11 |
Augmenting low-latency HPC network with free-space optical links. |
2015 |
| 11 |
TABLA: A unified template-based framework for accelerating statistical machine learning. |
2016 |
| 10 |
Serializing instructions in system-intensive workloads: Amdahl’s Law strikes again. |
2008 |
| 10 |
Speculative instruction validation for performance-reliability trade-off. |
2008 |
| 10 |
COMIC++: A software SVM system for heterogeneous multicore accelerator clusters. |
2010 |
| 10 |
BulkSMT: Designing SMT processors for atomic-block execution. |
2012 |
| 10 |
Illusionist: Transforming lightweight cores into aggressive cores on demand. |
2013 |
| 10 |
Store-Load-Branch (SLB) predictor: A compiler assisted branch prediction for data dependent branches. |
2013 |
| 10 |
STM: Cloning the spatial and temporal memory access behavior. |
2014 |
| 10 |
Strategies for anticipating risk in heterogeneous system design. |
2014 |
| 10 |
Overcoming far-end congestion in large-scale networks. |
2015 |
| 10 |
Revisiting virtual L1 caches: A practical design using dynamic synonym remapping. |
2016 |
| 10 |
Energy-efficient address translation. |
2016 |
| 9 |
Exploiting criticality to reduce bottlenecks in distributed uniprocessors. |
2011 |
| 9 |
Rainbow: Efficient memory dependence recording with high replay parallelism for relaxed memory model. |
2013 |
| 9 |
RECAP: A region-based cure for the common cold (cache). |
2013 |
| 9 |
SCOC: High-radix switches made of bufferless clos networks. |
2015 |
| 9 |
FTXen: Making hypervisor resilient to hardware faults on relaxed cores. |
2015 |
| 9 |
Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing. |
2016 |
| 9 |
A performance analysis framework for optimizing OpenCL applications on FPGAs. |
2016 |
| 9 |
HRL: Efficient and flexible reconfigurable logic for near-data processing. |
2016 |
| 8 |
Performance-aware speculation control using wrong path usefulness prediction. |
2008 |
| 8 |
Handling branches in TLS systems with Multi-Path Execution. |
2010 |
| 8 |
Hardware/software-based diagnosis of load-store queues using expandable activity logs. |
2011 |
| 8 |
Bridging the semantic gap: Emulating biological neuronal behaviors with simple digital neurons. |
2013 |
| 8 |
A Non-Inclusive Memory Permissions architecture for protection against cross-layer attacks. |
2014 |
| 8 |
Reducing read latency of phase change memory via early read and Turbo Read. |
2015 |
| 8 |
Warped-preexecution: A GPU pre-execution approach for improving latency hiding. |
2016 |
| 8 |
A case for toggle-aware compression for GPU systems. |
2016 |
| 7 |
Architectural support for synchronization-free deterministic parallel programming. |
2012 |
| 7 |
A novel system architecture for web scale applications using lightweight CPUs and virtualized I/O. |
2013 |
| 7 |
A multiple SIMD, multiple data (MSMD) architecture: Parallel execution of dynamic and static SIMD fragments. |
2013 |
| 7 |
Two level bulk preload branch prediction. |
2013 |
| 7 |
High-speed formal verification of heterogeneous coherence hierarchies. |
2013 |
| 7 |
Understanding the impact of gate-level physical reliability effects on whole program execution. |
2014 |
| 7 |
Atomic SC for simple in-order processors. |
2014 |
| 7 |
Transportation-network-inspired network-on-chip. |
2014 |
| 7 |
FADE: A programmable filtering accelerator for instruction-grain monitoring. |
2014 |
| 7 |
Exploring architectural heterogeneity in intelligent vision systems. |
2015 |
| 7 |
GPU voltage noise: Characterization and hierarchical smoothing of spatial and temporal voltage noise interference in GPU architectures. |
2015 |
| 7 |
BeBoP: A cost effective predictor infrastructure for superscalar value prediction. |
2015 |
| 7 |
Understanding idle behavior and power gating mechanisms in the context of modern benchmarks on CPU-GPU Integrated systems. |
2015 |
| 7 |
Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family. |
2016 |
| 7 |
A large-scale study of soft-errors on GPUs in the field. |
2016 |
| 7 |
Atomic persistence for SCM with a non-intrusive backend controller. |
2016 |
| 6 |
High-Performance low-vcc in-order core. |
2010 |
| 6 |
Flexible register management using reference counting. |
2012 |
| 6 |
In-network traffic regulation for Transactional Memory. |
2013 |
| 6 |
iPatch: Intelligent fault patching to improve energy efficiency. |
2015 |
| 6 |
Flask coherence: A morphable hybrid coherence protocol to balance energy, performance and scalability. |
2015 |
| 6 |
Balancing reliability, cost, and performance tradeoffs with FreeFault. |
2015 |
| 6 |
Selective GPU caches to eliminate CPU-GPU HW cache coherence. |
2016 |
| 5 |
Speculative synchronization and thread management for fine granularity threads. |
2006 |
| 5 |
Fabric convergence implications on systems architecture. |
2008 |
| 5 |
HARE: Hardware assisted reverse execution. |
2010 |
| 5 |
DMA++: on the fly data realignment for on-chip memories. |
2010 |
| 5 |
Fg-STP: Fine-Grain Single Thread Partitioning on Multicores. |
2011 |
| 5 |
Architectural framework for supporting operating system survivability. |
2011 |
| 5 |
A group-commit mechanism for ROB-based processors implementing the X86 ISA. |
2013 |
| 5 |
Over-clocked SSD: Safely running beyond flash memory chip I/O clock specs. |
2014 |
| 5 |
CDTT: Compiler-generated data-triggered threads. |
2014 |
| 5 |
Scalably verifiable dynamic power management. |
2014 |
| 5 |
GPUdmm: A high-performance and memory-oblivious GPU architecture using dynamic memory management. |
2014 |
| 5 |
High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches. |
2015 |
| 5 |
Increasing multicore system efficiency through intelligent bandwidth shifting. |
2015 |
| 5 |
Understanding the virtualization “Tax” of scale-out pass-through GPUs in GaaS clouds: An empirical study. |
2015 |
| 5 |
CiDRA: A cache-inspired DRAM resilience architecture. |
2015 |
| 5 |
Scalable communication architecture for network-attached accelerators. |
2015 |
| 5 |
VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors. |
2015 |
| 5 |
Efficient footprint caching for Tagless DRAM Caches. |
2016 |
| 5 |
A complete key recovery timing attack on a GPU. |
2016 |
| 5 |
McVerSi: A test generation framework for fast memory consistency verification in simulation. |
2016 |
| 5 |
Pushing the limits of accelerator efficiency while retaining programmability. |
2016 |
| 5 |
Lattice priority scheduling: Low-overhead timing-channel protection for a shared memory controller. |
2016 |
| 5 |
Restore truncation for performance improvement in future DRAM systems. |
2016 |
| 5 |
Modeling cache performance beyond LRU. |
2016 |
| 5 |
SLaC: Stage laser control for a flattened butterfly network. |
2016 |
| 4 |
Interconnect-Centric Computing. |
2007 |
| 4 |
Branch-mispredict level parallelism (BLP) for control independence. |
2008 |
| 4 |
LeadOut: Composing low-overhead frequency-enhancing techniques for single-thread performance in configurable multicores. |
2010 |
| 4 |
BulkCompactor: Optimized deterministic execution via Conflict-Aware commit of atomic blocks. |
2012 |
| 4 |
Architectural perspectives of future wireless base stations based on the IBM PowerEN™ processor. |
2012 |
| 4 |
How to implement effective prediction and forwarding for fusable dynamic multicore architectures. |
2013 |
| 4 |
Correction prediction: Reducing error correction latency for on-chip memories. |
2015 |
| 4 |
CompEx: Compression-expansion coding for energy, latency, and lifetime improvements in MLC/TLC NVM. |
2016 |
| 4 |
ScalCore: Designing a core for voltage scalability. |
2016 |
| 4 |
Best-offset hardware prefetching. |
2016 |
| 4 |
Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines. |
2016 |
| 4 |
Towards high performance paged memory for GPUs. |
2016 |
| 4 |
SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies. |
2017 |
| 4 |
Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques. |
2017 |
| 3 |
Petascale Computing Research Challenges - A Manycore Perspective. |
2007 |
| 3 |
Lightweight predication support for out of order processors. |
2009 |
| 3 |
MOPED: Orchestrating interprocess message data on CMPs. |
2011 |
| 3 |
Safe and efficient supervised memory systems. |
2011 |
| 3 |
Improving smartphone user experience by balancing performance and energy with probabilistic QoS guarantee. |
2016 |
| 3 |
LASER: Light, Accurate Sharing dEtection and Repair. |
2016 |
| 3 |
A low power software-defined-radio baseband processor for the Internet of Things. |
2016 |
| 3 |
Parity Helix: Efficient protection for single-dimensional faults in multi-dimensional memory systems. |
2016 |
| 3 |
Symbiotic job scheduling on the IBM POWER8. |
2016 |
| 3 |
MaPU: A novel mathematical computing architecture. |
2016 |
| 3 |
Transparent and Efficient CFI Enforcement with Intel Processor Trace. |
2017 |
| 2 |
Industrial Perspectives: Platform Design Challenges with Many cores. |
2006 |
| 2 |
Opportunities beyond single-core microprocessors. |
2009 |
| 2 |
Accelerating decoupled look-ahead via weak dependence removal: A metaheuristic approach. |
2014 |
| 2 |
Studying the impact of multicore processor scaling on directory techniques via reuse distance analysis. |
2015 |
| 2 |
Alloy: Parallel-serial memory channel architecture for single-chip heterogeneous processor systems. |
2015 |
| 2 |
Approximating warps with intra-warp operand value similarity. |
2016 |
| 2 |
Software transparent dynamic binary translation for coarse-grain reconfigurable architectures. |
2016 |
| 2 |
Amdahl’s law for lifetime reliability scaling in heterogeneous multicore processors. |
2016 |
| 2 |
Cost effective physical register sharing. |
2016 |
| 2 |
A low-power hybrid reconfigurable architecture for resistive random-access memories. |
2016 |
| 2 |
LiveSim: Going live with microarchitecture simulation. |
2016 |
| 2 |
Core tunneling: Variation-aware voltage noise mitigation in GPUs. |
2016 |
| 2 |
Venice: Exploring server architectures for effective resource sharing. |
2016 |
| 2 |
PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning. |
2017 |
| 1 |
Architecting for power management: The IBM POWER7™ approach. |
2010 |
| 1 |
Hybrid latency tolerance for robust energy-efficiency on 1000-core data parallel processors. |
2013 |
| 1 |
Low-overhead and high coverage run-time race detection through selective meta-data management. |
2014 |
| 1 |
DVFS for NoCs in CMPs: A thread voting approach. |
2016 |
| 1 |
DUANG: Fast and lightweight page migration in asymmetric memory systems. |
2016 |
| 1 |
PleaseTM: Enabling transaction conflict management in requester-wins hardware transactional memory. |
2016 |
| 1 |
Minimal disturbance placement and promotion. |
2016 |
| 1 |
iPAWS: Instruction-issue pattern-based adaptive warp scheduling for GPGPUs. |
2016 |
| 1 |
Efficient synthetic traffic models for large, complex SoCs. |
2016 |
| 1 |
Efficient GPU hardware transactional memory through early conflict resolution. |
2016 |
| 1 |
The runahead network-on-chip. |
2016 |
| 1 |
RADAR: Runtime-assisted dead region management for last-level caches. |
2016 |
| 1 |
SizeCap: Efficiently handling power surges in fuel cell powered data centers. |
2016 |
| 1 |
A market approach for handling power emergencies in multi-tenant data center. |
2016 |
| 1 |
Cooper: Task Colocation with Cooperative Games. |
2017 |
| 1 |
Secure Dynamic Memory Scheduling Against Timing Channel Attacks. |
2017 |
| 1 |
Controlled Kernel Launch for Dynamic Parallelism in GPUs. |
2017 |
| 1 |
Exploring Hyperdimensional Associative Memory. |
2017 |
| 1 |
SILC-FM: Subblocked InterLeaved Cache-Like Flat Memory Organization. |
2017 |
| 1 |
ATOM: Atomic Durability in Non-volatile Memory through Hardware Logging. |
2017 |
| 1 |
MemPod: A Clustered Architecture for Efficient and Scalable Migration in Flat Address Space Multi-level Memories. |
2017 |
| 1 |
Needle: Leveraging Program Analysis to Analyze and Extract Accelerators from Whole Programs. |
2017 |
| 1 |
Dynamic GPGPU Power Management Using Adaptive Model Predictive Control. |
2017 |
| 1 |
SWAP: Effective Fine-Grain Management of Shared Last-Level Caches with Minimum Hardware Support. |
2017 |
| 1 |
GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks. |
2017 |
| 1 |
Near-Ideal Networks-on-Chip for Servers. |
2017 |
| 0 |
The Future of Computer Architecture Research: An Industrial Perspective. |
2005 |
| 0 |
Industrial Perspectives: The Next Roadblocks in SOC Evolution: On-Chip Storage Capacity and Off-Chip Bandwidth. |
2006 |
| 0 |
Industrial Perspectives: System IO Network Evolution - Closing Requirement Gaps. |
2006 |
| 0 |
New architectures for a new biology. |
2006 |
| 0 |
Intel’s Tera-scale Computing Project: The first five years, the next five years. |
2008 |
| 0 |
Compilers and parallel computing systems. |
2008 |
| 0 |
Industrial perspectives panel. |
2009 |
| 0 |
Multi-core demands multi-interfaces. |
2009 |
| 0 |
Is hardware innovation over? |
2010 |
| 0 |
Extreme scale computing: Challenges and opportunities. |
2010 |
| 0 |
How’s the parallel computing revolution going? |
2011 |
| 0 |
Improving in-memory database index performance with Intel® Transactional Synchronization Extensions. |
2014 |
| 0 |
Run-time monitoring with adjustable overhead using dataflow-guided filtering. |
2015 |
| 0 |
Design and implementation of a mobile storage leveraging the DRAM interface. |
2016 |
| 0 |
SCsafe: Logging sequential consistency violations continuously and precisely. |
2016 |
| 0 |
PABST: Proportionally Allocated Bandwidth at the Source and Target. |
2017 |
| 0 |
Near-Optimal Access Partitioning for Memory Hierarchies with Multiple Heterogeneous Bandwidth Sources. |
2017 |
| 0 |
BRAVO: Balanced Reliability-Aware Voltage Optimization. |
2017 |
| 0 |
Hipster: Hybrid Task Manager for Latency-Critical Cloud Workloads. |
2017 |
| 0 |
Designing Low-Power, Low-Latency Networks-on-Chip by Optimally Combining Electrical and Optical Links. |
2017 |
| 0 |
Design and Analysis of an APU for Exascale Computing. |
2017 |
| 0 |
Boomerang: A Metadata-Free Architecture for Control Flow Delivery. |
2017 |
| 0 |
Partial Row Activation for Low-Power DRAM System. |
2017 |
| 0 |
High-Bandwidth Low-Latency Approximate Interconnection Networks. |
2017 |
| 0 |
Efficient Sequential Consistency in GPUs via Relativistic Cache Coherence. |
2017 |
| 0 |
Static Bubble: A Framework for Deadlock-Free Irregular On-chip Topologies. |
2017 |
| 0 |
Cooperative Path-ORAM for Effective Memory Bandwidth Sharing in Server Settings. |
2017 |
| 0 |
Camouflage: Memory Traffic Shaping to Mitigate Timing Attacks. |
2017 |
| 0 |
Cold Boot Attacks are Still Hot: Security Analysis of Memory Scramblers in Modern Processors. |
2017 |
| 0 |
Balancing Performance and Lifetime of MLC PCM by Using a Region Retention Monitor. |
2017 |
| 0 |
Architecting an Energy-Efficient DRAM System for GPUs. |
2017 |
| 0 |
Processing-in-Memory Enabled Graphics Processors for 3D Rendering. |
2017 |
| 0 |
Design and Evaluation of AWGR-Based Photonic NoC Architectures for 2.5D Integrated High Performance Computing Systems. |
2017 |
| 0 |
Defect Analysis and Cost-Effective Resilience Architecture for Future DRAM Devices. |
2017 |
| 0 |
Random Folded Clos Topologies for Datacenter Networks. |
2017 |
| 0 |
Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators. |
2017 |
| 0 |
Enabling Effective Module-Oblivious Power Gating for Embedded Processors. |
2017 |
| 0 |
Fast Decentralized Power Capping for Server Clusters. |
2017 |
| 0 |
Maximizing Cache Performance Under Uncertainty. |
2017 |
| 0 |
Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures. |
2017 |
| 0 |
Supporting Address Translation for Accelerator-Centric Architectures. |
2017 |
| 0 |
G-Scalar: Cost-Effective Generalized Scalar Execution Architecture for Power-Efficient GPUs. |
2017 |
| 0 |
NCAP: Network-Driven, Packet Context-Aware Power Management for Client-Server Architecture. |
2017 |
| 0 |
Fast and Accurate Exploration of Multi-level Caches Using Hierarchical Reuse Distance. |
2017 |
| 0 |
Application-Specific Performance-Aware Energy Optimization on Android Mobile Devices. |
2017 |
| 0 |
Pilot Register File: Energy Efficient Partitioned Register File for GPUs. |
2017 |
| 0 |
FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks. |
2017 |
| 0 |
Reliability-Aware Scheduling on Heterogeneous Multicore Processors. |
2017 |
| 0 |
KAML: A Flexible, High-Performance Key-Value SSD. |
2017 |
| 0 |
A Split Cache Hierarchy for Enabling Data-Oriented Optimizations. |
2017 |
| 0 |
Understanding and Optimizing Power Consumption in Memory Networks. |
2017 |
| 0 |
SOUP-N-SALAD: Allocation-Oblivious Access Latency Reduction with Asymmetric DRAM Microarchitectures. |
2017 |
| 0 |
Tiny Directory: Efficient Shared Memory in Many-Core Systems with Ultra-Low-Overhead Coherence Tracking. |
2017 |