We show that the judicious use of PEIs (PIM-enabled instructions) can achieve significant speedup with minimal programming effort and no changes to the existing programming model.
This makes PIM operations interoperable with existing programming models, cache coherence protocols, and virtual memory mechanisms without modification.
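As a rough sketch of what that interface could look like (the `pim_add` intrinsic, its fallback path, and the graph-update use case are illustrative assumptions, not the actual ISA), a PEI can appear to the programmer as a one-line replacement for an ordinary atomic update, leaving the rest of the code, its pointers, and its virtual addresses untouched:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical PIM-enabled instruction (PEI): a single-cache-line
// atomic update that hardware may execute either in memory or on the
// host side, depending on data locality. The intrinsic name is
// illustrative only.
inline void pim_add(std::atomic<uint64_t>* addr, uint64_t delta) {
    // Fallback for machines without PEI support: a plain atomic add.
    // On PEI hardware this would lower to a single pim-style instruction.
    addr->fetch_add(delta, std::memory_order_relaxed);
}

// Existing code needs only a one-line change at the update site;
// addressing, coherence, and the surrounding loop stay the same.
void accumulate_degrees(std::atomic<uint64_t>* degree,
                        const uint32_t* edges, std::size_t num_edges) {
    for (std::size_t i = 0; i < num_edges; ++i)
        pim_add(&degree[edges[i]], 1);  // candidate PEI
}
```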
The discussion of the rationale behind SMS is thorough; that rationale is what makes the mechanism much simpler than previous schedulers.
The memory channel is bandwidth-limited and power-hungry.
However, seamlessly integrating PIM architectures with existing systems remains challenging due to two common characteristics: unconventional programming models for in-memory computation units and the inability to utilize large on-chip caches.
Existing parallel simulation techniques either scale poorly due to excessive synchronization or sacrifice accuracy by allowing event reordering and using simplistic contention models.
We focus on execution-driven timing simulators, which are the preeminent tools for evaluating future designs.
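To make the trade-off concrete, consider the common quantum-based approach to parallel timing simulation: each simulated core advances independently for a fixed time quantum, and all worker threads then synchronize at a barrier. A longer quantum reduces synchronization cost but allows cross-core events within the quantum to be reordered. The sketch below is a minimal illustration under these assumptions; the `CoreModel`, quantum length, and threading layout are placeholders, not any particular simulator's implementation.

```cpp
#include <barrier>
#include <cstdint>
#include <thread>
#include <vector>

constexpr uint64_t kQuantum = 1000;  // cycles per sync interval (placeholder)

struct CoreModel {                   // stand-in for a detailed core timing model
    uint64_t local_cycle = 0;
    void simulate_until(uint64_t t) { local_cycle = t; /* advance core state */ }
};

// Quantum-based parallel simulation: each thread advances one core
// independently for kQuantum cycles, then all threads barrier-sync.
// Larger quanta scale better but let cross-core events inside a
// quantum be (re)ordered, trading accuracy for speed.
void run(std::vector<CoreModel>& cores, uint64_t end_cycle) {
    std::barrier sync(static_cast<std::ptrdiff_t>(cores.size()));
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < cores.size(); ++i) {
        workers.emplace_back([&cores, &sync, i, end_cycle] {
            for (uint64_t t = kQuantum; t <= end_cycle; t += kQuantum) {
                cores[i].simulate_until(t);  // no inter-core interaction here
                sync.arrive_and_wait();      // resolve shared-resource contention
            }
        });
    }
    for (auto& w : workers) w.join();
}
```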
Since memory bandwidth is scarce in NUMA systems, prior work focuses on how to schedule threads across NUMA nodes to reduce bandwidth contention, similar to TOM for GPU-based asymmetric systems.
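As a concrete, Linux-specific illustration of node-aware thread placement, the sketch below pins worker threads to the CPUs of alternating NUMA nodes so that bandwidth demand is spread across memory controllers. The two-node CPU layout is a hypothetical example; a real scheduler would discover the topology at run time (e.g., via libnuma), and this is not the placement policy of any particular prior work.

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Pin a thread to the CPUs of one NUMA node (Linux, GNU extension).
// The node-to-CPU map used below is a placeholder; a real system
// would query it via libnuma or /sys/devices/system/node.
void pin_to_node(std::thread& t, const std::vector<int>& node_cpus) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu : node_cpus) CPU_SET(cpu, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    // Hypothetical 2-node layout: CPUs 0-3 on node 0, CPUs 4-7 on node 1.
    std::vector<std::vector<int>> node_cpus = {{0, 1, 2, 3}, {4, 5, 6, 7}};

    std::vector<std::thread> workers;
    for (int i = 0; i < 8; ++i) {
        workers.emplace_back([] { /* memory-intensive work */ });
        // Interleave threads across nodes so no single memory
        // controller becomes the bandwidth bottleneck.
        pin_to_node(workers.back(), node_cpus[i % 2]);
    }
    for (auto& w : workers) w.join();
    return 0;
}
```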
Asymmetric hierarchies provide ample opportunity to improve the performance and efficiency of memory-intensive applications.