| 2026 | A Diagonal Block Memory-Aware Polynomial Preconditioner for Linear and Eigenvalue Solvers. Xiaojian Yang, Yuhui Ni, Fan Yuan, Shengguo Li, Dezun Dong, Chuanfu Xu, Haipeng Jia, Jie Liu |
| 2026 | A Distributed Matrix-Block-Vector Multiplication in Presence of System Performance Variability. Yuchen Ma, Bin Ren, Andreas Stathopoulos |
| 2026 | APERTURE: Algorithm-System Co-optimization for Temporal Graph Network Inference. Yiqing Wang, Hailong Yang, Enze Yu, Qingxiao Sun, Kejie Ma, Kaige Zhang, Chenhao Xie, Depei Qian |
| 2026 | ASM-SpMM: Unleashing the Potential of Arm SME for Sparse Matrix Multiplication Acceleration. Jiazhi Jiang, Xijia Yao, Jiayu Chen, Jinhui Wei, Dan Huang, Yutong Lu |
| 2026 | Accelerating Sparse Transformer Inference on GPU. Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Qianwen Cao, Qingxiao Sun |
| 2026 | BEEMS: Boosting Machine Vision Efficiency via Computation Graph-Based Memory Smoothing. Hanjing Shen, Fangxin Liu, Jian Liu, Li Jiang, Haibing Guan |
| 2026 | Binary Compatible Critical Section Delegation. Junyao Zhang, Zhuo Wang, Zhe Zhou |
| 2026 | CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training. Yida Gu, Fakang Wang, Jianhao Fu, Zhenhang Sun, Qianyu Zhang, Hairui Zhao, Xingchen Liu, Yang Tian, Wenjing Huang, Zedong Liu, Yifan Chen, Jinwu Yang, Yueyuan Zhou, Qian Zhao, Haoxu Li, Tao Wang, Feng Yu, Zhan Wang, Guangming Tan, Dingwen Tao |
| 2026 | COCCL: A Collective Communication Library Supporting Easy Integration and Configuration of Customized Compression for Scalable LLM Training. Xingchen Liu, Haoran Kong, Hairui Zhao, Shengkai Lyu, Zheng Wei, Man Liu, Xingjian Tian, Liyang Zhao, Zhuohan Chen, Fakang Wang, Zizhong Chen, Zhan Wang, Guangming Tan, Dingwen Tao |
| 2026 | Cacheman: A Comprehensive Last-Level Cache Management System for Multi-tenant Clouds. Xiaokang Hu, Yuchao Cao, Naixuan Guan, Yifan Wu, Xishi Qiu, Shengdong Dai, Ben Luo, Sanchuan Cheng, Fudong Qiu, Yibin Shen, Jiesheng Wu |
| 2026 | Characterizing Matrix Multiplication Units across General Parallel Patterns in Scientific Computing. Yuechen Lu, Hongwei Zeng, Marc Casas, Weifeng Liu |
| 2026 | ChituDiffusion: A Data-Characteristic-Aware Serving System for Diffusion Models. Chengzhang Wu, Liyan Zheng, Haojie Wang, Kezhao Huang, Zixuan Ma, Dong Dong, Jidong Zhai |
| 2026 | Concurrent Balanced Augmented Trees. Evan Wrench, Ajay Singh, Younghun Roh, Panagiota Fatourou, Siddhartha Jayanti, Eric Ruppert, Yuanhao Wei |
| 2026 | DTMiner: A Data-Centric System for Efficient Temporal Motif Mining. Yinbo Hou, Hao Qi, Ligang He, Jin Zhao, Yu Zhang, Hui Yu, Longlong Lin, Lin Gu, Wenbin Jiang, Xiaofei Liao, Hai Jin |
| 2026 | DiggerBees: Depth First Search Leveraging Hierarchical Block-Level Stealing on GPUs. Yuyao Niu, Yuechen Lu, Weifeng Liu, Marc Casas |
| 2026 | Dynamic Detection of Inefficient Data Mapping Patterns in Heterogeneous OpenMP Applications. Luke Marzen, Junhyung Shim, Ali Jannesari |
| 2026 | ElasGNN: An Elastic Training Framework for Distributed GNN Training. Siqi Wang, Hailong Yang, Pengbo Wang, Hongliang Cao, Yufan Xu, Xuezhu Wang, Zhongzhi Luan, Yi Liu, Depei Qian |
| 2026 | Elastor: Elastic and Efficient Model Partitioning and Checkpointing for Fault-Tolerant Distributed Training. Xuanyu Wang, Fangcheng Fu, Haoyang Li, Hao Ge, Sheng Lin, Jiawen Niu, Bin Cui |
| 2026 | Exploiting Efficient Mapping and Pipelined Execution for Accelerating SpMV on Tensor Cores. Kaige Zhang, Hailong Yang, Xin You, Tianyu Feng, Yufan Xu, Zhongzhi Luan, Yi Liu, Depei Qian |
| 2026 | Faster and Cheaper: Pushing the Sequence Alignment Throughput with Commercial CPUs. Zhonghai Zhang, Yewen Li, Ke Meng, Chunming Zhang, Guangming Tan |
| 2026 | Fixing Non-blocking Data Structures for Better Compatibility with Memory Reclamation Schemes. Md Amit Hasan Arovi, Ruslan Nikolaev |
| 2026 | FlashAttention-T: Towards Fully Tensorized Attention by Exploiting Tensor-Vector Parallelism. Jianxing Xu, Yuanbo Wen, Jun Bi, Ruibai Xu, Guanglin Xu, Rui Zhang, Wei Li, Ling Li, Tianshi Chen, Qi Guo, Yunji Chen |
| 2026 | Hapax Locks: Scalable Value-Based Mutual Exclusion. Dave Dice, Alex Kogan |
| 2026 | HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism. Geng Zhang, Shenggan Cheng, Xuanlei Zhao, Ziming Liu, Yang You |
| 2026 | HierCut: Enabling 16-bit Format Mixed Precision for Molecular Dynamics through Hierarchical Cutoff. Zeyu Song, Lin Gan, Xiaohui Duan, Zhengrui Li, Jiayu Fu, Yinuo Wang, Guangzhao Li, Guangwen Yang |
| 2026 | High-Throughput Non-uniformly Quantized 3-bit LLM Inference. Yuang Chen, Wenqi Zeng, Jeffrey Xu Yu |
| 2026 | JanusQuant: Accurate and Efficient 2-bit KV Cache Quantization for Long-Context Inference. Chengyu Sun, Yaqi Xia, Hulin Wang, Donglin Yang, Xiaobo Zhou, Dazhao Cheng |
| 2026 | Laser: Unlocking Layer-Level Scheduling for Efficient Multi-SLO LLM Serving. Jianxiong Liao, Quanxing Dong, Yunkai Liang, Zhi Zhou, Xu Chen |
| 2026 | MetaAttention: A Unified and Performant Attention Framework across Hardware Backends. Feiyang Chen, Yu Cheng, Lei Wang, Yuqing Xia, Ziming Miao, Lingxiao Ma, Fan Yang, Jilong Xue, Zhi Yang, Mao Yang, Xingda Wei, Haibo Chen |
| 2026 | MixFusion: A Patch-Level Parallel Serving System for Mixed-Resolution Diffusion Models. Desen Sun, Zepeng Zhao, Yuke Wang |
| 2026 | Multiverse: Transactional Memory with Dynamic Multiversioning. Gaetano Coccimiglio, Trevor Brown, Srivatsan Ravi |
| 2026 | PANA: A Fine-Grained Runtime-Adaptive Load Balancing for Parallel SpMV on Multicore CPUs. Haodong Bian, Youhui Zhang, Xiang Fei, Jianqiang Huang, Xiaoying Wang |
| 2026 | PIM-zd-tree: A Fast Space-Partitioning Index Leveraging Processing-in-Memory. Yiwei Zhao, Hongbo Kang, Ziyang Men, Yan Gu, Guy E. Blelloch, Laxman Dhulipala, Charles McGuffey, Phillip B. Gibbons |
| 2026 | PRISM: An Efficient GPU-Based Lossy Compression Framework for Progressive Data Retrieval with Multi-Level Interpolation. Bing Lu, Zedong Liu, Hairui Zhao, Dejun Luo, Wenjing Huang, Yida Gu, Jinyang Liu, Guangming Tan, Dingwen Tao |
| 2026 | ParDiff: Efficiently Parallelizing Reverse-Mode Automatic Differentiation with Direct Indexing. Shuhong Huang, Shizhi Tang, Yuan Wen, Huanqi Cao, Ruibai Tang, Yidong Chen, Jiping Yu, Yang Li, Chao Jiang, Limin Xiao, Jidong Zhai |
| 2026 | Parallel Dynamic Spatial Indexes. Ziyang Men, Bo Huang, Yan Gu, Yihan Sun |
| 2026 | Pipelonk: Accelerating End-to-End Zero-Knowledge Proof Generation on GPUs for PLONK-Based Protocols. Zhiyuan Zhang, Yanxin Cai, Wenhao Yin, Xueyu Wu, Yi Wang, Lei Ju, Zhuoran Ji |
| 2026 | Proceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2026, Sydney, NSW, Australia, 31 January 2026 - 4 February 2026. Tony Hosking, Madan Musuvathi, Kenjiro Taura |
| 2026 | ROME: Maximizing GPU Efficiency for All-Pairs Shortest Path via Taming Fine-Grained Irregularities. Weile Luo, Yuhan Chen, Xiangrui Yu, Qiang Wang, Ruibo Fan, Hongyuan Liu, Xiaowen Chu |
| 2026 | Rethinking Thread Scheduling under Oversubscription: A User-Space Framework for Coordinating Multi-runtime and Multi-process Workloads. Aleix Roca, Vicenç Beltran |
| 2026 | RoMeo: Mitigating Dual-dimensional Outliers with Rotated Mixed Precision Quantization. Qihao Zhang, Mingliang Tang, Mingshu Zhai, Kinman Lei, Jidong Zhai |
| 2026 | Root-Down Exposure for Maximal Clique Enumeration on GPUs. Zhe Pan, Peng Qu, Youhui Zhang |
| 2026 | SPIDER: Unleashing Sparse Tensor Cores for Stencil Computation via Strided Swapping. Qiqi Gu, Chenpeng Wu, Heng Shi, Jianguo Yao |
| 2026 | Scaling GPU-to-CPU Migration for Efficient Distributed Execution on CPU Clusters. Ruobing Han, Hyesoon Kim |
| 2026 | Sharded Elimination and Combining for Highly-Efficient Concurrent Stacks. Ajay Singh, Nikos Metaxakis, Panagiota Fatourou |
| 2026 | TAC: Cache-Based System for Accelerating Billion-Scale GNN Training on Multi-GPU Platform. Zhiqiang Liang, Hongyu Gao, Jue Wang, Fang Liu, Xingguo Shi, Junyu Gu, Peng Di, Sian Li, Lei Tang, Chunbao Zhou, Lian Zhao, Yangang Wang, Xuebin Chi |
| 2026 | Towards Singular Value Decomposition for Rank-Deficient Matrices: An Efficient and Accurate Algorithm on GPU Architectures. Lu Shi, Weiwei Xu, Shaoshuai Zhang |
| 2026 | Trojan Horse: Aggregate-and-Batch for Scaling Up Sparse Direct Solvers on GPU Clusters. Yida Li, Siwei Zhang, Yiduo Niu, Yang Du, Qingxiao Sun, Zhou Jin, Weifeng Liu |
| 2026 | UFO Trees: Practical and Provably-Efficient Parallel Batch-Dynamic Trees. Quinten De Man, Atharva Sharma, Kishen N. Gowda, Laxman Dhulipala |
| 2026 | VDHA: Vector-Driven Hash Aggregation for Sparse Matrix-Sparse Vector Multiplication on GPUs. Yuchen Li, Zhe Pan, Peng Qu, Youhui Zhang |
| 2026 | Waste-Efficient Work Stealing. Kyle Singer, Kunal Agrawal, Tao B. Schardl |
| 2026 | zBuffer: Zero-Copy and Metadata-Free Serialization for Fast RPC with Scatter-Gather Reflection. Xiangyu Liu, Huiba Li, Shun Gai, Youmin Chen, Yiming Zhang |