1. High Performance Deep Learning System on GPU cluster

PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management
Jiarui Fang, Zilin Zhu, Shenggui Li, Hui Su, Yang Yu, Jie Zhou, Yang You. IEEE Transactions on Parallel and Distributed Systems [PDF] [Software] [PDF]

TurboTransformers: An Efficient GPU Serving System For Transformer Models
Jiarui Fang, Yang Yu, Chengduo Zhao, Jie Zhou. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2021), Virtual Event, Republic of Korea [PDF] [Software]

RedSync: Reducing synchronization bandwidth for distributed deep learning training system
Jiarui Fang, Haohuan Fu, Guangwen Yang, Cho-Jui Hsieh. Journal of Parallel and Distributed Computing (JPDC) 133, 30-39 [PDF]

2. High Performance Deep Learning System on Sunway TaihuLight Supercomputer

swATOP: Automatically Optimizing Deep Learning Operators on SW26010 Many-Core Processor.
Wei Gao*, Jiarui Fang*, Wenlai Zhao, Jinzhe Yang, Long Wang, Lin Gan, Haohuan Fu, Guangwen Yang (* equal contribution). Proceedings of the 48th International Conference on Parallel Processing (ICPP 2019).
[PDF]

swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight
Jiarui Fang*, Liandeng Li*, Haohuan Fu, Jinlei Jiang, Wenlai Zhao, Conghui He, Xin You, Guangwen Yang (* equal contribution). IEEE Cluster (Cluster 2018), Belfast, UK, 2018.
[PDF] [Software]

swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight Supercomputer
Jiarui Fang, Haohuan Fu, Wenlai Zhao, Bingwei Chen, Weijie Zheng, Guangwen Yang. 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS 2017)
[PDF] [PPT] [Software]

3. High Performance Geophysics Applications

Cache-friendly Design for Complex Spatially-variable Coefficient Stencils on Many-core Architectures
Jiarui Fang, Haohuan Fu and Guangwen Yang. IEEE 23rd International Conference on High Performance Computing, Data, and Analytics (HiPC 2016), 2016.12 [PDF]

Optimizing Complex Spatially-Variant Coefficient Stencils for Seismic Modeling on GPU.
Jiarui Fang, Haohuan Fu, He Zhang, et al. IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS 2015) [PDF]

GPU-based explicit time evolution method.
Jiarui Fang, Haohuan Fu, Guangwen Yang, et al. The 84th Society of Exploration Geophysicists Technical Program Expanded Abstracts (SEG 2014) [PDF]