Scaling laws for artificial intelligence (AI) models underpin the expansion and enhancement of AI capabilities, empowering systems to learn and respond with unprecedented accuracy and granularity. As AI computing clusters grow in scale, ensuring the reliability of such large-scale systems for training and inference poses great challenges.
On one hand, both training and inference with large-scale models rely on complex interconnected systems. Components, processors, and distributed nodes interact strongly and depend on one another across distinct parallel domains. As the system grows in size, the risk of data inconsistency, deadlock, and starvation rises. Any network interruption or slow response may cause the entire system to enter a sub-healthy state, or even suffer a complete outage. The reliability bottleneck has therefore become a crucial issue to address.
On the other hand, due to the black-box nature of many large-scale models, it is hard to identify and analyze faulty phenomena under various software and hardware configurations. Soft errors and computational failures may manifest in different forms, posing new challenges to fault management and raising the requirements for implementing fault-tolerant systems.
This workshop focuses on the reliability of complex machine learning systems. Relevant discussions may range from attempts to combine theory and engineering of distributed parallel systems to enhance system resilience, to more general approaches for modelling and optimizing large-scale AI training and inference built upon them.
The list of topics includes, but is not limited to:
Authors are invited to submit original unpublished research papers as well as industrial practice papers. Simultaneous submissions to other conferences are not permitted. Detailed instructions for electronic paper submission, panel proposals, and the review process can be found at QRS submission.
Each submission can have a maximum of ten pages. It should include a title, the name and affiliation of each author, a 300-word abstract, and up to six keywords. Shorter papers (up to six pages) are also allowed.
All papers must conform to the QRS conference proceedings format (PDF | Word DOCX | LaTeX) and the Submission Guideline set in advance by QRS 2025. At least one of the authors of each accepted paper is required to pay the full registration fee and present the paper at the workshop. Submissions must be in PDF format and uploaded to the conference submission site. Arrangements are being made to publish extended versions of top-quality papers in selected SCI journals.
Name | Affiliation | Geographic Region |
---|---|---|
Xiaowen Chu | The Hong Kong University of Science and Technology (Guangzhou) | China |
Jianhui Jiang | Tongji University | China |
Jingwen Leng | Shanghai Jiao Tong University | China |
Guiying Li | Southern University of Science and Technology | China |
Shaohuai Shi | Harbin Institute of Technology | China |