
The Korea Advanced Institute of Science and Technology (KAIST) announced on Thursday that a research team led by Min Soo Yoo has developed a simulation tool capable of predicting and optimizing the training time of large language models (LLMs) in large-scale distributed systems.
Training these models requires thousands of GPUs, and training time and cost vary dramatically depending on the parallelization strategy employed.
However, because evaluating alternative strategies at scale carries enormous computational cost and time, companies have relied on only a handful of proven configurations, an approach criticized for using GPU resources inefficiently.
Yoo’s team conducted an in-depth analysis of distributed parallelization strategies for LLMs and developed vTrain, a simulation framework that can estimate training time.
vTrain decomposes the training process into computational units on an execution graph and predicts overall training time by profiling the execution time of each individual operation.
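To illustrate the general idea, here is a minimal sketch of profiling-based prediction on an execution graph: each node carries a measured per-operation runtime, and the end-to-end step time falls out as the critical path through the dependency graph. The function, the toy operation names, and the timings are hypothetical illustrations, not vTrain's actual code or numbers.

```python
from collections import defaultdict, deque

def predict_time(profiled_ms, edges):
    """Estimate end-to-end step time as the critical path of an execution DAG.

    profiled_ms: dict mapping each operation name to its measured runtime (ms)
    edges: list of (src, dst) dependency pairs
    """
    preds, succs, indeg = defaultdict(list), defaultdict(list), defaultdict(int)
    for src, dst in edges:
        succs[src].append(dst)
        preds[dst].append(src)
        indeg[dst] += 1

    finish = {}
    ready = deque(op for op in profiled_ms if indeg[op] == 0)
    while ready:
        op = ready.popleft()
        # An operation starts only after every dependency has finished.
        start = max((finish[p] for p in preds[op]), default=0.0)
        finish[op] = start + profiled_ms[op]
        for nxt in succs[op]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    return max(finish.values())

# Hypothetical toy graph: forward/backward passes plus a gradient
# all-reduce that overlaps with the remaining backward computation.
times = {"fwd1": 1.2, "fwd2": 1.5, "bwd2": 3.0, "bwd1": 2.4, "allreduce": 2.0}
deps = [("fwd1", "fwd2"), ("fwd2", "bwd2"),
        ("bwd2", "bwd1"), ("bwd2", "allreduce")]
print(predict_time(times, deps))  # ~8.1 ms along the critical path
```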
To achieve this, the team introduced a method for generating execution graphs that faithfully capture the communication patterns of each parallelization technique, along with an operation-selection method that reduces profiling overhead.
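One way such a selection step can cut profiling cost, sketched below under the assumption that repeated transformer layers produce identical operations, is to measure each unique (operation, shape) signature only once and reuse the result everywhere it appears; vTrain's exact selection criterion may differ, and the workload here is a stand-in.

```python
import time

_profile_cache = {}

def profile_op(op_name, shape, run_fn, repeats=10):
    """Measure run_fn once per unique (op_name, shape) signature, then reuse."""
    key = (op_name, tuple(shape))
    if key not in _profile_cache:
        run_fn()  # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(repeats):
            run_fn()
        # Average runtime in milliseconds, cached for later lookups.
        _profile_cache[key] = (time.perf_counter() - start) / repeats * 1e3
    return _profile_cache[key]

# For a 48-layer model whose layers share shapes, the expensive measurement
# runs once; the remaining 47 lookups hit the cache. The lambda below is a
# hypothetical stand-in for launching a real GPU kernel.
elapsed = profile_op("matmul", (4096, 4096), lambda: sum(range(10_000)))
```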
To validate its accuracy, the researchers compared vTrain’s predictions against actual training times measured in multi-GPU environments.
The predictions proved reliable, with an average absolute error of 8.37% in a single-node environment (8 A100 GPUs) and 14.73% in a multi-node setup (up to 512 A100 GPUs).
Notably, parallelization strategies optimized with vTrain improved GPU efficiency by more than 10% and reduced training costs by more than 5% compared with existing methods.
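A fast, accurate simulator makes this kind of optimization tractable: candidate parallelization plans can be scored in software instead of on a cluster. The sketch below searches 3D-parallel configurations using a hypothetical `simulate_step_ms` predictor standing in for a vTrain-style model; it illustrates the workflow, not the team's actual search procedure, and the candidate degrees and cost rate are assumptions.

```python
from itertools import product

def best_plan(num_gpus, simulate_step_ms, gpu_cost_per_hour=2.0):
    """Pick the (data, tensor, pipeline)-parallel plan with the lowest cost."""
    best = None
    for tp, pp in product([1, 2, 4, 8], [1, 2, 4, 8, 16]):
        if num_gpus % (tp * pp):
            continue  # parallel degrees must factor the GPU count
        dp = num_gpus // (tp * pp)
        step_ms = simulate_step_ms(dp=dp, tp=tp, pp=pp)  # simulated, not run
        # Dollars per training step: hours per step x GPUs x hourly rate.
        dollars_per_step = step_ms / 3.6e6 * num_gpus * gpu_cost_per_hour
        if best is None or dollars_per_step < best[0]:
            best = (dollars_per_step, {"dp": dp, "tp": tp, "pp": pp})
    return best
```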
To support the wider AI research community, the team has released the vTrain framework and more than 1,500 training-time measurements as open-source resources.
Yoo explained, “vTrain employs a profiling-based simulation technique that surpasses traditional empirical methods in increasing GPU efficiency and reducing training costs. By releasing it as open-source, we aim to help companies significantly cut the expenses of training ultra-large AI models.”
The research was presented at the IEEE/ACM International Symposium on Microarchitecture (MICRO), a joint international conference of the Institute of Electrical and Electronics Engineers (IEEE) and the Association for Computing Machinery (ACM).