Abstract: As machine learning models and data sets grow, a single node can no longer meet the computing and storage requirements of large-scale training. A common solution is to run large-scale machine learning algorithms on distributed clusters. However, the performance of distributed clusters is significantly degraded by stragglers, that is, nodes that finish their tasks much later than the others. Recent studies have used coded computation to mitigate the straggler problem, but the performance of coded computation schemes for large-scale matrix multiplication has not been fully studied and analyzed. This paper examines the task completion time of coded computation schemes for large-scale matrix multiplication and also considers the total computation overhead of all nodes participating in the distributed computation. Expressions are given for the task completion time and for the total computing time of the cluster machines when the time each worker node takes to complete its computing task follows a uniform distribution. The performance of three coding schemes is compared and analyzed. Experiments compare the effects of different settings on the task completion time and on the total computing cost of the computing nodes, and a heuristic algorithm is proposed to provide a basis for selecting among the coded computation schemes.
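To make the idea of coded computation concrete, the following is a minimal sketch of an (n, k) coded matrix-vector multiplication in which the master waits only for the fastest k of n workers. It is an illustration under stated assumptions, not one of the schemes analyzed in this paper: the Vandermonde generator matrix, the helper names `encode_rows` and `decode`, the problem sizes, and the Uniform(0, 1) worker times are all hypothetical choices made for the example.

```python
import numpy as np

def encode_rows(A, n, k):
    """Split A into k row blocks and form n coded blocks (hypothetical helper).

    Uses a Vandermonde generator with distinct points, so any k of its
    rows form an invertible k x k matrix.
    """
    blocks = np.split(A, k, axis=0)            # requires A.shape[0] % k == 0
    points = np.arange(1, n + 1, dtype=float)  # distinct evaluation points
    G = np.vander(points, k, increasing=True)  # generator matrix, shape (n, k)
    coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]
    return G, coded

def decode(G, results, finished):
    """Recover A @ x from the results of any k finished workers."""
    k = G.shape[1]
    idx = sorted(finished)[:k]
    Y = np.stack([results[i] for i in idx])    # coded partial products
    B = np.linalg.solve(G[idx, :], Y)          # rows are the uncoded A_j @ x
    return B.reshape(-1)                       # concatenate blocks in order

# Toy run: n = 5 workers, any k = 3 results suffice (assumed sizes).
rng = np.random.default_rng(0)
n, k = 5, 3
A = rng.standard_normal((6, 4))
x = rng.standard_normal(4)
G, coded = encode_rows(A, n, k)

# Assumption: each worker's finishing time is drawn from Uniform(0, 1).
times = rng.uniform(0.0, 1.0, size=n)
finished = [int(i) for i in np.argsort(times)[:k]]   # the k fastest workers
results = {i: coded[i] @ x for i in finished}        # stragglers are ignored

decoded = decode(G, results, finished)
completion_time = float(np.sort(times)[k - 1])       # time of the k-th finisher
print(np.allclose(decoded, A @ x), completion_time)
```

In this sketch the task completion time is the k-th smallest worker time rather than the largest, which is how redundancy hides stragglers. The price is extra total computation, since all n workers are assigned coded work even though only k results are used.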