Academic Information

Performance Challenges in Distributed Deep Neural Network Systems

Published: 2018-10-15

Time: October 15 (Monday), 15:00

Venue: Room 308, Genetics Building (遗传学楼)

Speaker: Cheng-Fu Chou (周承复)

Host: 叶士青 (Micro-Nano Systems Center)

 

The success of Deep Neural Networks (DNNs) has created significant interest in software tools, hardware architectures, and cloud systems that can meet the huge computational demand of DNN training jobs. Training neural networks with many hidden layers successfully depends on efficient algorithms, large datasets, and large-scale data centers. When scaling up is not enough, data and computation must be distributed among multiple nodes, e.g., multiple CPUs, GPUs, or even FPGAs with dedicated hardware architectures. In this talk, we first review existing solutions for distributing stochastic gradient descent computations at data-center scale, along with the performance issues they face in practice. We then present initial results from our recent work, which addresses two important problems in applying this strategy to large-scale clusters running multiple, heterogeneous jobs. First, we propose and validate a queueing model that estimates the throughput of a training job as a function of the number of nodes assigned to it. Second, we use these throughput estimates to explore several classes of scheduling heuristics that reduce response time when heterogeneous jobs are continuously submitted to a large-scale cluster. The heuristics are evaluated through extensive simulations of realistic DNN workloads, which also investigate the effects of early termination, a common scenario for DNN training jobs.
 

About Prof. Cheng-Fu Chou (周承复):

Cheng-Fu Chou received the M.S. and Ph.D. degrees from the University of Maryland, College Park, MD, USA, in 1999 and 2002, respectively. After graduating, he joined the Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, where he is currently a Professor. He was a Visiting Scholar at the Department of Computer Science, University of Southern California, Los Angeles, CA, USA, in 2002 and in 2017-2018. His research interests include distributed machine learning systems, software-defined networking, wireless networks, and their performance evaluation.
