How would you evaluate Facebook's paper "Training ImageNet in 1 Hour"?

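On the machine learning side, Kaiming's technical answers on Facebook are already quite complete. On the system design side, I will try to answer some questions that people may be interested in.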

[name missing]'s summary is very good; below I have also tried to expand on the techniques mentioned in his post.

[name missing]'s post on the theoretical explanation is very interesting; I have forwarded it to our co-authors.

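Also, I know that some media outlets and articles have raised the issue that we did not cite [name missing]'s PhD thesis. This was an unintentional oversight, and the citation will of course be added in the new arXiv version. I apologize for the omission on behalf of the other authors as well. I hope everyone will stay focused on the technology: it has not been easy for Chinese researchers to reach the influence they have in machine learning today, and more squabbling serves no purpose. I personally admire the work of both 李沐 and Kaiming. On how the paper is positioned, see Kaiming's statement.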

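(1) fp16. At present, pure fp16 computation on Pascal (16-bit IO, 16-bit accumulation) diverges easily unless it is tuned carefully, because 16-bit accumulation has few mantissa bits and the loss of precision is fairly large. fp16 computation on Volta can use 32-bit accumulation. So in the paper we kept the most common choice, fp32 computation. NVIDIA will publish follow-up material on fp16 training on Volta; the basic conclusion is that it is simpler than fp16 on Pascal.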
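To make the accumulation issue concrete, here is a minimal NumPy sketch. It is only an illustration on the CPU, not GPU code and not anything from the paper: the `accumulate` helper and the choice of 4096 inputs are made up for the example. It sums fp16 inputs once with an fp16 running total and once with an fp32 running total.

```python
import numpy as np

def accumulate(values, acc_dtype):
    """Sum `values` one at a time, keeping the running total in `acc_dtype`."""
    acc = acc_dtype(0)
    for v in values:
        acc = acc_dtype(acc + acc_dtype(v))
    return acc

# 4096 fp16 inputs, each exactly 1.0 (the "16-bit IO" part).
values = np.ones(4096, dtype=np.float16)

fp16_sum = accumulate(values, np.float16)  # 16-bit accumulation
fp32_sum = accumulate(values, np.float32)  # 32-bit accumulation

# fp16 has an 11-bit significand, so 2048 + 1 rounds back to 2048
# and the running total stops growing.
print(float(fp16_sum), float(fp32_sum))  # prints: 2048.0 4096.0
```

The inputs are identical in both runs; only the width of the accumulator changes, which is the difference between the Pascal and Volta modes described above.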

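(2) 50G Ethernet. This is indeed a question worth discussing. Ethernet latency is not as low as Infiniband's, but it is also much cheaper than Infiniband, so calling it "commodity ethernet" is reasonable from a networking point of view. Of course, many real-world networks, for example at universities or on AWS, do not have that much bandwidth. Pieter Noordhuis and I discussed this the other day: if anyone would like to reproduce the results on a slower network, or to discuss how network speed affects convergence, you are welcome to contact us. We are very interested in this too, but network designs differ from one company or university to another, so the analysis will be more complete if we do it together.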
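As a rough way to think about the bandwidth question, the sketch below estimates the time for one gradient exchange at different link speeds. Everything in it is an assumption for illustration rather than a measurement: the bandwidth-only ring allreduce cost model (latency and protocol overhead ignored), the figure of roughly 25.5 million fp32 parameters for ResNet-50, and 32 servers (256 GPUs at 8 per server, as in the paper's setup).

```python
def ring_allreduce_seconds(num_bytes, num_nodes, link_gbit_per_s):
    """Bandwidth-only estimate: each node sends 2*(N-1)/N of the buffer."""
    bytes_per_s = link_gbit_per_s * 1e9 / 8
    traffic = 2 * (num_nodes - 1) / num_nodes * num_bytes
    return traffic / bytes_per_s

GRAD_BYTES = 25_500_000 * 4  # assumed: ~25.5M parameters stored as fp32
NUM_SERVERS = 32             # assumed: 256 GPUs at 8 GPUs per server

for gbit in (50, 10, 1):
    ms = ring_allreduce_seconds(GRAD_BYTES, NUM_SERVERS, gbit) * 1e3
    print(f"{gbit:>2} Gbit/s: about {ms:.0f} ms per gradient exchange")
```

Whether a given figure is acceptable depends on how much of it can be overlapped with the backward pass, which this estimate does not model.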

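(3) How ResNet50 relates to other CNNs. For the great majority of CNNs, especially the more compute-heavy ones, it is relatively easy to hide communication time behind computation. AlexNet is the exception: because of its huge FC layers, the communication time is hard to hide, which is why Alex Krizhevsky came up with the "one weird trick". VGG's FC layers are also somewhat problematic, while Inception has essentially no communication problem.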

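(4) Async vs sync. Following on from the previous point, most CNN training in current production use does not need asynchronous algorithms. From a systems point of view, although some Zhihu users have pointed out that sync is a special case of async, implementing sync on top of async costs some efficiency. For example, if sync SGD is implemented with a general parameter server, the star topology of the PS means that network congestion grows linearly with the number of workers. Sharding can address this, but it wastes resources. Another solution is for every worker to double as a sharded parameter server, but that turns out to be exactly an implementation of scatter-gather allreduce. In addition, sync SGD is often mathematically more stable than async; see Google's "Rethinking synchronized sgd" paper.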
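The observation that "every worker doubles as a sharded parameter server" is the same thing as scatter-gather allreduce can be checked with a small in-memory simulation. This is only a sketch: the function name `sharded_ps_sync_step` is made up, the workers are just list entries in one process, and no real network or parameter server library is involved.

```python
import numpy as np

def sharded_ps_sync_step(grads):
    """Simulate one sync step where worker s also serves shard s.

    grads: list of equal-length 1-D arrays, one gradient per worker.
    Returns the summed gradient as seen by each worker afterwards.
    """
    n = len(grads)
    shards = [np.array_split(g, n) for g in grads]  # shards[worker][shard]
    # "Push": server s receives shard s from every worker and sums it.
    # Viewed as a collective, this is a reduce-scatter.
    reduced = [sum(shards[w][s] for w in range(n)) for s in range(n)]
    # "Pull": every worker fetches every reduced shard.
    # Viewed as a collective, this is an all-gather.
    full = np.concatenate(reduced)
    return [full.copy() for _ in range(n)]

rng = np.random.default_rng(0)
grads = [rng.standard_normal(10) for _ in range(4)]
results = sharded_ps_sync_step(grads)
print(np.allclose(results[0], np.sum(grads, axis=0)))
```

Each worker ends up with the full sum, which is what an allreduce delivers.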

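(5) MPI. Readers familiar with HPC may have noticed that the paper mentions classic MPI algorithms such as double buffer ring reduction. Indeed, in the context of sync SGD, MPI-style APIs are very well defined, and traditional HPC has a large body of research on these algorithms. As an aside, the reason Facebook does not use MPI directly is that most MPI implementations are too heavyweight, while we need almost nothing beyond Broadcast and Allreduce, so we designed the lighter-weight Gloo.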
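For readers who have not seen a ring reduction, here is a minimal single-process simulation of a plain ring allreduce (reduce-scatter followed by all-gather). It is a sketch of the textbook algorithm only: it is not the double-buffered variant mentioned above and not Gloo's implementation, and the function name is made up for the example.

```python
import numpy as np

def ring_allreduce(buffers):
    """Simulate ring allreduce (sum) over a list of equal-length 1-D arrays."""
    n = len(buffers)
    # Each node splits its buffer into n chunks; copy so inputs stay untouched.
    chunks = [[c.copy() for c in np.array_split(b, n)] for b in buffers]

    # Phase 1, reduce-scatter: in step t, node i sends chunk (i - t) % n
    # to its right neighbour, which adds it to its own copy.
    for step in range(n - 1):
        sent = [chunks[i][(i - step) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i - step) % n] += sent[i]
    # Node i now holds the fully reduced chunk (i + 1) % n.

    # Phase 2, all-gather: in step t, node i forwards chunk (i + 1 - t) % n,
    # and the neighbour overwrites its stale copy.
    for step in range(n - 1):
        sent = [chunks[i][(i + 1 - step) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i + 1 - step) % n] = sent[i]

    return [np.concatenate(c) for c in chunks]

inputs = [np.full(6, float(rank + 1)) for rank in range(3)]
outputs = ring_allreduce(inputs)
print(outputs[0])  # every entry is 1 + 2 + 3 = 6.0
```

Each node only ever talks to its neighbour and moves about 2(N-1)/N of the buffer in total, so the per-node traffic stays nearly constant as nodes are added.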

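(6) More on the parameter server. In the async vs sync discussion above I said that much CNN training does not need a PS. That does not mean the PS is useless. For example, when we train combined sparse + dense models, we mix PS and sync SGD, and sometimes even layer hogwild on top. So the communication scheme can be chosen according to each network's ratio of computation to communication.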

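(7) Requirements on the framework. The basic requirements are the following: (a) sufficiently optimized data IO and prefetching; (b) a computation-graph-based execution engine that guarantees concurrent communication and computation; (c) sufficiently low framework overhead. The algorithm can be reproduced on most modern DL frameworks. On earlier frameworks such as Caffe or Torch it would be rather difficult.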
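Requirement (a) can be illustrated with a very small prefetcher built from the Python standard library. This is a hypothetical sketch of the idea (load the next batches in the background while the current one is being consumed), not how Caffe2 or any other framework implements its input pipeline; the `prefetch` name and the `depth` parameter are invented for the example, and producer-side errors are not handled.

```python
import queue
import threading

def prefetch(batches, depth=2):
    """Yield items from `batches` while a background thread loads ahead."""
    q = queue.Queue(maxsize=depth)  # bounded, so the loader cannot run away
    done = object()                 # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            q.put(batch)            # blocks while the queue is full
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item

# Usage: wrap any iterable of batches; here plain integers stand in for data.
for batch in prefetch(range(3)):
    print("training on batch", batch)
```

With real data loading, the time spent inside `batches` overlaps with the training step in the consumer loop instead of being added to it.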

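(8) Caffe2. One of our goals in refactoring Caffe into Caffe2 was to support research projects like this one at the system level; another was to let models move seamlessly to the various deployment platforms. Using Caffe2 was therefore a natural choice. That said, we state explicitly in the paper that our algorithm should be reproducible in any framework that satisfies the conditions in (7).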

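(9) CPU. During training, the CPU mainly handles data preprocessing and acts as the GPU scheduler. Having a CPU with enough cores does matter for the computation. If a machine has plenty of GPUs but not enough CPU compute, other approaches are needed to balance the load between CPU and GPU, such as doing the data preprocessing on other machines or moving the preprocessing entirely onto the GPU.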

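(10) Cost. I cannot give estimates for non-public figures, and there is also the cost of research iterations. But once the algorithm exists, training costs no more than with the traditional approach, and the result arrives in a shorter time, which is one of the reasons strong scaling matters so much. At AWS p2.8xlarge prices, one training run costs roughly 230 US dollars, which is affordable even for a typical startup.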
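One way to arrive at a figure of this size is sketched below. The post does not say how the number was computed, so every input is an assumption: 256 GPUs and a one-hour run are taken from the paper, while 8 GPUs per p2.8xlarge instance and an on-demand price of 7.20 USD per hour are assumed values that are not stated in the post. The p2 GPUs also differ from the ones used in the paper, so the one-hour duration would not necessarily carry over.

```python
import math

NUM_GPUS = 256            # GPUs used in the paper's one-hour run
GPUS_PER_INSTANCE = 8     # assumed: GPUs per p2.8xlarge instance
PRICE_PER_HOUR = 7.20     # assumed: on-demand USD per instance-hour
HOURS = 1.0               # assumed: same wall-clock time as in the paper

instances = math.ceil(NUM_GPUS / GPUS_PER_INSTANCE)
cost = instances * PRICE_PER_HOUR * HOURS
print(f"{instances} instances x {HOURS:g} h = {cost:.2f} USD")
```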

That is all I can think of for now; I will add more if needed.

-- Reposting Kaiming's Facebook post on the impact of the paper --

We had an internal debate on whether we should publish a paper describing how we can achieve the results. I agree there is not so much new, because these have been what I and my colleagues had been doing in the past few years, including how we developed ResNets and Faster R-CNN. After discussing with many people including current/former scientists/engineers from Microsoft, Facebook, Google, Baidu, and universities, we realized that not all the details are widely known by practitioners, engineers, or researchers, and in general there had been limited success at this scale. So finally I was convinced that we should write this white paper: we hope it may be a helpful manual for people who might miss something in their systems.

In my experience, “linear scaling lr” is so surprisingly effective that it helped a lot for us to prototype and develop computer vision algorithm in the past few years, including ResNets, Faster R-CNN, and Mask R-CNN, on the old days when we didn’t have enough 8-GPU or even 4-GPU machines or when we need to migrate baselines. By “surprisingly effective” I mean that we don’t need to re-select any hyper-parameters (in contrast to picking individual lr and schedules as people usually do). This linear scaling lr is not a new thing: in our paper (Sec. 2.1 “Discussion”, p3) we cited Leon Bottou et al.’s survey paper [4] which gave theories behind linear scaling lr (and also warmup). Through personal communications with Leon we realized that this theory is so ancient and so natural that we are even not able to trace back who did it first. I hope to recommend this linear scaling lr “rule” (or theory) to broader audience as I benefited a lot from it in the past few years.

On the contrary, I had little successful experience using the “sqrt” rule: one experimental results can be found in Table 2(a) in our paper. There are discussions on the theoretical correctness of the “linear scaling” rule vs. the “sqrt” rule; but in this paper we share our rich empirical results (covering ImageNet/COCO, pre-training/fine-tuning, classification/detection/segmentation) to the readers and give strong support to the linear rule, as I have experienced in the past few years.

You mentioned “not stable” results using the linear scaling lr rule. This is consistent with our motivation of presenting the warmup techniques, which may find its theoretical support from [4]. I also benefited a lot from the warmup strategy in my research experience in the past few years, which helped me to scale out simpler and made my life much easier. We hope this can help some (if not all) researchers and engineers.
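For reference, the linear scaling rule with warmup discussed in the quoted post can be sketched as a small schedule function. The defaults are assumptions taken from the paper's ImageNet setup rather than from the post itself (a reference learning rate of 0.1 per 256 images and a 5-epoch gradual warmup); the later step decays are left out, and the function name is made up for the example.

```python
def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Linear-scaling rule with gradual warmup; `epoch` may be fractional."""
    # Linear scaling: multiply the reference rate by batch_size / base_batch.
    target = base_lr * batch_size / base_batch
    if epoch < warmup_epochs:
        # Gradual warmup: ramp linearly from base_lr up to the scaled target.
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    return target

for epoch in (0, 2.5, 5, 10):
    print(epoch, round(learning_rate(epoch, batch_size=8192), 4))
```

With a minibatch of 8192 the target rate is 32 times the reference rate, and the ramp is what the post refers to as the remedy for unstable early training.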