Original link: https:/
Marian Dvorsky, Software Engineer, Google Cloud Platform

History of massive-scale sorting experiments at Google

Thursday, February 18, 2016

We've tested MapReduce by sorting large amounts of random data ever since we created the tool. We like sorting, because it's easy to generate an arbitrary amount of data, and it's easy to validate that the output is correct.

Even the original MapReduce paper reports a TeraSort result. Engineers run 1TB or 10TB sorts as regression tests on a regular basis, because obscure bugs tend to be more visible on a large scale. However, the real fun begins when we increase the scale even further. In this post I'll talk about our experience with some petabyte-scale sorting experiments we did a few years ago, including what we believe to be the largest MapReduce job ever: a 50PB sort.

These days, GraySort is the large-scale sorting benchmark of choice. In GraySort, you must sort at least 100TB of data (as 100-byte records with the first 10 bytes being the key), lexicographically, as fast as possible. The site sortbenchmark.org tracks official winners for this benchmark. We never entered the official competition.
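To make the record format concrete, here is a small hypothetical sketch in Python (not the benchmark's official tooling) that generates GraySort-style records, each 100 bytes with the first 10 bytes as the key, and checks that a sequence of such records is in lexicographic key order:

import os

RECORD_SIZE = 100  # each GraySort record is 100 bytes
KEY_SIZE = 10      # the first 10 bytes form the sort key

def generate_records(num_records):
    """Yield random fixed-size records: 10-byte key + 90-byte payload."""
    for _ in range(num_records):
        yield os.urandom(KEY_SIZE) + os.urandom(RECORD_SIZE - KEY_SIZE)

def is_sorted(records):
    """Return True if records appear in lexicographic order of their keys."""
    prev_key = None
    for record in records:
        key = record[:KEY_SIZE]
        if prev_key is not None and key < prev_key:
            return False
        prev_key = key
    return True

if __name__ == "__main__":
    data = sorted(generate_records(1000), key=lambda r: r[:KEY_SIZE])
    print(is_sorted(data))  # True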

MapReduce happens to be a good fit for solving this problem, because the way it implements reduce is by sorting the keys. With the appropriate (lexicographic) sharding function, the output of MapReduce is a sequence of files comprising the final sorted dataset.
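To illustrate why this works, here is a minimal in-memory sketch (my own simplification in Python, not Google's MapReduce API): the map and reduce functions are identities, and a lexicographic range partitioner assigns each record to a shard, so concatenating the shard outputs in shard order yields a globally sorted dataset.

import bisect
from collections import defaultdict

def sort_with_mapreduce(records, shard_boundaries):
    """Toy MapReduce-style sort: identity map/reduce plus range sharding.

    shard_boundaries is a sorted list of keys splitting the key space.
    """
    # "Map" phase (identity): route each record to a shard by key range.
    shards = defaultdict(list)
    for record in records:
        key = record[:10]  # 10-byte GraySort-style key
        shards[bisect.bisect_right(shard_boundaries, key)].append((key, record))

    # "Reduce" phase: the framework sorts keys within each shard before
    # calling reduce; the reduce function itself is the identity.
    output = []
    for shard_id in sorted(shards):
        output.extend(record for _, record in sorted(shards[shard_id]))

    # Shards cover disjoint, increasing key ranges, so the concatenation
    # of the per-shard outputs is globally sorted.
    return output

In the real job each shard is written as its own output file, and reading those files back in shard order gives the sorted dataset described above.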

Once in a while, when a new cluster in a datacenter came up (typically for use by the search indexing team), we in the MapReduce team got the opportunity to play for a few weeks before the real workload moved in. This is when we had a chance to "burn in" the cluster, stretch the limits of the hardware, destroy some hard drives, play with some really expensive equipment, learn a lot about system performance, and win (unofficially) the sorting benchmark.

Figure 1: Google Petasort records over time.

2007
(1PB, 12.13 hours, 1.37 TB/min, 2.9 MB/s/worker)

We ran our very first Petasort in 2007. At that time, we were mostly happy that we got it to finish at all, although there are some doubts that the output was correct (we didn't verify the correctness to know for sure). The job wouldn't finish unless we disabled the mechanism which checks that the output of a map shard and its backup are the same. We suspect that this was a limitation of GFS (Google File System), which we used for storing the input and output. GFS didn't have enough checksum protection, and sometimes returned corrupted data. Unfortunately, the text format used for the benchmark doesn't have any checksums embedded for MapReduce to notice (the typical use of MapReduce at Google uses file formats with embedded checksums).
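For readers unfamiliar with embedded checksums, here is a hypothetical sketch (not any actual Google file format) of record framing with a per-record CRC32, which is enough for a reader such as MapReduce to notice silent corruption instead of propagating it:

import struct
import zlib

def write_record(out, payload):
    """Frame a record as: 4-byte length, 4-byte CRC32, then the payload."""
    out.write(struct.pack("<II", len(payload), zlib.crc32(payload)))
    out.write(payload)

def read_record(inp):
    """Read one framed record, raising if the stored CRC32 does not match."""
    header = inp.read(8)
    if not header:
        return None  # end of stream
    length, expected_crc = struct.unpack("<II", header)
    payload = inp.read(length)
    if zlib.crc32(payload) != expected_crc:
        raise IOError("corrupted record: checksum mismatch")
    return payload

A plain text format carries no such framing, so corrupted bytes simply look like different, still plausible records.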

2008
(1PB, 6.03 hours, 2.76 TB/min, 11.5 MB/s/worker)

2008 was the first time we focused on tuning. We spent several days tuning the number of shards, various buffer sizes, prefetching/write-ahead strategies, page cache usage, etc. We blogged about the result here. The bottleneck ended up being writing the three-way replicated output to GFS, which was the standard we used at Google at the time. Anything less would create a high risk of data loss.

2010
(1PB, 2.95 hours, 5.65 TB/min, 11.8 MB/s/worker)

For this test, we used the new version of the GraySort benchmark that uses incompressible data. In the previous years, while we were reading/writing 1PB from/to GFS, the amount of data actually shuffled was only about 300TB, because the ASCII format used the previous years compresses well.

This was also the year of Colossus, the next-generation distributed storage system at Google, the successor to GFS. We no longer had the corruption issues we encountered before with GFS. We also used Reed-Solomon encoding (a new Colossus feature) for output that allowed us to reduce the total amount of data written from 3 petabytes (for three-way replication) to about 1.6 petabytes. For the first time, we also validated that the output was correct.
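As a rough back-of-the-envelope check (my own arithmetic, assuming a hypothetical 6-data/3-parity Reed-Solomon layout rather than the actual Colossus configuration), erasure coding cuts the write volume roughly in half compared to three-way replication:

PB = 10 ** 15

def physical_bytes(logical_bytes, data_chunks, parity_chunks):
    """Bytes actually written for a (data + parity) chunk layout."""
    return logical_bytes * (data_chunks + parity_chunks) / data_chunks

logical = 1 * PB  # roughly 1PB of sorted output

# Three-way replication behaves like 1 data chunk plus 2 full copies: 3x.
print(physical_bytes(logical, 1, 2) / PB)  # 3.0

# An assumed RS(6,3)-style layout: 1.5x overhead, about 1.5PB, in the same
# ballpark as the ~1.6PB reported above.
print(physical_bytes(logical, 6, 3) / PB)  # 1.5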

To reduce the impact of stragglers, we used a dynamic sharding technique called reduce subsharding. This is the precursor to fully dynamic sharding used in Dataflow.

2011
(1PB, 0.55 hours, 30.3 TB/min, 63.1 MB/s/worker)

This year we enjoyed faster networking and started to pay more attention to per-machine efficiency, particularly in I/O. We made sure that all our disk I/O operations were performed in large 2MB blocks versus sometimes as small as 64kB blocks. We used SSDs for part of the data. That got us the first Petasort in under an hour (33 minutes, to be exact), and we blogged about it here. We got really close (within 2x) to what the hardware was capable of given MapReduce's architecture: input/output on distributed storage, persisting intermediate data to disk for fault tolerance (and fault tolerance was important, because it would be common that some disks or even whole machines fail during an experiment at this scale).
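As a trivial illustration of the block-size point (a generic Python sketch, not the actual MapReduce I/O path), copying a file in 2MB blocks issues roughly 32 times fewer I/O operations than copying it in 64kB blocks:

BLOCK_LARGE = 2 * 1024 * 1024  # 2MB
BLOCK_SMALL = 64 * 1024        # 64kB

def copy_file(src_path, dst_path, block_size):
    """Copy a file with reads/writes of the given block size; return op count."""
    ops = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(block_size)
            if not chunk:
                break
            dst.write(chunk)
            ops += 1
    return ops

# For a 1GB file: about 512 reads at 2MB blocks versus about 16,384 at 64kB.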

We also scaled higher, and ran 10PB in 6 hours 27 minutes (26TB/min).

2012
(50PB, 23 hours, 36.2 TB/min, 50 MB/s/worker)

For this test, we shifted our attention to running at a larger scale. With the largest cluster at Google under our control, we ran what we believe to be the largest MapReduce job ever in terms of amount of data shuffled. Unfortunately, the cluster didn't have enough disk space for a 100PB sort, so we limited our sort to "only" 50PB. We ran it a single time and did so without specific tuning; we used the settings from our previous 10PB experiment. It finished in 23 hours, 5 minutes.

Note that this sort was 500 times larger than the GraySort large-scale requirement and twice as fast in throughput as the current 2015 official GraySort winner.

Lessons learned

These experiments taught us a lot about the challenges of running at the scale of 10,000 machines, and about how to tune to run at speeds close to what the hardware is capable of.

While these sorting experiments are fun, they have several drawbacks:

1. Nobody really wants a huge globally sorted output. We haven't found a single use case for the problem as stated.

2. They show that the system is capable of performing well, although they de-emphasise how much effort it is. MapReduce needed a lot of tuning to perform this well. Indeed, we saw many MapReduce jobs in production that perform poorly due to what we would call misconfiguration.

Lately, we've been instead focusing on building systems that make tuning for the most part unnecessary. For example, Dataflow figures out the number of shards automatically (and dynamically reshards as necessary), which takes the manual guesswork out of the picture. But we'll leave this topic, and the results achieved, for future blog posts.

Posted by Marian Dvorsky, Software Engineer, Google Cloud Platform
