Large-Scale Data Processing / Cloud Computing
Lecture 1: Introduction to MapReduce

Hongfei Yan
School of Electronics Engineering and Computer Science, Peking University
7/9/2013

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Based on slides by Jimmy Lin, University of Maryland
SEWMGroup

What is this course about?
- Data-intensive information processing
- Large-data ("web-scale") problems
- Focus on MapReduce programming
- An entry-level course

What is MapReduce?
- A programming model for expressing distributed computations at a massive scale
- An execution framework for organizing and performing such computations
- An open-source implementation called Hadoop
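The slides name the model but show no code, so for concreteness here is the canonical word-count job, a minimal sketch written against the standard Hadoop Java API (the org.apache.hadoop.mapreduce classes; Job.getInstance assumes a reasonably recent Hadoop release). The mapper emits a (word, 1) pair for every token, the framework groups the pairs by word, and the reducer sums each group.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for each input line, emit (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the framework groups values by word; sum each group.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged as a jar, a job like this is typically launched with `hadoop jar wordcount.jar WordCount <input> <output>`.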
Why Large Data?

How much data?
- Google processes 20 PB a day (2008)
- The Wayback Machine has 3 PB + 100 TB/month (3/2009)
- Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
- eBay has 6.5 PB of user data + 50 TB/day (5/2009)
- CERN's LHC will generate 15 PB a year

"640K ought to be enough for anybody."

Happening everywhere!
[Figure: data is exploding across domains — molecular biology (cancer) via microarray chips, particle events (LHC) via particle colliders, simulations (Millennium) via microprocessors, network traffic (spam) via fiber optics; the slide cites rates of 300M/day, 1B, and 1M/sec.]

[Four photographs of the Large Hadron Collider. Photos: Maximilien Brice, CERN.]
No data like more data!
- (Banko and Brill, ACL 2001)
- (Brants et al., EMNLP 2007)
- s/knowledge/data/g;
- How do we get here if we're not Google?

Example: Information Extraction
- Answering factoid questions
  - Pattern matching on the Web works amazingly well
  - "Who shot Abraham Lincoln?" → search for the pattern "X shot Abraham Lincoln"
- Learning relations
  - Start with seed instances: Birthday-of(Mozart, 1756), Birthday-of(Einstein, 1879)
  - Search for patterns on the Web: "Wolfgang Amadeus Mozart (1756 - 1791)" and "Einstein was born in 1879" yield the patterns "PERSON (DATE" and "PERSON was born in DATE"
  - Use the patterns to find more instances

(Brill et al., TREC 2001; Lin, ACM TOIS 2007)
(Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002)
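The systems cited above are far more sophisticated than this, but a toy sketch of the pattern step — applying learned surface patterns such as "PERSON was born in DATE" to harvest new Birthday-of instances — can be written with plain regular expressions. Everything here (class name, patterns, test snippet) is invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BirthdayExtractor {
  // "PERSON was born in DATE" — a surface pattern induced from seeds
  // like Birthday-of(Einstein, 1879).
  private static final Pattern BORN_IN =
      Pattern.compile("([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (\\d{4})");
  // "PERSON (DATE - ...)" — e.g. "Wolfgang Amadeus Mozart (1756 - 1791)".
  private static final Pattern PAREN_DATES =
      Pattern.compile("([A-Z][a-z]+(?: [A-Z][a-z]+)*) \\((\\d{4}) ?- ?\\d{4}\\)");

  // Scan free text and collect (person, birth year) pairs.
  public static List<String[]> extract(String text) {
    List<String[]> relations = new ArrayList<>();
    for (Pattern p : new Pattern[] {BORN_IN, PAREN_DATES}) {
      Matcher m = p.matcher(text);
      while (m.find()) {
        relations.add(new String[] {m.group(1), m.group(2)});
      }
    }
    return relations;
  }

  public static void main(String[] args) {
    String snippet = "Wolfgang Amadeus Mozart (1756 - 1791) wrote operas. "
                   + "Einstein was born in 1879.";
    for (String[] r : extract(snippet)) {
      System.out.println("Birthday-of(" + r[0] + ", " + r[1] + ")");
    }
  }
}
```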
Example: Scene Completion
- Image database grouped by semantic content
  - 30 different Flickr groups
  - 2.3M images total (396 GB)
- Select the candidate images most suitable for filling the hole
  - Classify images with the gist scene detector [Torralba]
  - Color similarity
  - Local context matching
- Computation
  - Index images offline
  - 50 min. scene matching, 20 min. local matching, 4 min. compositing
  - Reduces to 5 minutes total by using 5 machines
- Extension: Flickr has over 500 million images

Hays and Efros (CMU), "Scene Completion Using Millions of Photographs," SIGGRAPH 2007
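Why do extra machines help so much? Scene matching dominates the runtime and is embarrassingly parallel: each machine scans a disjoint partition of the precomputed descriptors, and only the per-partition winners need to be merged. Below is a toy, single-process sketch of that scatter/gather pattern; the fake descriptor() and all names are stand-ins invented for illustration, not the paper's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SceneMatchSketch {
  static final int PARTITIONS = 5;        // the "5 machines" on the slide
  static final int NUM_IMAGES = 100_000;  // stand-in for the 2.3M images

  // Fake, deterministic stand-in for a precomputed gist descriptor.
  static double[] descriptor(int imageId) {
    Random r = new Random(imageId);
    double[] d = new double[960];
    for (int i = 0; i < d.length; i++) d[i] = r.nextDouble();
    return d;
  }

  static double sqDistance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double t = a[i] - b[i]; s += t * t; }
    return s;
  }

  public static void main(String[] args) throws Exception {
    double[] query = descriptor(-1);  // descriptor of the query scene
    ExecutorService pool = Executors.newFixedThreadPool(PARTITIONS);
    List<Future<Integer>> winners = new ArrayList<>();
    int chunk = NUM_IMAGES / PARTITIONS;
    for (int p = 0; p < PARTITIONS; p++) {       // scatter: one task per partition
      final int lo = p * chunk;
      final int hi = (p == PARTITIONS - 1) ? NUM_IMAGES : lo + chunk;
      winners.add(pool.submit(() -> {
        int best = lo;
        double bestDist = Double.MAX_VALUE;
        for (int id = lo; id < hi; id++) {       // scan this partition only
          double d = sqDistance(query, descriptor(id));
          if (d < bestDist) { bestDist = d; best = id; }
        }
        return best;
      }));
    }
    for (Future<Integer> w : winners)            // gather: merge partition winners
      System.out.println("candidate image " + w.get());
    pool.shutdown();
  }
}
```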
More Data = More Gains?
CNNIC statistics on Internet development in China: by the end of June 2010, China had 420 million Internet users, with penetration continuing to rise to 31.8%. Mobile Internet users were the main driver of overall growth, adding 43.34 million in half a year to reach 277 million, an increase of 18.6%. Notably, the commercialization of the Internet accelerated rapidly: online shoppers nationwide reached 140 million, and half-year user growth for online payment, online shopping, and online banking was all around 30%, far outpacing other categories of applications.

Basic statistics on China's press and publishing industry, 2009: 238,868 book titles were published (145,475 first editions; 93,393 new and reprinted editions), with a total print run of 3.788 billion copies and 31.246 billion printed sheets, equivalent to 734,000 tons of paper (including 141 million sheets, or 3,300 tons, for supplements), at a total list price of 56.727 billion yuan (including 473 million yuan for supplements). Year over year, titles grew 8.86% (first editions 11.24%; reprints 5.36%), total copies 4.53%, total sheets 4.61%, and total list price 8.94%.

Did you know?
- "We are currently preparing our students for jobs that don't yet exist ..."
- "It is estimated that a week's worth of the New York Times contains more information than a person was likely to come across in a lifetime in the 18th century."
- "The amount of new technical information is doubling every 2 years."
- "So what does IT ALL MEAN?"
- "We are living in exponential times."
Two Different Views
- Jennifer Widom, a "thrower-awayer": "Discarding things and re-finding them when necessary costs far less than maintaining them." She describes "trying to live an efficient life so that one has time to work and be with one's family."
- Gordon Bell: MyLifeBits, a lifetime store of everything.

Information Overloading
- One reason we fail to apply what we learn: information overload.
- Of information we encounter only once, we usually remember just a small fraction.
- We should learn fewer things deeply rather than many things shallowly.
- The key to mastering something is spaced repetition.
- Once people have truly mastered their work, they become more creative, and can even work wonders.

What is Cloud Computing?
- The best thing since sliced bread?
- Before clouds: grids, vector supercomputers
- "Cloud computing" means many different things:
  - Large-data processing
  - A rebranding of Web 2.0
  - Utility computing
  - Everything as a service

Rebranding of Web 2.0
- Rich, interactive web applications; "clouds" refer to the servers that run them
- AJAX as the de facto standard (for better or worse)
- Examples: Facebook, YouTube, Gmail
- "The network is the computer," take two:
  - User data is stored "in the clouds"
  - Rise of the netbook, smartphones, etc.
  - The browser is the OS
Utility Computing
[Image source: Wikipedia (electricity meter)]
- What? Computing resources as a metered service ("pay as you go"); the ability to dynamically provision virtual machines
- Why?
  - Cost: capital vs. operating expenses
  - Scalability: "infinite" capacity
  - Elasticity: scale up or down on demand
- Does it make sense?
  - Benefits to cloud users
  - Business case for cloud providers

"I think there is a world market for about five computers."

Everything as a Service
- Utility computing = Infrastructure as a Service (IaaS)
  - Why buy machines when you can rent cycles?
  - Examples: Amazon's EC2, Rackspace
- Platform as a Service (PaaS)
  - Give me a nice API and take care of the maintenance, upgrades, ...
  - Example: Google App Engine
- Software as a Service (SaaS)
  - Just run it for me!
  - Examples: Gmail, Salesforce

Utility Computing
"Pay as you go" is like plugging into a wall socket: you get the same voltage Microsoft gets, you simply use less and pay less. The goal of utility computing is to give computing resources the same service model, so that a user can tap the computing resources of a Fortune 500 company and just use less, pay less. This is an important aspect of cloud computing.

Platform as a Service (PaaS)
For building web applications and services, PaaS provides a complete, Internet-based, integrated environment spanning development, testing, deployment, operation, and maintenance. In particular, it is built on a multi-tenant architecture from the start: users do not have to handle multi-user concurrency themselves; the platform does, including concurrency management, scalability, failure recovery, and security.

Software as a Service (SaaS)
A model of software deployment whereby a provider licenses an application to customers for use as a service on demand.
Who cares?
- Ready-made large-data problems
  - Lots of user-generated content, and even more user-behavior data
  - Examples: Facebook friend suggestions, Google ad placement
  - Business intelligence: gather everything in a data warehouse and run analytics to generate insight
- Utility computing
  - Provision Hadoop clusters on demand in the cloud
  - Lower barrier to entry for tackling large-data problems
  - Commoditization and democratization of large-data capabilities

Story around Hadoop

Google-IBM Cloud Computing Initiative
In early October 2007, Google and IBM signed agreements with six universities to provide courses and support services for developing software on large distributed computing systems, helping students and researchers gain experience building web-scale applications. The core of the program is teaching the MapReduce algorithm and the Hadoop file system. The two companies each committed US$20-25 million to supply the hardware, software, and related services needed by computer science faculty and students.

Cloud Computing Initiative
- Google and IBM team on a cloud computing initiative for universities (2007)
  - Provide several hundred computers, with access through the Internet, to test parallel programming projects
  - The idea for the program came from Google senior software engineer Christophe Bisciglia
- Google Code University

The Information Factories
- The Googleplex (pre-2008):
  - Servers number 450,000, according to the lowest estimate
  - 200 petabytes of hard disk storage
  - Four petabytes of RAM
  - To handle the current load of 100 million queries a day, input/output bandwidth must be in the neighborhood of 3 petabits per second

Google Infrastructure
- 2003: "The Google File System," SOSP 2003. Bolton Landing, NY, USA: ACM Press.
- 2004: "MapReduce: Simplified Data Processing on Large Clusters," OSDI 2004.
- 2006: "Bigtable: A Distributed Storage System for Structured Data" (awarded Best Paper!), OSDI 2006.
Hadoop Project
- Doug Cutting

History of Hadoop
- 2004: Initial versions of what is now the Hadoop Distributed File System and MapReduce implemented by Doug Cutting & Mike Cafarella
- December 2005: Nutch ported to the new framework; Hadoop runs reliably on 20 nodes
- January 2006: Doug Cutting joins Yahoo!
- February 2006: Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS
- March 2006: Formation of the Yahoo! Hadoop team
- April 2006: Sort benchmark run on 188 nodes in 47.9 hours
- May 2006: Yahoo! sets up a Hadoop research cluster of 300 nodes
- May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark)
- October 2006: Research cluster reaches 600 nodes
- December 2006: Sort times of 1.8 hrs on 20 nodes, 3.3 hrs on 100 nodes, 5.2 hrs on 500 nodes, 7.8 hrs on 900 nodes
- January 2007: Research cluster reaches 900 nodes
- April 2007: Research clusters: two clusters of 1,000 nodes
- April 2008: Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes
- October 2008: Loading 10 terabytes of data per day onto research clusters
- March 2009: 17 clusters with a total of 24,000 nodes
- April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes)

Google Code University
- 2008 Seminar: Mass Data Processing Technology on Large Scale Clusters, Tsinghua University (Aaron Kimball)
Startup: Cloudera
- Cloudera is pushing a commercial distribution of Hadoop
- Mike Olson, Aaron Kimball, Doug Cutting, Christophe Bisciglia, Tom White

Course Administrivia

Textbooks:
- [Tom] Tom White, Hadoop: The Definitive Guide, 3rd edition, O'Reilly, May 2012.
- [Lin] Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, January 2013.

(This schedule is tentative and subject to change without notice.)

Recap
- Why large data?
- Cloud computing
- The story around Hadoop