Undergraduate Thesis

The Design and Implementation of a Distributed Web Crawler: The Design and Implementation of the Crawler Node

Name:
Student ID:
School: School of Software
Department: Software Engineering
Major: Software Engineering
Grade:
Supervisor:
Date:

Abstract

A search engine is a shortcut for obtaining information resources from the Internet quickly and effectively. The web crawler is an important component of a search engine: it is responsible for collecting information from the Web and is the only source of the raw information in the search engine's database. Focusing on Web search technology, this thesis studies the working principles and related techniques of web crawlers in depth and, on that basis, designs and implements a high-performance distributed web crawler system.

The thesis first reviews the development of search engines to establish the practical significance and value of web crawlers, then surveys the history and current state of web crawler research, summarizing earlier work as a foundation for this study. With this background in place, it examines the existing implementation techniques for a distributed web crawler node, including crawl strategies, web page evaluation algorithms, HTML document parsing, multithreading, conversion between web page encodings, and polite crawling, and applies these key techniques to the distributed web crawler node.

The theory is then put into practice. The thesis analyzes the basic logic of the distributed web crawler node, divides the node into functional modules so that the responsibilities of each module are assigned in detail, designs the node's workflow in detail, and finally brings these design ideas together in the design of the node's class structure. A prototype of the distributed web crawler is implemented, and experiments on the Internet examine how the crawler node performs in operation, thereby verifying the feasibility and effectiveness of the distributed web crawler.

Keywords: parallel; web crawler node; information collection

Contents

Chapter 1 Introduction
  1.1 Research background
    1.1.1 Development of search engines
    1.1.2 Research and application significance of web crawlers
  1.2 Purpose and significance of the work
  1.3 Outline of the main work
  1.4 Organization of the thesis
Chapter 2 Web crawler background and overview of key technologies
  2.1 Web crawler background
    2.1.1 History of web crawler research
    2.1.2 Current state of web crawler development
  2.2 Overview of key technologies
    2.2.1 Crawl strategies
    2.2.2 Web page evaluation algorithms
    2.2.3 Web page parsing
      2.2.3.1 HTML parsing
      2.2.3.2 Extraction of page links
    2.2.4 Polite crawling
    2.2.5 Multithreading
      2.2.5.1 Overview of multithreading
      2.2.5.2 Problems caused by threads and their solutions
      2.2.5.3 Use of multithreading in the web crawler node
    2.2.6 Elimination of duplicate pages
    2.2.7 Web page storage
Chapter 3 Detailed design of the distributed web crawler node
  3.1 Basic logic design
  3.2 Structural design
    3.2.1 Download module
    3.2.2 Web page parsing module
      3.2.2.1 Main parsing process
      3.2.2.2 Web page encoding conversion
    3.2.3 Database storage module
    3.2.4 Polite crawling module
    3.2.5 Task location module
    3.2.6 Node communication module
  3.3 Detailed program design
    3.3.1 Overall framework
    3.3.2 Detailed workflow of the crawler node
    3.3.3 Class structure design
      3.3.3.1 Overall design of the crawler node classes
      3.3.3.2 Responsibilities and detailed work of the main classes
Chapter 4 System implementation and test analysis
  4.1 Description of the system implementation
  4.2 Experimental evaluation and analysis
Chapter 5 Concluding remarks
References
Acknowledgements

Chapter 1 Introduction

1.1 Research Background

In recent years the Internet has rapidly grown into a hybrid information space distributed across the globe. It has become an enormous and indispensable source of information, and search engines have become the primary means of finding information on it accurately and quickly. While the emergence and development of search engines have met users' needs to some extent, search engines also face growing challenges. Web search technology consists mainly of information collection and information processing. The web crawler is the program a search engine uses for information collection and is an important component of the search engine.

1.1.1 Development of Search Engines

Since the 1990s, the rapid development of the Internet has brought explosive growth in Web information. For users, finding information in this ocean is like looking for a needle in a haystack, and search engines emerged to solve exactly this problem. Search engines provide Internet information retrieval services to users and have become an active research topic in both industry and academia.

Crawler-based Web search engines first appeared around 1994. M. Mauldin connected J. Leavitt's web crawler to his indexing program and created Lycos. In the same year, two Stanford University doctoral students, D. Filo and Jerry Yang, founded the Yahoo directory, which made the concept of the search engine widely known. From then on, search engines entered a stage of rapid development.

Google [1] grew out of an experimental search engine developed at Stanford University in 1998, and users have rated it the most popular search engine for many years. One reason is that its index covers as many as 3 billion web pages; it also outperforms other search engines in response speed and result quality. Its key technologies are PageRank [1] and Hypertext-Matching Analysis (HMA).

Current search engines fall into roughly three categories: directory search engines, robot search engines, and metasearch engines.

Directory search engines rely mainly on manually maintained site indexes. Although they offer a search function, strictly speaking they are not true search engines, only lists of website links organized by category. Users can find the information they need by browsing the category directory alone, without any keyword query. Well-known directory search engines abroad include Yahoo, Open Directory Project, and LookSmart; in China, Sohu, Sina, and NetEase search offer similar functionality. Directory search engines have a clear classification structure and few errors, and they fit people's reading habits. Their drawbacks are that they require many staff, take a long time to compile, are slow, and involve much manual intervention, so they cannot keep pace with the growth of Web resources.

Robot search engines are search engines in the true sense. Representative examples abroad are Google, Yahoo Search, and MSN Search; well-known ones in China are Baidu and Zhongsou. They build a database from information extracted from websites across the Internet, retrieve the records that match the user's query, and return the results in a defined order. This is what is conventionally meant by a search engine today. Almost all the work of a robot search engine is done automatically by programs, with very little human involvement. It collects information on the Web with a web crawler and automatically adds the pages it finds to a local index database, so users can quickly find up-to-date information. When a site's pages change, the search engine detects the changes automatically, promptly updates the local index database, and reflects them in users' search results. Its advantages are a high degree of automation and low maintenance cost. It places more emphasis on technical innovation and improvement and is better suited to research, which has made it a current research focus.

A robot search engine usually has three main modules: information collection, information processing, and information query. Figure 1-1 uses Google as an example to illustrate a typical search engine architecture.

Figure 1-1 System architecture of Google

The components of a search engine are interleaved and interdependent. The working principle of a robot search engine can be summarized in four steps:

1. Fetch pages from the Internet. A web crawler program that automatically collects pages visits the Internet and follows the URLs in each page to other pages. The initial set of URLs is small; as more information is collected and newly parsed pages yield new links, the new URLs are added to the URL list so that crawling can continue. The process repeats, and every crawled page is collected on the server.

2. Build the index database. The indexing program analyzes the collected pages and extracts the relevant information (including the page's URL, encoding, keywords contained in the page content, keyword positions, generation time, size, and link relationships with other pages). Using a relevance algorithm, it performs extensive computation to obtain the relevance (or importance) of each page to every keyword in its content and hyperlinks, and then uses this information to build the page index database.

3. Search the index database. When a user submits keywords, the search request is parsed and the search program finds all pages in the index database that match the keywords.

4. Process and rank the search results. Because the index already records each page's relevance information for the keyword, the system only needs to combine that information with the page rank to form a relevance score and sort by it; the higher the score, the higher the ranking. Finally, the page generation system assembles the link addresses and content summaries of the results and returns them to the user.

A metasearch engine is a system that, behind a unified query interface and result format, draws on the resources of several search engines to provide retrieval services. Well-known metasearch engines include Dogpile and Vivisimo; representative ones in China include 搜星 (Souxing) and 优客 (Youke) Search. In presenting results, some list them by source engine, as Dogpile does, while others rearrange them according to their own rules, as Vivisimo does. The advantage of a metasearch engine is that users can query multiple index databases without remembering the addresses and query syntax of different search engines, which greatly improves the coverage of the results. It does not need to maintain a huge index database and can instead focus on integrating the retrieved results to improve accuracy. However, it consumes considerable network resources, the results returned from multiple engines often contain many duplicates, and relevance ranking is very difficult.

Each of the three types has its own strengths and weaknesses and is applied in different domains. Directory search engines and full-text search engines are now tightly integrated, as in Google and Peking University's Tianwang.
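The four steps of a robot search engine described above (fetch, index, search, rank) can be illustrated with a small, self-contained Python sketch. It is not the thesis's implementation. The in-memory TOY_WEB dictionary, the example URLs, the function names crawl, build_index, and search, and the keyword-count relevance score are all assumptions made for illustration; a real engine would download pages over HTTP and use a far more elaborate ranking such as PageRank.

```python
from collections import Counter, defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would download pages over HTTP instead.
TOY_WEB = {
    "a.example/": ("web crawler collects web pages", ["a.example/x", "b.example/"]),
    "a.example/x": ("crawler node design", ["a.example/"]),
    "b.example/": ("search engine index and ranking", ["a.example/x", "c.example/missing"]),
}


def crawl(seeds):
    """Step 1: follow links breadth-first, growing the URL list as pages are parsed."""
    frontier = deque(seeds)      # URLs waiting to be fetched
    seen = set(seeds)            # URLs already queued, to avoid refetching
    pages = {}
    while frontier:
        url = frontier.popleft()
        if url not in TOY_WEB:   # stands in for a failed download
            continue
        text, links = TOY_WEB[url]
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


def build_index(pages):
    """Step 2: map each keyword to the pages containing it, with an occurrence count."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word, count in Counter(text.lower().split()).items():
            index[word][url] = count
    return index


def search(index, query):
    """Steps 3 and 4: look up each query keyword, sum the counts as a crude
    relevance score, and return URLs from most to least relevant."""
    scores = Counter()
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Sort by descending score; ties are broken alphabetically by URL.
    return [url for url, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


if __name__ == "__main__":
    pages = crawl(["a.example/"])
    index = build_index(pages)
    print(search(index, "web crawler"))
```

Starting from the single seed URL, the crawl reaches all three toy pages and skips the link that cannot be fetched. The query "web crawler" then ranks a.example/ ahead of a.example/x, because the first page contains both keywords and the second contains only one.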
