Subject-Oriented Web Crawler: Web Page Information Extraction (Undergraduate Thesis)

Undergraduate Thesis

School: School of Software
Department: Software Engineering
Major: Software Engineering

Abstract

With the rapid development of the Internet, network resources have grown sharply and information has become more diverse, which poses a severe challenge to general search engines. Because a general search engine serves every Web information seeker, the sheer scale of online information combined with the demand for fast responses leaves its results unsatisfactory. The subject-oriented search engine is a new generation of search engine that aims to raise the relevance of search results. It offers an Internet search service with more accurate classification, more complete data, and more timely updates. Research on information collection in subject-oriented search engines, and on the search strategies of subject-oriented crawler systems, therefore plays a very important role in the application and development of such engines.

After a comprehensive overview of the evolution and development of search engines, this thesis compares the performance of general and subject-oriented search engines. It then introduces the subject-oriented crawler, an important component of the subject-oriented search engine, and analyzes its basic structure and working principle. Next, it evaluates several classic page similarity algorithms used by web crawlers. It also discusses URL search strategies in detail, including an improved strategy, and describes and implements the HTML-based collection of information from Web pages in our crawler system. Finally, it presents the crawler we implemented.

The thesis addresses four main problems:

(1) It studies the working principle, functional modules, and basic techniques of the subject-oriented crawler.
(2) It discusses the classic page similarity algorithms for subject-oriented crawlers, both link-based and content-based, as well as URL search strategy algorithms.
(3) It studies and implements information extraction from HTML-based web pages.
(4) It implements the user interface of the system.

Keywords: web crawler; URL search strategy; Web information extraction

Contents

Chapter 1 Introduction
  1.1 Background and research significance
  1.2 History of the development of search engines
  1.3 Research status at home and abroad
  1.4 Main work and structure of this thesis
Chapter 2 Background knowledge
  2.1 Classification of search engines
    2.1.1 General search engines
    2.1.2 Subject-oriented search engines
    2.1.3 Classification of search engines by working mode
  2.2 Web crawlers
    2.2.1 The concept of a web crawler
    2.2.2 Working principle of a web crawler
    2.2.3 Main technical problems of web crawlers
  2.3 Evaluation criteria for search engines
  2.4 Subject-oriented information extraction
    2.4.1 Categories of subject-oriented information extraction
    2.4.2 Advantages of subject-oriented Web information extraction
  2.5 Distribution features of subject pages on the Web
    2.5.1 Hub feature
    2.5.2 Subject relatedness feature
    2.5.3 Site subject feature
    2.5.4 Tunnel feature
  2.6 Chapter summary
Chapter 3 Research on search strategies for subject-oriented crawlers
  3.1 URL search in subject-oriented crawlers
    3.1.1 Breadth-first search
    3.1.2 Depth-first search
    3.1.3 Best-first search
    3.1.4 Improved search strategy
  3.2 Relevance judgment for subject pages
    3.2.1 Link-based relevance algorithms
    3.2.2 Content-based relevance algorithms
  3.3 Chapter summary
Chapter 4 Research and implementation of web page information extraction
  4.1 Research on web page information extraction
    4.1.1 Overview of web page subject information extraction
    4.1.2 Goals of web page subject information extraction
  4.2 Implementation of web page subject information extraction
    4.2.1 Workflow of block-based web page subject information extraction
    4.2.2 Cleaning of the HTML tag document
    4.2.3 Coarse-grained partitioning based on container tags
    4.2.4 Page structure analysis by the block partitioner
    4.2.5 Extraction of page hyperlinks
  4.3 Interface design
  4.4 Chapter summary
Chapter 5 System demonstration
Chapter 6 Conclusion
References
Acknowledgements

Chapter 1 Introduction

Before beginning the research and analysis proper, we first clarify the current state and background of research on this topic, which helps fix the direction of the work.

1.1 Background and research significance

With the rapid development of the Internet, the data and information available online in every field have also grown sharply. Since the WWW and the browser appeared in the 1990s, the network has expanded at an astonishing pace. According to an announcement made on the 28th by Netcraft, a US Internet monitoring company, the number of websites worldwide exceeded 160 million by the end of February 2008, 4.5 million more than a month earlier [1]. The number of web pages has likewise grown geometrically, reaching the order of ten billion.

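By early 2009, China had the largest number of Internet users in the world, nearly 300 million, whereas in early 2008 China had only about 1.4 million. Meanwhile, as mobile phone technology has developed, more and more mobile phone users have joined the ranks of Internet users. In a CHIP survey from June 2008, 73.3% of Internet users felt that general search engines return results with a high rate of duplication. In addition, 48.3% felt that specialized and industry-specific search was poor, and 49.1% felt that information was updated too slowly. All of these are weaknesses of general search engines.

A general search engine is aimed at the public at large, and its goal is to cover as much of the network as possible. However, as information of every kind on the Web grows explosively, the conflict between limited search engine server resources and unlimited network data resources becomes ever more pronounced. Even a large general search engine such as Google had an average resource coverage of only 76% (as of January 2005). Figure 1-1 shows the Web resource coverage of the major search engines. These engines also face the problem of refreshing the information in their enormous index databases.

Figure 1-1 Resource coverage of general search engines

In response to the many problems of general search engines, and to a deeper understanding of user needs, another form of search engine has emerged. It can, within a narrower scope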
