A Survey of Web-Based Data Mining Methods.doc

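On the Named Entity based Relation Extraction and Event Supported Web Page Representation

Dissertation Submitted to Peking University in partial fulfillment of the requirement for the degree of Doctor of Philosophy in Science

Di Nan (Computer Science and Technology)
Dissertation Supervisor: Professor Xiaoming Li
May 2010

Copyright Notice

No organization or individual that receives or holds any version of this dissertation may, without the author's consent, lend it to others or copy, transcribe, photograph, or otherwise disseminate it. Anyone who does so and thereby harms the author's copyright interests may bear legal liability.

Research on Entity Relation Discovery Based on Entity Context and on Web Page Document Representation Supporting Event Discovery

Abstract: A named entity is a specific thing in the real world, and the text of web pages contains a large amount of content about named entities. Some of this content describes static information about entities, such as entity attributes and the relations between entities. Text describing static information is usually a short single sentence containing the entity, and its content does not change noticeably over time. Other text containing entities describes dynamic information, mainly the entities' participation in news events and their behavior in those events. Content of this second kind is longer than the first: it usually consists of several sentences forming a text segment with relatively coherent content, and its information changes noticeably over time. This dissertation takes the analysis and mining of entity contexts in web page text as its basic method and addresses two problems: discovering relations between entities from entity co-occurrence text, and entity-centered event detection and tracking. In summary, the main contributions on these two research problems are as follows.

(1) Extraction of web entity relation instances

An important way in which an entity relation appears in web page text is that a pair of entities in a particular relation co-occur in a passage describing that relation. We define such a passage as a web entity relation instance. Extracting web entity relation instances in sufficient number and of sufficiently high quality is an important prerequisite for discovering relations between entities effectively. Existing work that uses named entity context to discover relations between named entities generally takes the sentences containing a named entity pair directly as the features representing that pair. This practice has two obvious problems. First, in massive web page text, a sentence in which two named entities co-occur may be a web entity relation instance describing their relation, but it may also describe a dynamic feature, namely that both entities take part in the same event. Second, because sentences describing entity relations are much shorter than the documents in traditional text classification corpora, even texts describing the same type of entity relation may differ greatly in their lexical features. The experiments in this dissertation confirm that directly using entity contexts from web page text as entity-pair features harms entity relation discovery.

We therefore pose two new research problems: filtering and expanding the entity contexts that describe entity relations. On this basis, the dissertation proposes an effective method for solving both. First, the method analyzes and mines text describing entity relations in Wikipedia and Baidu Baike to learn a language model of entity relation descriptions, uses Bayes' formula to compute the probability that an entity co-occurrence sentence carries entity relation information, and filters co-occurrence sentences accordingly. Second, the method uses a search engine as an intermediary: it submits the entity co-occurrence text describing an entity relation as a query and takes other text on the Internet describing the same entity pair as an expansion of the text for that pair. By iterating the filtering and expansion of entity-pair contexts, the method finally obtains the text features representing each entity pair.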
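The filtering step described above can be illustrated with a small sketch. It is not the dissertation's implementation: it assumes unigram language models with add-one smoothing, a second "background" model for non-relation sentences, and a fixed prior, none of which the abstract specifies. The function names and toy corpora are hypothetical.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Count word occurrences; returns (counts, total_tokens)."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return counts, sum(counts.values())

def log_likelihood(sentence, model, vocab_size, alpha=1.0):
    """Log P(sentence | model) under a smoothed unigram model."""
    counts, total = model
    return sum(
        math.log((counts[tok] + alpha) / (total + alpha * vocab_size))
        for tok in sentence.lower().split()
    )

def relation_posterior(sentence, rel_model, bg_model, vocab_size, prior=0.5):
    """Bayes' rule: P(relation | sentence) from the two class likelihoods."""
    log_rel = math.log(prior) + log_likelihood(sentence, rel_model, vocab_size)
    log_bg = math.log(1.0 - prior) + log_likelihood(sentence, bg_model, vocab_size)
    top = max(log_rel, log_bg)  # subtract the max for numerical stability
    rel, bg = math.exp(log_rel - top), math.exp(log_bg - top)
    return rel / (rel + bg)

# Toy corpora (invented for illustration only).
relation_corpus = [
    "alice is the wife of bob",
    "carol works for acme as an engineer",
    "dave was born in paris",
]
background_corpus = [
    "alice and bob attended the summit on monday",
    "carol and dave spoke at the protest yesterday",
    "the storm hit paris on friday",
]

rel_model = train_unigram(relation_corpus)
bg_model = train_unigram(background_corpus)
# One extra vocabulary slot is reserved for words unseen in training.
vocab_size = len(set(rel_model[0]) | set(bg_model[0])) + 1

def filter_sentences(sentences, threshold=0.5):
    """Keep co-occurrence sentences likely to describe a relation."""
    return [
        s for s in sentences
        if relation_posterior(s, rel_model, bg_model, vocab_size) > threshold
    ]

candidates = [
    "erin is the wife of frank",
    "erin and frank attended the summit",
]
p_relation = relation_posterior(candidates[0], rel_model, bg_model, vocab_size)
p_event = relation_posterior(candidates[1], rel_model, bg_model, vocab_size)
kept = filter_sentences(candidates)
```

In this toy setting the first candidate scores above the threshold and the second below it, so only the first is kept. The threshold and prior would need tuning on real data.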

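(2) Graph-diffusion-based entity relation type labeling

Through the work above, each entity pair is represented by its web entity relation instances in web pages. The next step is to label these web entity relation instances with the relation type they describe. This dissertation uses the named entity relation taxonomy provided by the ACE (Automatic Content Extraction) evaluation organized by NIST. The taxonomy comprises three major categories of relations (person-person, person-organization, and person-location) and 11 specific relation subtypes. To determine the relation type of each entity pair, earlier researchers mainly took a certain amount of labeled text describing each relation type as a training set and used traditional supervised learning methods such as kNN and SVM to classify the co-occurrence text of an unlabeled entity pair into a specific relation category, labeling the pair's relation type accordingly. Because the web page text we process is massive in volume and loosely written, manually labeling or automatically acquiring a training set of sufficient size and quality on such text is difficult. We therefore propose a new graph-diffusion-based labeling method in which the only manual effort is labeling a few entity pairs for each relation category. The method takes co-occurring entity pairs as vertices and builds edges from the similarity between entity-pair contexts, constructing an undirected weighted graph. Using the edge weights, a semi-supervised iterative diffusion process propagates the category information of the few labeled vertices across the whole graph; when propagation reaches equilibrium, the relation types of all vertices, that is, of all entity pairs, are obtained. Experiments show that with very little labeled data the method performs markedly better than the supervised learning methods used in prior work, and that the resulting relation types do not depend on the particular set or number of entity pairs labeled beforehand.

(3) News web page representation model based on multi-dimensional web document features

An event is an observable, non-trivial phenomenon. Its elements can include the time and place of occurrence, the course of the event, and the entities involved. Events are reflected on the Web as news pages. Whereas traditional news media (such as newspapers and radio) offer only body-text features, news web pages carry more feature information useful for news event detection, such as the page URL, the page timestamp, and the named entities appearing in the page. The experiments in this dissertation show that these features correlate strongly with the news event their page narrates and that they help determine whether two news pages describe the same event. Whether an effective news web page representation model can be proposed has therefore become an important and active question in event detection research based on news pages. Earlier work has already used some of these page-specific features, such as the page timestamp and the named entities in the body text. It should be noted, however, that this work only takes the Vector Space Model as its basis and uses these features to modify the body-text representation model. This dissertation proposes a new news page representation model based on multi-dimensional page features. Any of the event-related page features above can be added to the model freely, and each dimension is represented independently of the others. To measure the similarity between news pages accurately under this model, we propose a method that uses a Support Vector Machine to combine the similarities of the individual feature dimensions. Under this method, the influence of each page feature on page similarity is learned automatically through training, unlike existing work in which the contribution of each feature is set by hand. Using two sets of news pages from the real Web, one Chinese and one English, our experiments represent a news page by its timestamp, the named entities in its body, its body text, its related-news links, and its reader comments. The results show that event detection under the multi-dimensional representation model is markedly better than the traditional approach that uses only body-text features.

(4) Named-entity-centered body text segmentation model

Among the features of a news page that relate to the content of the news event, the body text remains the most important. How the body text is modeled strongly affects event detection performance. A news document is narrated around the named entities that take part in the event, and the context of a named entity describes either the entity's attributes or its behavior in the event. We therefore propose and verify two hypotheses about news document bodies:
1. In the body of a news document, named entity contexts carry more information about the news event than the rest of the body.
2. In the body of a news document, different entity contexts describe different aspects of the event, such as its background, its progress, and commentary on it. Across news documents reporting the same event, entity contexts of the same type have higher textual similarity.
Based on these two hypotheses, modeling news page text divides into two sub-problems: named-entity-centered body segmentation, and the classification and ranking of text segments. For the first sub-problem, the dissertation proposes three segmentation methods, based respectively on inter-sentence similarity, on mutual information between segments, and on document segment alignment. For the second sub-problem, the dissertation classifies segments by their position in the body, the entities they contain, the entity types, and the segment content, and uses inter-segment similarity to introduce two measures of segment importance: a generality score and a novelty score. Event detection experiments show that representing the body with the entity-centered segmentation model outperforms the traditional single body-text feature vector.

Keywords: web content mining, entity relation discovery, event detection and tracking

Abstract: A named entity is the name of a specific object in real life, and plenty of information about named entities lies inside the text of web pages. On the one hand,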

some of this information describes static entity attributes, such as the relations between two entities. The context that describes static attributes is generally short, such as a single sentence containing the entities, and its content does not change over time. On the other hand, the named entity context also contains dynamic information about the entities, which describes the news events in which the entities are involved and the entities' behavior in those events. The context that describes dynamic information is much longer than the static kind. It may contain several sentences that form a topically coherent paragraph, and its content changes over time. In this dissertation, I employ the mining of the context surrounding named entities in web pages as the basic method, and solve the named entity relation discovery problem and the news document representation problem respectively. In summary, the contributions of this dissertation are in the following five parts:

(1) Extraction of web entity relation instances

A named entity relation is represented in web page content mainly by the co-occurrence of two named entities in the same sentence. We define such a sentence as the web entity relation instance of the related entity pair. Extracting a sufficient number of high-quality web entity relation instances is one of the main challenges in the entity relation discovery problem.

The existing solution to this problem is to use all sentences with entity co-occurrence as web entity relation instances. This introduces two obvious faults. One is that not every entity co-occurrence describes the relation between the two entities; it may instead describe the two entities' involvement in some news event. The other is that a sentence containing entity relation information is much shorter than the documents in a traditional classification corpus; worse, descriptions of the same entity relation may vary across contexts. Our experiments also verify the drawbacks of directly using entity co-occurrence in entity relation discovery.

Under this circumstance, I introduce a new research problem of web entity relation instance classification. I also propose two methods to solve this problem. The first takes advantage of the distinctive term distribution of the language that describes entity relations: it learns an entity relation language model from a training corpus drawn from Wikipedia and Baidu Baike, and uses Bayes' theorem to calculate the probability that a sentence is a web entity relation instance. The second expands the web entity relation instance by adding the search results returned when particular keywords are submitted to search engines. By repeating these two steps iteratively, we finally acquire enough context to represent a web entity relation instance.

(2) Graph based entity relation classification

With the above method, each entity pair is represented by web entity relation instances drawn from the context surrounding the entities. The next problem is to label each entity pair with a specific relation type. In this dissertation, the relation categories introduced by ACE (Automatic Content Extraction) are employed. The three main relation categories are Person-Person, Person-Location, and Person-Organization. ACE also contains twelve specific relation types, such as the family relation under Person-Person and the employment relation under Person-Organization.

To label each web entity relation instance, I do not use supervised learning methods such as kNN and SVM. These methods need a labeled training corpus of sufficient quantity and quality for each category, and such a corpus cannot be easily acquired for web entity relation instances. We therefore introduce a graph-based method that constructs a graph with web entity relation instances as its nodes and their similarities as its edges. This method only needs a small number of web entity relation instances to be
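The graph-based labeling outlined above can be sketched as a small label propagation routine. This is a sketch under stated assumptions, not the dissertation's algorithm: it assumes bag-of-words cosine similarity for edge weights and a standard clamped propagation rule (seed nodes keep their labels, every other node takes the weighted average of its neighbors). The function names, contexts, and seed labels are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def propagate_labels(contexts, seeds, labels, max_iters=100, tol=1e-6):
    """Spread seed labels over an undirected weighted similarity graph.

    contexts: one context string per entity pair (the graph nodes)
    seeds:    {node_index: label} for the few hand-labeled pairs
    labels:   list of all relation labels
    """
    n, k = len(contexts), len(labels)
    vecs = [Counter(c.lower().split()) for c in contexts]
    # Symmetric edge weights; no self-loops.
    w = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    # Each node holds a score per label; seeds start as one-hot.
    f = [[0.0] * k for _ in range(n)]
    for i, label in seeds.items():
        f[i][labels.index(label)] = 1.0
    for _ in range(max_iters):
        new_f = []
        for i in range(n):
            degree = sum(w[i])
            if i in seeds or degree == 0.0:
                new_f.append(f[i][:])  # clamp seeds; isolated nodes stay put
                continue
            new_f.append([
                sum(w[i][j] * f[j][c] for j in range(n)) / degree
                for c in range(k)
            ])
        delta = max(abs(x - y) for r1, r2 in zip(new_f, f) for x, y in zip(r1, r2))
        f = new_f
        if delta < tol:  # propagation has reached equilibrium
            break
    # Nodes that never receive any score default to the first label.
    return [labels[max(range(k), key=lambda c: f[i][c])] for i in range(n)]

# Toy entity-pair contexts (invented for illustration only).
contexts = [
    "alice married her husband bob",
    "carol joined acme as chief engineer",
    "erin and her husband frank married in june",
    "grace joined initech as senior engineer",
]
seeds = {0: "family", 1: "employment"}
predicted = propagate_labels(contexts, seeds, ["family", "employment"])
```

Only two of the four nodes are labeled by hand here; the other two inherit a label from whichever seed their context resembles. On real web text the similarity measure and stopping rule would matter far more than in this toy graph.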
