Text Mining (文本挖掘.pptx)

Uploader: 99****p | Document ID: 1420479 | Uploaded: 2019-02-25 | Format: PPTX | Pages: 28 | Size: 1.05 MB

Mining Quality Phrases from Massive Text Corpora
Authors: Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, Jiawei Han
Conference: SIGMOD '15, May 31-June 4, 2015, Melbourne, Victoria, Australia
Reporter: Liu Shan

Content
1. Introduction
2. Main idea
2.1 Frequent phrase detection
2.2 Phrase quality estimation
2.3 Phrasal segmentation
2.4 Feedback as segmentation features
3. Experiment
4. Conclusion

1. Introduction
Motivation
Using raw frequency alone, all heuristics will produce identical predictions for "relational database" and "vector machine", guaranteeing that one of them is wrong.
The paper proposes a rectification that estimates how many times each word sequence should be interpreted as a whole phrase in its occurrence context.
[Figure labels: "Fact, human judge"; "From rectified to judge whether it's a phrase"]

2. Main idea
[Diagram: nodes "Phrase quality", "Rectified frequency", "Two more features"; edge labels "estimate", "create", "influence"]
1. Generate frequent phrase candidates (Sec. 2.1).
2. Estimate phrase quality (Sec. 2.2).
3. Estimate rectified frequency via phrasal segmentation (Sec. 2.3).
4. Add segmentation-based features to the feature set of the phrase quality classifier (Sec. 2.4). Repeat steps 2 and 3.
5. Filter phrases with low rectified frequencies.

Preliminaries
Phrase quality is quantified based on four requirements:
Popularity: Quality phrases should occur with sufficient frequency in a given document collection.
Concordance: The tokens collocate with a frequency significantly higher than would be expected by chance, e.g. "strong tea" vs. "powerful tea".
Informativeness: A phrase is informative if it is indicative of a specific topic.
Completeness: A complete phrase should be interpreted as a whole semantic unit in certain contexts.

2.1 Frequent phrase detection
To filter infrequent phrases
To concatenate with the next word
According to the popularity requirement

2.2 Phrase quality estimation
Random forest
Bad quality ratio
Good quality ratio (what we need)

Informativeness
Feature 3. Phrases that begin or end with stopwords, such as "I am", are often functional rather than informative.
Feature 4. Average inverse document frequency (IDF) computed over the words of the phrase.
Feature 5. Punctuation: a phrase that appears in quotes or brackets, or is capitalized, deserves a higher probability.
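Features 3 and 4 can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the stopword list, the log(N / df) form of IDF, the toy corpus, and the function names are all hypothetical and chosen only to make the two features concrete.

```python
import math
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"i", "am", "the", "a", "an", "of", "is", "to", "and", "in"}


def document_frequencies(docs):
    """Count, for each word, how many documents contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a word counts once per document
    return df


def idf(word, df, n_docs):
    """Assumed IDF variant: log(N / df); unseen words are treated as df = 1."""
    return math.log(n_docs / max(df.get(word, 0), 1))


def informativeness_features(phrase, df, n_docs, stopwords=STOPWORDS):
    """Return (begins_or_ends_with_stopword, average_idf) for a phrase."""
    words = phrase.lower().split()
    boundary_stopword = words[0] in stopwords or words[-1] in stopwords
    avg_idf = sum(idf(w, df, n_docs) for w in words) / len(words)
    return boundary_stopword, avg_idf


# Toy corpus (made up for this example), one token list per document.
docs = [
    "i am reading about relational database systems".split(),
    "i am tuning a support vector machine".split(),
    "the relational database stores the data".split(),
    "i am here".split(),
]
df = document_frequencies(docs)
n_docs = len(docs)

print(informativeness_features("I am", df, n_docs))
print(informativeness_features("relational database", df, n_docs))
```

On this toy corpus the functional phrase "I am" is flagged by the stopword feature and gets a lower average IDF than "relational database". In a full pipeline, values like these would be passed as input features to the phrase quality classifier (the random forest mentioned in Sec. 2.2).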
