Mining Quality Phrases from Massive Text Corpora
Authors: Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, Jiawei Han
Conference: SIGMOD '15, May 31-June 4, 2015, Melbourne, Victoria, Australia
Reporter: Liu Shan

Content
1. Introduction
2. Main idea
2.1 Frequent phrase detection
2.2 Phrase quality estimation
2.3 Phrasal segmentation
2.4 Feedback as segmentation features
3. Experiment
4. Conclusion

1. Introduction
Motivation
Using raw frequency alone, all heuristics produce identical predictions for "relational database" and "vector machine", guaranteeing that one of them is wrong.
The paper proposes a rectification: estimate how many times each word sequence should be interpreted as a whole phrase in its occurrence context.
Slide annotations: the ground truth ("fact") comes from human judgment; the rectified frequency is used to judge whether a word sequence is a phrase.

2. Main idea
[Diagram: phrase quality, rectified frequency, and two more features, connected by arrows labeled "estimate", "create", and "influence".]
1. Generate frequent phrase candidates (Sec. 2.1).
2. Estimate phrase quality (Sec. 2.2).
3. Estimate rectified frequency via phrasal segmentation (Sec. 2.3).
4. Add segmentation-based features to the feature set of the phrase quality classifier (Sec. 2.4). Repeat steps 2 and 3.
5. Filter out phrases with low rectified frequencies.

Preliminaries
Phrase quality is quantified based on four requirements:
Popularity: quality phrases should occur with sufficient frequency in a given document collection.
Concordance: the tokens collocate at a frequency significantly higher than expected by chance, e.g. "strong tea" vs. "powerful tea".
Informativeness: the phrase is indicative of a specific topic.
Completeness: a complete phrase should be interpreted as a whole semantic unit in a certain context.

2.1 Frequent phrase detection
Slide annotations on the algorithm: filter out infrequent phrases, according to the popularity requirement; concatenate each candidate with the next word.

2.2 Phrase quality estimation
Classifier: random forest.
Figure annotations: bad quality ratio; good quality ratio (what we need).

Informativeness
Feature 3. Phrases that begin or end with stopwords, such as "I am", are often functional rather than informative.
Feature 4. Average inverse document frequency (IDF) computed over words.
Feature 5. Punctuation: a phrase that is in quotes, in brackets, or capitalized deserves a higher probability.
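A minimal sketch of how Features 3 and 4 might be computed. It assumes whitespace tokenization, a tiny hypothetical stopword list, and the common IDF definition log(N / df); the slides do not give the exact formula, and the names `stopword_boundary`, `idf_table`, and `average_idf` are illustrative only.

```python
import math

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"i", "am", "the", "a", "an", "of", "in", "is", "and", "to"}


def stopword_boundary(phrase):
    """Feature 3 (sketch): True if the phrase begins or ends with a stopword."""
    words = phrase.lower().split()
    if not words:
        return False
    return words[0] in STOPWORDS or words[-1] in STOPWORDS


def idf_table(documents):
    """Map each word to log(N / df), where df is its document frequency.

    Assumption: this is one common IDF variant, not necessarily the one
    used in the paper.
    """
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def average_idf(phrase, idf, default=0.0):
    """Feature 4 (sketch): mean IDF over the words of the phrase.

    Words missing from the table fall back to `default`.
    """
    words = phrase.lower().split()
    if not words:
        return 0.0
    return sum(idf.get(word, default) for word in words) / len(words)


if __name__ == "__main__":
    docs = [
        "i am reading about the relational database",
        "i am sure the support vector machine is a classifier",
        "i am using a relational database in the lab",
        "i am cold",
    ]
    idf = idf_table(docs)
    for phrase in ("I am", "relational database"):
        print(phrase, stopword_boundary(phrase), average_idf(phrase, idf))
```

In this toy corpus "I am" appears in every document, so its average IDF is zero and it is also flagged by the stopword check, whereas "relational database" scores higher on IDF and is not flagged. In the pipeline described above, values like these would be fed to the random forest as features rather than used as hard filters.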