Web Crawling

Based on the slides by Filippo Menczer, Indiana University School of Informatics, in Web Data Mining by Bing Liu

Outline
- Motivation and taxonomy of crawlers
- Basic crawlers and implementation issues
- Universal crawlers
- Crawler ethics and conflicts

Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled.

Crawler: basic idea
[Figure: starting pages (seeds)]

Many names
- Crawler
- Spider
- Robot (or bot)
- Web agent
- Wanderer, worm, ...
- And famous instances: googlebot, scooter, slurp, msnbot, ...

Googlebot & you

Motivation for crawlers
- Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
- Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
- Business intelligence: keep track of potential competitors, partners
- Monitor Web sites of interest
- Evil: harvest emails for spamming, phishing
- Can you think of some others?

A crawler within a search engine
[Figure: Web, googlebot, Page repository, Text & link analysis, Text index, PageRank, Query, Ranker, hits]

One taxonomy of crawlers
- Many other criteria could be used: incremental, interactive, concurrent, etc.
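
The basic idea of a crawler, starting from seed pages and following links outward, can be sketched in a few lines. This is a minimal illustration, not the implementation from the slides: it assumes a breadth-first visiting order, and the names `crawl`, `fetch_links`, `frontier`, and `TOY_WEB` are hypothetical. To stay self-contained it walks a small in-memory link graph instead of fetching real pages.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> list of outgoing links.
# A real crawler would fetch each page over HTTP and parse its links.
TOY_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["c.example/", "a.example/1"],
    "c.example/": [],
}


def toy_fetch_links(url):
    """Return the outgoing links of a page in the toy web."""
    return TOY_WEB.get(url, [])


def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first from the seeds and return the visit order."""
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    frontier = deque(unique_seeds)             # pages waiting to be visited
    seen = set(unique_seeds)                   # pages already queued or visited
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:               # never queue a page twice
                seen.add(link)
                frontier.append(link)
    return visited


if __name__ == "__main__":
    print(crawl(["a.example/"], toy_fetch_links))
```

The `seen` set is what keeps the crawl from looping on pages that link back to each other, and `max_pages` bounds the crawl so it always terminates.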