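Data Mining Techniques

Review
What is data mining?
  Data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses, or other information repositories.
  It is a young interdisciplinary field, drawing from areas such as database systems, data warehousing, statistics, machine learning, data visualization, information retrieval, and high-performance computing.
  Other contributing areas include neural networks, pattern recognition, spatial data analysis, image databases, signal processing, and many application fields, such as business, economics, and bioinformatics.

Review
KDD: knowledge discovery in databases
Data mining: the core of the knowledge discovery process
  Preprocessing
    Data cleaning
    Data integration
    Data selection
    Data transformation
  Data mining
  Pattern evaluation
  Knowledge presentation

Top 10 Algorithms
#1: C4.5 (61 votes): decision tree, classification algorithm
#2: K-Means (60 votes): k-means clustering algorithm
#3: SVM (58 votes): support vector machine, classification algorithm
#4: Apriori (52 votes): association rule mining algorithm
#5: EM (48 votes): expectation-maximization algorithm, clustering and parameter estimation
#6: PageRank (46 votes): Google's well-known page ranking algorithm
#7: AdaBoost (45 votes): classification algorithm that combines weak learners into a strong one
#7: kNN (45 votes): classification method that takes the nearest neighbors as the model
#7: Naive Bayes (45 votes): classification algorithm based on the native distribution of the objects, requiring little or no prior knowledge
#10: CART (34 votes): decision tree classification method based on binary recursive partitioning

Top 10 Problems
#1: A unified theory of data mining. Ten years ago, experts observed that data mining was dominated by short-term, urgency-driven work that developed techniques for individual problems, with no unified theory and little long-range vision. To this day, a reasonably complete unified theory of data mining is still being explored.
#2: Scalability, high dimensionality, and high speed. With the data mining techniques of ten years ago, the resources required (time, space, and CPU) grew exponentially as the dimensionality and the data volume increased, which became a bottleneck in data stream analysis, network attack and defense, and sensor network applications;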
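the problem remains today.
#3: Efficient processing of time series, together with efficient classification, clustering, and prediction. Problems remain today in short-term and long-term forecasting and in high-precision processing.
#4: Mining complex knowledge from complex data, with graph data mining as a prominent example. Today there is also demand for mining intervention rules in sub-complex systems.
#5: Network mining: social networks, email, web pages, network counter-terrorism, massive data mining, and so on. The problems remain.
#6: Distributed mining and multi-agent mining, for example in large online games and network military confrontation. Demand is growing.
#7: Biological data mining: data mining related to AIDS vaccines and to DNA. This area is still on the rise.
#8: Research on the methodology of data mining itself. A breakthrough is still awaited.
#9: Data mining with information security and privacy protection. This is currently a hot topic.
#10: Mining of special kinds of data, including high-value data (such as intensive care unit data), skewed data (distorted by sampling bias), and imbalanced data (where the useful part is only a small fraction).

Mining Frequent Patterns, Associations

Outline
  What is association rule mining and frequent pattern mining?
  Methods for frequent-pattern mining
  Constraint-based frequent-pattern mining
  Frequent-pattern mining: achievements, promises, and research problems

Market Basket Analysis
Market basket analysis is a process that analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets.
Suppose you are a manager of a supermarket, what would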
you like to learn about the buying habits of your customers?

What Can Market Basket Analysis Help With?
Customer: who are they? Why do they make certain purchases?
Merchandise: which products are likely to be purchased together? Which are most amenable to promotion? Does the brand of a product make a difference?
Usage:
  Store layout
    Items that are frequently purchased together can be placed on nearby shelves to further encourage the sale of such items together.
    Placing items that are frequently purchased together at opposite ends of the store may entice customers who purchase such items to pick up other items along the way.
  Product placement
    Offer correlated products to the customer at the same time.
    Send a memory card offer to digital camera purchasers 2-3 months after the digital camera purchase.
  Coupon issue
    Don't give discounts on two items that are frequently bought together. Use the
    discount on one to “pull” the other.

Association Rules from Market Basket Analysis
Method:
  Transaction 1: Frozen pizza, cola, milk
  Transaction 2: Milk, potato chips
  Transaction 3: Cola, frozen pizza
  Transaction 4: Milk, pretzels
  Transaction 5: Cola, pretzels

  Co-occurrence counts (number of transactions containing both items):

                Frozen Pizza   Milk   Cola   Potato Chips   Pretzels
  Frozen Pizza       2           1      2         0            0
  Milk               1           3      1         1            1
  Cola               2           1      3         0            1
  Potato Chips       0           1      0         1            0
  Pretzels           0           1      1         0            2

  The table hints that frozen pizza and cola may sell well together and should be placed side by side in the convenience store.
Results: we could derive the association rules:
  If a customer purchases Frozen Pizza, then they will probably purchase Cola.
  If a customer purchases Cola, then they will probably purchase Frozen Pizza.

What Are Frequent Patterns?
Frequent patterns: product combinations that are frequently purchased together by customers.
Frequent patterns: patterns (itemsets, subsequences, substructures, etc.) that occur frequently in a database [AIS93].
For example:
  A set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset.
  A subsequence, such as buying first a PC, then a digital camera, and then a memory card, is a frequent sequential pattern if it occurs frequently in a shopping history database.
  A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices. If a substructure occurs frequently, it is called a frequent structured pattern.
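
The five transactions from the market basket example are small enough to work through in code. The Python sketch below is illustrative only and is not a method given in these slides: it rebuilds the co-occurrence counts, lists frequent itemsets by brute-force counting, and scores the two derived rules. The function names, the threshold `min_count = 2`, and the use of confidence (a standard rule measure that this section has not defined) are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the market basket example.
transactions = [
    {"Frozen Pizza", "Cola", "Milk"},
    {"Milk", "Potato Chips"},
    {"Cola", "Frozen Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]


def cooccurrence(transactions):
    """Count, for each ordered pair (a, b), the transactions containing both.

    The diagonal entry (a, a) is the number of transactions containing a.
    """
    counts = Counter()
    for t in transactions:
        for a in t:
            for b in t:
                counts[(a, b)] += 1
    return counts


def frequent_itemsets(transactions, min_count):
    """Brute force: keep every itemset found in at least min_count transactions.

    min_count is a hypothetical threshold; the slides do not specify one.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_count:
                result[candidate] = count
                found = True
        if not found:
            # If no itemset of this size is frequent, no larger one can be,
            # because every subset of a frequent itemset is also frequent.
            break
    return result


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing antecedent that also contain consequent."""
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / ante


counts = cooccurrence(transactions)
print(counts[("Frozen Pizza", "Cola")])  # 2, matching the table entry

for itemset, count in frequent_itemsets(transactions, min_count=2).items():
    print(itemset, count)

print(confidence(transactions, {"Frozen Pizza"}, {"Cola"}))  # 1.0
print(confidence(transactions, {"Cola"}, {"Frozen Pizza"}))  # 2 of 3 cola baskets
```

With this threshold the only frequent itemset of size two is Cola with Frozen Pizza, which agrees with the hint read off the table. Brute-force enumeration is only workable for a handful of items; it is used here to keep the sketch short.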