Wang Daling, Data Mining Courseware, Part 6 (王大玲数据挖掘课件-6.ppt)

Uploaded by: 99****p | Document ID: 1420481 | Upload date: 2019-02-25 | Format: PPT | Pages: 90 | Size: 3.65 MB

Data Mining Knowledge Discovery

Slide 2 of 90
- Classification Basic Concepts
- Decision Tree Induction
- Bayes Classification Methods
- Support Vector Machines
- Other Classification Methods
- Summary

Slide 3 of 90: Classification vs. Prediction
- Classification predicts categorical class labels.
- Prediction models continuous-valued functions, i.e. it predicts numerical values.
- In some machine learning literature, both predicting categorical class labels and modeling continuous-valued functions are called prediction; the former is called classification and the latter is called regression estimation.

Slide 4 of 90: Classification vs. Prediction
Predictive mining tasks

Attr1 to Attr3 are the condition attributes (classification attributes); Class is the decision attribute (class label).

Attr1  Attr2  Attr3  Class
a11    a21    a31    c1
a11    a22    a32    c1
a12    a21    a33    c2
a13    a23    a34    c3
a12    a24    a32    c2
a13    a22    a33    c1

Modeling on this data yields a classification model, which is then queried with a new instance: Attr1, Attr2, Attr3 -> c?

Modeling steps:
- Split the dataset into a training set and a test set.
- Build the model on the training set (modeling).
- Test the model on the test set.

Slide 5 of 90: Two-step process of prediction (I)
Here, prediction covers both classification and regression estimation.

Step 1: Construct a model to describe a training set.
- The set of tuples used for model construction is called the training set.
- Data tuples are also called instances, samples, examples, etc.

Training set:
Name  Rank             Year  Tenured
Mike  Assistant Prof.  3     no
Mary  Assistant Prof.  7     yes
Bill  Professor        2     yes
Jim   Assistant Prof.  7     yes
Dave  Assistant Prof.  6     no
Anne  Assistant Prof.  3     no

Training set -> prediction algorithm -> prediction model:
IF Rank = Professor OR Year > 6 THEN Tenured = yes
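As a minimal sketch of step 1, the learned rule can be written as a function and checked against the six training tuples. The rule is read here as Rank = Professor OR Year > 6, which is an assumption about the slide's intent; it is the form that reproduces every training label. The names `predict_tenured` and `training_accuracy` are hypothetical helpers introduced only for illustration.

```python
# Training tuples from the slide: (name, rank, years, tenured).
TRAINING_SET = [
    ("Mike", "Assistant Prof.", 3, "no"),
    ("Mary", "Assistant Prof.", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Assistant Prof.", 7, "yes"),
    ("Dave", "Assistant Prof.", 6, "no"),
    ("Anne", "Assistant Prof.", 3, "no"),
]


def predict_tenured(rank, years):
    """Assumed reading of the rule: Rank = Professor OR Year > 6."""
    return "yes" if rank == "Professor" or years > 6 else "no"


def training_accuracy(rule, rows):
    """Fraction of rows whose label the rule reproduces."""
    hits = sum(1 for _, rank, years, label in rows if rule(rank, years) == label)
    return hits / len(rows)


if __name__ == "__main__":
    print(training_accuracy(predict_tenured, TRAINING_SET))
```

A purely conjunctive reading (Professor AND more than six years) would miss Mary, Bill and Jim, which is why the disjunctive form is assumed here.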

Slide 6 of 90: Two-step process of prediction (II)
Step 2: Use the model to predict unseen instances.
- Before using the model, its accuracy can be estimated on a test set.
- The test set is independent of the training set.
- The expected output of each test instance is compared with the actual output of the model.
- For classification, accuracy is usually measured as the percentage of test instances that the model classifies correctly.
- For regression estimation, accuracy is usually measured by the mean squared error.

Test set:
Name   Rank             Year  Tenured
Tom    Assistant Prof.  2     no
Join   Assistant Prof.  7     no
Geor   Professor        5     yes
Josep  Assistant Prof.  7     yes

Test set -> prediction model -> accuracy
Unseen data, e.g. (Jeff, professor, 7), is then passed to the prediction model.

Slide 7 of 90: Supervised vs. Unsupervised learning
Supervised learning
- The training data are accompanied by labels indicating the class of the observations.
- Unseen data are classified according to the predetermined classes.
Unsupervised learning
- The class labels of the training data are unknown.
- Given a set of measurements, observations, etc., the task is to establish the existence of classes or clusters in the data.

Slide 8 of 90: What should precede prediction?
- Data cleaning: preprocess the data to handle noise or missing values. Although most prediction algorithms have some mechanism for handling noise or missing values, this step can help reduce confusion during learning.
- Relevance analysis: remove irrelevant or redundant attributes. In machine learning, this step is called feature selection.
- Data transformation: generalize and/or normalize the data. Without normalization, attributes with large values may outweigh attributes with smaller values.

Slide 9 of 90: How to evaluate prediction algorithms?
- Predictive accuracy: the ability of the model to correctly predict unseen instances.
- Speed: the computational cost of generating and using the model, i.e. training time cost vs. test time cost. Usually, the higher the training time cost, the lower the test time cost and the more accurate the trained model.
- Robustness: the ability of the model to deal with noise or missing values.
- Scalability: the ability of the model to deal with huge volumes of data.
- Interpretability: the level of comprehensibility of the model.

Slide 10 of 90: How to estimate accuracy?
Two popular methods:
- Hold-out: partition the data set into two independent subsets, i.e. a training set and a test set. Usually 2/3 of the data set is used for training and the remaining 1/3 for testing. Hold-out with random subsampling repeats the hold-out test k times.
- k-fold cross-validation: partition the data set into k mutually exclusive subsets of approximately equal size, then perform training and testing k times. In the i-th round, the i-th subset is used for testing and the remaining subsets are used together for training. 10-fold cross-validation is often used.

[Figure: 3-fold example. The dataset is split into subsets 1, 2 and 3; the three rounds use (1, 2) for training and 3 for testing, (1, 3) for training and 2 for testing, and (2, 3) for training and 1 for testing.]
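The two partitioning schemes described under "How to estimate accuracy?" can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the course's own code: the function names `hold_out_split` and `k_fold_splits`, the `seed` parameter, and the choice to shuffle before splitting are all assumptions.

```python
import random


def hold_out_split(rows, train_fraction=2 / 3, seed=0):
    """Shuffle the rows, then cut them into a training part and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold_splits(rows, k=10, seed=0):
    """Yield k (training, test) pairs; fold i is the test set in round i."""
    if not 2 <= k <= len(rows):
        raise ValueError("k must be between 2 and the number of rows")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Striding by k gives mutually exclusive folds of approximately equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


if __name__ == "__main__":
    data = list(range(30))
    train, test = hold_out_split(data)
    print(len(train), len(test))
    for train, test in k_fold_splits(data, k=3):
        print(len(train), len(test))
```

Hold-out with random subsampling corresponds to calling `hold_out_split` repeatedly with different `seed` values and averaging the resulting accuracy estimates.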

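The two accuracy measures named under "Two-step process of prediction (II)" can be illustrated on the slide's own test set. The sketch below assumes the tenure rule is read as Rank = Professor OR Year > 6; the helper names are hypothetical, and the numbers passed to `mean_squared_error` are made-up values used only to show the formula, not data from the course.

```python
# Test tuples from the slide: (name, rank, years, tenured).
TEST_SET = [
    ("Tom", "Assistant Prof.", 2, "no"),
    ("Join", "Assistant Prof.", 7, "no"),
    ("Geor", "Professor", 5, "yes"),
    ("Josep", "Assistant Prof.", 7, "yes"),
]


def predict_tenured(rank, years):
    """Assumed reading of the rule: Rank = Professor OR Year > 6."""
    return "yes" if rank == "Professor" or years > 6 else "no"


def classification_accuracy(expected, predicted):
    """Fraction of instances whose predicted label equals the expected label."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(1 for e, p in zip(expected, predicted) if e == p) / len(expected)


def mean_squared_error(expected, predicted):
    """Average squared difference between expected and predicted values."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    return sum((e - p) ** 2 for e, p in zip(expected, predicted)) / len(expected)


expected_labels = [label for _, _, _, label in TEST_SET]
predicted_labels = [predict_tenured(rank, years) for _, rank, years, _ in TEST_SET]

if __name__ == "__main__":
    print(classification_accuracy(expected_labels, predicted_labels))
    # Hypothetical regression outputs, only to exercise the MSE formula.
    print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))
```

Under the assumed rule, only the second test tuple (an assistant professor with 7 years who is not tenured) is misclassified, so three of the four test instances are predicted correctly.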