Data Mining Knowledge Discovery

- Classification Basic Concepts
- Decision Tree Induction
- Bayes Classification Methods
- Support Vector Machines
- Other Classification Methods
- Summary

Classification vs. Prediction
- Classification predicts categorical class labels.
- Prediction models continuous-valued functions, i.e. predicts numerical values.
- In some machine learning literature, both predicting categorical class labels and modeling continuous-valued functions are called prediction; the former is called classification and the latter is called regression estimation.

Classification vs. Prediction
Predictive mining tasks

Attr1   Attr2   Attr3   Class
a11     a21     a31     c1
a11     a22     a32     c1
a12     a21     a33     c2
a13     a23     a34     c3
a12     a24     a32     c2
a13     a22     a33     c1

Attr1 to Attr3: condition attributes (classification attributes)
Class: decision attribute (class label)

[Figure: the data set is used for modeling, which produces a classification model; the model maps a tuple (Attr1, Attr2, Attr3) to a class c?]
[Figure: modeling steps. The dataset is divided into a training set, used for modeling, and a test set, used to test the model.]

Two step process of prediction (I)
(here prediction covers both classification and regression estimation)
- Step 1: Construct a model to describe a training set.
  - The set of tuples used for model construction is called the training set.
  - Data tuples are also called instances, samples, examples, etc.

Training Set
Name   Rank              Year   Tenured
Mike   Assistant Prof.   3      no
Mary   Assistant Prof.   7      yes
Bill   Professor         2      yes
Jim    Assistant Prof.   7      yes
Dave   Assistant Prof.   6      no
Anne   Assistant Prof.   3      no

[Figure: the training set is fed to the prediction algorithm, which outputs the prediction model.]
Prediction model: If Rank = professor or Year > 6 Then Tenured = yes
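
As a minimal sketch of what the learned model on this slide amounts to, the rule can be written as a plain Python function and checked against the training set. The function name `predict_tenured` and the tuple layout are illustrative assumptions, not part of the slides. The rule is read here as Rank = professor or Year > 6, which is the reading that reproduces all six training labels.

```python
# Training set from the slide: (name, rank, years, tenured)
TRAINING_SET = [
    ("Mike", "Assistant Prof.", 3, "no"),
    ("Mary", "Assistant Prof.", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Assistant Prof.", 7, "yes"),
    ("Dave", "Assistant Prof.", 6, "no"),
    ("Anne", "Assistant Prof.", 3, "no"),
]


def predict_tenured(rank, years):
    """Hypothetical encoding of the rule, assumed to be: professor OR more than 6 years."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Count how many training tuples the rule labels correctly.
correct = sum(
    predict_tenured(rank, years) == tenured
    for _name, rank, years, tenured in TRAINING_SET
)
print(f"{correct}/{len(TRAINING_SET)} training tuples reproduced")
```

A model that fits its own training set says little about unseen instances, which is why the accuracy is estimated on a separate test set.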

Two step process of prediction (II)
- Step 2: Use the model to predict unseen instances.
  - Before using the model, we can estimate its accuracy with a test set.
  - The test set is independent of the training set.
  - The expected output of a test instance is compared with the actual output from the model.
  - For classification, the accuracy is usually measured by the percentage of test instances that are correctly classified by the model.
  - For regression estimation, the accuracy is usually measured by mean squared error.

Test Set
Name    Rank              Year   Tenured
Tom     Assistant Prof.   2      no
Join    Assistant Prof.   7      no
Geor    Professor         5      yes
Josep   Assistant Prof.   7      yes

[Figure: the test set is fed to the prediction model to estimate its accuracy; the model is then applied to unseen data, e.g. (Jeff, professor, 7).]

Supervised vs. Unsupervised learning
- Supervised learning
  - The training data are accompanied by labels indicating the class of the observations.
  - Unseen data are classified according to the predetermined classes.
- Unsupervised learning
  - The class labels of the training data are unknown.
  - Given a set of measurements, observations, etc., establish the existence of classes or clusters in the data.

What should precede prediction?
- Data cleaning
  - Preprocess data in order to handle noise or missing values.
  - Although most prediction algorithms have some mechanisms for handling noise or missing values, this step can help reduce confusion during learning.
- Relevance analysis
  - Remove the irrelevant or redundant attributes.
  - In machine learning, this step is called feature selection.
- Data transformation
  - Generalize and/or normalize data.
  - Without normalization, attributes with large values may outweigh attributes with smaller values.

How to evaluate prediction algorithms?
- Predictive accuracy: the ability of the model to correctly predict unseen instances.
- Speed: the computational cost involved in generating and using the model.
  - Training time cost vs. test time cost.
  - Usually, the larger the training time cost, the smaller the test time cost and the more accurate the trained model.
- Robustness: the ability of the model to deal with noise or missing values.
- Scalability: the ability of the model to deal with huge volumes of data.
- Interpretability: the level of comprehensibility of the model.

How to estimate accuracy?
Two popular methods:
- Hold-out
  - Partition the data set into two independent subsets, i.e. a training set and a test set.
  - Usually 2/3 of the data set is used for training and the remaining 1/3 for testing.
  - Hold-out with random subsampling: repeat the hold-out test k times.
- k-fold cross-validation
  - Partition the data set into k mutually exclusive subsets of approximately equal size.
  - Perform training and testing k times. In the i-th round, the i-th subset is used for testing while the remaining subsets are collectively used for training.
  - 10-fold cross-validation is often used.

[Figure: a dataset split into subsets 1, 2 and 3. Round one trains on 1 and 2 and tests on 3; round two trains on 1 and 3 and tests on 2; round three trains on 2 and 3 and tests on 1.]
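
The two partitioning schemes above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `holdout_split` and `kfold_splits`, the `seed` parameter, and the choice to shuffle before partitioning are not from the slides. Both helpers work on instance indices rather than on the data itself.

```python
import random


def holdout_split(n, train_fraction=2 / 3, seed=0):
    """Randomly partition indices 0..n-1 into a training part and a test part."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]


def kfold_splits(n, k, seed=0):
    """Yield (train, test) index lists, one pair per fold."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so the k folds are mutually exclusive and differ in size by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hold-out: 2/3 of 30 instances for training, 1/3 for testing.
train, test = holdout_split(30)
print(len(train), len(test))

# 3-fold cross-validation over 10 instances: every instance is tested exactly once.
for train, test in kfold_splits(10, 3):
    print("test on", sorted(test), "train on", len(train), "instances")
```

Hold-out with random subsampling corresponds to calling `holdout_split` k times with different seeds. In practice an existing library routine would normally be preferred over hand-written splitting.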