Machine Learning Question Bank（机器学习题库）

Uploaded by h****, document ID 1139358, upload date 2018-12-13, format DOC, 44 pages, 7.39 MB

I. Maximum Likelihood

1. ML estimation of an exponential model (10 points)

A Gaussian distribution is often used to model data on the real line, but it is sometimes inappropriate when the data are often close to zero but constrained to be nonnegative. In such cases one can fit an exponential distribution, whose probability density function is given by

$$p(x \mid b) = \frac{1}{b}\,e^{-x/b}, \qquad x \ge 0.$$

Given N observations $x_i$ drawn from such a distribution:
(a) Write down the likelihood as a function of the scale parameter b.
(b) Write down the derivative of the log likelihood.
(c) Give a simple expression for the ML estimate of b.

2. Repeat for a Poisson distribution:

$$p(x \mid \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \dots$$

$$l(\lambda) = \log \prod_{i=1}^{N} p(x_i \mid \lambda) = \sum_{i=1}^{N} \bigl( x_i \log \lambda - \lambda - \log x_i! \bigr)$$

II. Bayes

1. Applying Bayes' rule
Suppose that on a multiple-choice exam a student knows the correct answer with probability p and guesses with probability 1 − p. Assume a student who knows the answer answers correctly with probability 1, while a student who guesses picks the correct answer with probability 1/m, where m is the number of choices. Given that the student answered correctly, find the probability that they knew the answer.

Answer:
$$p(\text{know} \mid \text{correct}) = \frac{p}{p + \frac{1-p}{m}} = \frac{mp}{mp + 1 - p}$$

2. Conjugate priors
Given a likelihood $p(x \mid \theta)$ for a class of models with parameters $\theta$, a conjugate prior is a distribution $p(\theta \mid \alpha)$ with hyperparameters $\alpha$, such that the posterior distribution

$$p(\theta \mid X, \alpha) \propto p(X \mid \theta)\, p(\theta \mid \alpha)$$

belongs to the same distribution family as the prior.

(a) Suppose that the likelihood is given by the exponential distribution with rate parameter $\theta$:
$$p(x \mid \theta) = \theta e^{-\theta x}$$
Show that the gamma distribution
$$\mathrm{Gamma}(\theta \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\theta^{\alpha-1} e^{-\beta\theta}$$
is a conjugate prior for the exponential. Derive the parameter update given observations $x_1, \dots, x_N$ and the prediction distribution $p(x_{N+1} \mid x_1, \dots, x_N)$.

(b) Show that the beta distribution is a conjugate prior for the geometric distribution
$$p(x = k \mid \theta) = (1 - \theta)^{k-1}\theta$$
which describes the number of times a coin is tossed until the first heads appears, when the probability of heads on each toss is $\theta$. Derive the parameter update rule and prediction distribution.

(c) Suppose $p(\theta \mid \alpha)$ is a conjugate prior for the likelihood $p(x \mid \theta)$; show that the mixture prior
$$p(\theta \mid \alpha_1, \dots, \alpha_M) = \sum_{m=1}^{M} w_m\, p(\theta \mid \alpha_m)$$
is also conjugate for the same likelihood, assuming the mixture weights $w_m$ sum to 1.

(d) Repeat part (c) for the case where the prior is a single distribution and the likelihood is a mixture, and the prior is conjugate for each mixture component of the likelihood. (Note: some priors can be conjugate for several different likelihoods; for example, the beta is conjugate for the Bernoulli and the geometric distributions, and the gamma is conjugate for the exponential and for the gamma with fixed shape.)
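As a quick numerical sketch of the exponential ML estimate and the gamma–exponential conjugate update above: the scale estimate $\hat b$ is the sample mean, and the gamma prior on the rate updates to $\mathrm{Gamma}(\alpha_0 + N,\ \beta_0 + \sum_i x_i)$. The simulated data and the prior hyperparameters ($\alpha_0 = \beta_0 = 1$) below are arbitrary choices for illustration.

```python
import random

random.seed(0)
b_true = 2.0                      # true scale; the rate is 1 / b_true
xs = [random.expovariate(1.0 / b_true) for _ in range(10000)]
N = len(xs)

# ML estimate of the scale b: the sample mean (from setting dl/db = 0).
b_hat = sum(xs) / N

# Conjugate update: Gamma(alpha0, beta0) prior on the rate theta;
# the posterior after observing x_1..x_N is Gamma(alpha0 + N, beta0 + sum(xs)).
alpha0, beta0 = 1.0, 1.0          # arbitrary illustrative prior
alpha_post = alpha0 + N
beta_post = beta0 + sum(xs)
rate_post_mean = alpha_post / beta_post   # close to 1 / b_true for large N
```

With 10,000 samples both estimates land close to the truth (scale 2.0, rate 0.5), since the prior's influence is washed out by the data.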

(e) (Extra credit, 20) Explore the case where the likelihood is a mixture with fixed components and unknown weights; i.e., the weights are the parameters to be learned.

III. True/False

(1) Given n data points, if half are used for training and the other half for testing, the gap between training error and test error decreases as n increases.
(2) Maximum-likelihood estimation is unbiased and has the smallest variance among all unbiased estimators, so the maximum-likelihood estimate has the smallest risk.
(3) For regression functions A and B, if A is simpler than B, then A will almost surely perform better than B on the test set.
(4) Global linear regression uses all of the training samples to predict the output for a new input, whereas locally weighted linear regression uses only the samples near the query point; therefore global linear regression is computationally more expensive than local linear regression.
(5) Boosting and Bagging both combine multiple classifiers by voting, and both determine each classifier's weight from its accuracy.
(6) In the boosting iterations, the training error of each new decision stump and the training error of the combined classifier vary roughly in concert. (F) While the training error of the combined classifier typically decreases as a function of boosting iterations, the error of the individual decision stumps typically increases, since the example weights become concentrated at the most difficult examples.
(7) One advantage of Boosting is that it does not overfit. (F)
(8) Support vector machines are resistant to outliers, i.e., very noisy examples drawn from a different distribution. ( )
(9) In regression analysis, best-subset selection can perform feature selection, but is computationally expensive when the number of features is large; ridge regression and the Lasso are computationally cheaper, and the Lasso can also perform feature selection.
(10) Overfitting is more likely to occur when the training data are scarce.
(11) Gradient descent can become trapped in local minima, but the EM algorithm cannot.
(12) In kernel regression, the parameter with the most influence on the balance between overfitting and underfitting is the kernel width.
(13) In the AdaBoost algorithm, the weights on all the misclassified points will go up by the same multiplicative factor. (T)
(14) True/False: In a least-squares linear regression problem, adding an L2 regularization penalty cannot decrease the L2 error of the solution w on the training data. (T: the unpenalized least-squares solution already minimizes the training L2 error, so any other solution, including the regularized one, can only match or increase it.)
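A one-variable sketch of the effect behind statement (14): with toy data and a no-intercept model (both arbitrary choices here), the minimizer of $\sum_i (y_i - wx_i)^2 + Cw^2$ has a closed form, and its training error never falls below that of the ordinary least-squares solution.

```python
# Toy data (illustrative only); model y ≈ w * x with no intercept.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]

sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

def train_sse(w):
    """Training squared error for a given weight w."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

w_ols = sxy / sxx              # minimizes train_sse exactly
w_ridge = sxy / (sxx + 10.0)   # minimizes train_sse(w) + 10 * w**2
# w_ols is the unconstrained minimizer, so train_sse(w_ridge) >= train_sse(w_ols).
```

The penalty shrinks the weight away from the least-squares optimum, which can only raise (or, if the two solutions coincide, leave unchanged) the training error; any benefit of ridge regression is on test error, not training error.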

(15) True/False: In a least-squares linear regression problem, adding an L2 regularization penalty always decreases the expected L2 error of the solution w on unseen test data. (F)
(16) Besides the EM algorithm, gradient descent can also be used to estimate the parameters of a Gaussian mixture model. (T)
(20) Any decision boundary that we get from a generative model with class-conditional Gaussian distributions could in principle be reproduced with an SVM and a polynomial kernel. True! In fact, since class-conditional Gaussians always yield quadratic decision boundaries, they can be reproduced with an SVM with a polynomial kernel of degree less than or equal to two.
(21) AdaBoost will eventually reach zero training error, regardless of the type of weak classifier it uses, provided enough weak classifiers have been combined. False! If the data are not separable by a linear combination of the weak classifiers, AdaBoost cannot achieve zero training error.
(22) The L2 penalty in a ridge regression is equivalent to a Laplace prior on the weights. (F: it corresponds to a Gaussian prior; a Laplace prior yields the L1 penalty of the Lasso.)
(23) The log-likelihood of the data will always increase through successive iterations of the expectation maximization algorithm. (F: EM guarantees the log-likelihood never decreases, but it can remain unchanged, e.g. at a fixed point.)
(24) In training a logistic regression model by maximizing the likelihood of the labels given the inputs we have multiple locally optimal solutions. (F: the log-likelihood is concave, so every local optimum is global.)

IV. Regression

1. Consider a regularized regression problem. The figure below shows the log-likelihood (mean log-probability) on the training and test sets, under a quadratic regularization penalty, for different values of the regularization parameter C. (10 points)
(1) Is the claim "as C increases, the training-set log-likelihood in Figure 2 never increases" correct? Explain why.
(2) Explain why the test-set log-likelihood in Figure 2 decreases when C takes large values.

2. Consider the linear regression model $y \sim \mathcal{N}(w_0 + w_1 x,\ \sigma^2)$, with the training data shown in the figure below. (10 points)
(1) Estimate the parameters by maximum likelihood, and sketch the resulting model in figure (a). (3 points)
(2) Estimate the parameters by regularized maximum likelihood, i.e., add the regularization penalty $C w_1^2$ to the log-likelihood objective, and sketch the resulting model in figure (b) for a very large value of C. (3 points)
(3) After regularization, does the variance $\sigma^2$ of the Gaussian become larger, smaller, or stay unchanged? (4 points)
[Figure (a)]  [Figure (b)]

3. Consider a regression problem on points $x = (x_1, x_2)^T$ in the two-dimensional input space, where $x$ lies in the unit square ($x_j \in [0,1]$, $j = 1, 2$). Training and test samples are distributed uniformly in the unit square, and the outputs are generated from a fixed polynomial model $y \sim \mathcal{N}(f(x),\ \sigma^2)$. We learn the relationship between x and y by linear regression on polynomial features of order 1 through 10 (a higher-order feature model contains all lower-order features), with squared-error loss.
(1) Now train models with features of order 1, 2, 8, and 10 on n samples, then test them on a large independent test set; mark the appropriate model(s) in each of the three columns below (more than one choice may apply), and explain why the model you chose in the third column has a small test error. (10 points)

                                 smallest        largest         smallest
                                 training error  training error  test error
linear model, order-1 features                   X
linear model, order-2 features                                   X
linear model, order-8 features   X
linear model, order-10 features  X

(2) Now train the same models on n = 10^6 samples and test them on a large independent test set; again mark the appropriate model(s) (more than one choice may apply), and explain why the model you chose in the third column has a small test error. (10 points)

                                 smallest        largest         smallest
                                 training error  training error  test error
linear model, order-1 features                   X
linear model, order-2 features
linear model, order-8 features   X                               X
linear model, order-10 features  X

(3) The approximation error of a polynomial regression model depends on the number of training points. (T)
(4) The structural error of a polynomial regression model depends on the number of training points. (F)

4. We are trying to learn regression parameters for a dataset which we know was generated from a polynomial of a certain degree, but we do not know what this degree is. Assume the data was actually generated from a polynomial of degree 5 with some added Gaussian noise, that is
$$y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5 + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1).$$
For training we have 100 (x, y) pairs and for testing we are using an additional set of 100 (x, y) pairs. Since we do not know the degree of the polynomial, we learn two models from the data: model A learns parameters for a polynomial of degree 4 and model B learns parameters for a polynomial of degree 6. Which of these two models is likely to fit the test data better?
Answer: the degree-6 polynomial. Since the true model is a degree-5 polynomial and we have enough training data, the model we learn for a degree-6 polynomial will likely fit a very small coefficient for $x^6$. Thus, even though it is a sixth-degree polynomial, it will actually behave in a very similar way to a fifth-degree polynomial, which is the correct model, leading to a better fit to the data.

5. Input-dependent noise in regression
Ordinary least-squares regression is equivalent to assuming that each data point is generated according to a linear function of the input plus zero-mean, constant-variance Gaussian noise.

In many systems, however, the noise variance is itself a positive linear function of the input (which is assumed to be non-negative, i.e., x ≥ 0).
(a) Which of the following families of probability models correctly describes this situation in the univariate case? (Hint: only one of them does.)
(iii) is correct. In a Gaussian distribution over y, the variance is determined by the coefficient of $y^2$; so by replacing $\sigma^2$ with $\sigma^2 x$, we get a variance that increases linearly with x. (Note also the change to the normalization "constant.") (i) has quadratic dependence on x; (ii) does not change the variance at all, it just renames $w_1$.
(b) Circle the plots in Figure 1 that could plausibly have been generated by some instance of the model family(ies) you chose.
(ii) and (iii). (Note that (iii) works for ….) (i) exhibits a large variance at x = 0, and the variance appears independent of x.
(c) True/False: Regression with input-dependent noise gives the same solution as ordinary regression for an infinite data set generated according to the corresponding model.
True. In both cases the algorithm will recover the true underlying model.
(d) For the model you chose in part (a), write down the derivative of the negative log likelihood with respect to $w_1$.

V. Classification

1. Generative vs. discriminative models
(a) Your billionaire friend needs your help. She needs to classify job applications into good/bad categories, and also to detect job applicants who lie in their applications, using density estimation to detect outliers. To meet these needs, do you recommend using a discriminative or generative classifier? Why?
A generative model, because outlier detection requires estimating the density $p(x \mid y)$.
(b) Your billionaire friend also wants to classify software applications to detect bug-prone applications using features of the source code. This pilot project only has a few applications to be used as training data, though. To create the most accurate classifier, do you recommend using a discriminative or generative classifier? Why?
A discriminative model: when training samples are few, a discriminative model that classifies directly usually performs better.
(d) Finally, your billionaire friend also wants to classify companies to decide which one to acquire. This project has lots of training data based on several decades of research. To create the most accurate classifier, do you recommend using a discriminative or generative classifier? Why?
A generative model: when training data is plentiful, the correct generative model can be learned.

2. Logistic regression
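For the logistic regression heading above (and true/false item (24)): the log-likelihood is concave, so plain gradient ascent reaches its optimum without getting stuck in local minima. A minimal sketch on arbitrary one-dimensional toy data:

```python
import math

# Toy separable data (arbitrary): label 1 for larger x, label 0 otherwise.
data = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (-0.5, 0),
        (0.5, 1), (1.0, 1), (1.5, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = 0.0, 0.0
lr = 0.5
for _ in range(2000):
    # Gradient of the log-likelihood: sum over (y_i - p_i) * features.
    gw = sum((y - sigmoid(w * x + b)) * x for x, y in data)
    gb = sum((y - sigmoid(w * x + b)) for x, y in data)
    w += lr * gw
    b += lr * gb
```

After training, the model assigns high probability to the positive class at x = 2 and low probability at x = −2; because the objective is concave, any reasonable step size and starting point converge toward the same optimum.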
