机器学习-Ch1-3

主要内容：ML概述、KNN、Naive Bayes、Regression、SVM

[TOC]

Ch1 Overview of ML

对ML的分类：

监督学习(Supervised learning) 其基本思想是给定数据集中的样本是带有”正确答案”的,学习这些数据之后,再来新样本时,可以做出预测. 常见的问题有垃圾邮件分类。

无监督学习(Unsupervised learning) 给定数据集没有标签。常用算法：聚类。常见问题：新闻分类、细分市场。

Application:

Supervised learning:
- Linear Classifier
- Support Vector Machines (hard SVM, soft SVM)
- Kernel Methods
- Deep Learning
Unsupervised learning:
- Linear Discriminant Analysis(LDA)
- Principle Component Analysis (PCA)
- Generative Models(e.g. GMM, K-means)

Supervised learning

一些参数以及符号：

Model ， Loss Function ，模型中的参数，输入输出

The objective of the supervised learning is to find the parameters that minimises the average loss:

$\min_{\theta}\frac 1 n \sum_{i=1}^n \mathcal{L}(f(x_i;\theta),y_i)$

Unsupervised learning

Learning patterns when no specific target output values are supplied.(实际上就是没有label)

Examples:

Clustering: group data into groups
Building probabilistic model to explain data
Anomaly detection

Types of machine learning: shallow vs. deep （传统机器学习VS深度学习）

Traditional machine learning	Deep learning
Important step: feature design	Allows raw input
Usually work with “feature vectors”	End-to-end learning
Mapping function is simple, with relatively small number of parameters	Complex models, with millions of parameters
Works well if the input can be captured by vectors, small to medium number of samples	Works well if the “right feature” is unknown or the input is complex and a large number of samples are available

进行ML的流程：

Problem formulation – What is the input? What is the expected outcome?(输入输出？)
Data collection
1. Collect data
2. Annotation（给data以label）
Design machine learning algorithm
1. Choose the machine learning model
2. Choose the objective function
Training machine learning model: Learn the decision system from training data
Applying machine learning model

Ch2 Classification

classification与regression的对比：

Classification	Regression
predict labels	predict values
categorical, often no order	deal with ordered values
all differences are equal	some differences are bigger than others

classifier的一种划分：

binary 二分类
multi-class 多分类
multi-label 多标签

classifier的划分以及应用：

instance-based(KNN): classifiers use observation directly(no underlying model)

K-Nearest neighbours just checks what the class of nearby training data are
generative(Naive Bayes): build a generative statistic model

Naïve Bayes attempts to model the underlying probability distributions of data
discriminative(Decision Tree): directly estimate a decision rule/boundary

Decision trees are concerned with establishing the boundaries between classes

欧式距离 Euclidean Distance:

马氏距离(估计不考)Mahalanobis Distance:

KNN

Test Error的计算：假设classifier是，那么对于我们的预测结果是。如果预测结果与实际的结果之间不同，那么error.

当值变化的时候，训练误差()和测试误差()又会如何变化？有如下三种情况。

值代表的是取周围的个邻居。

当值较小(例如)时，模型的复杂度较高。每个测试样本只考虑其最近的一个邻居，这可能导致模型对噪声和局部变化非常敏感。在这种情况下，训练误差很低，因为模型会完美地匹配训练样本，但测试误差较高，因为模型在未见过的样本上可能无法泛化。
当值较大时(例如)，模型的复杂度较低。考虑更多的邻居可以平滑决策边界并减少噪声的影响。在这种情况下，训练误差可能会增加，因为模型会更多地考虑其他类别的样本，测试误差可能会减少，因为模型在未见过的样本上更具泛化能力。
当值等于样本数量时(例如，其中是样本数量)，相当于使用全部训练数据作为邻居。在这种情况下，模型的复杂度较低，因为它会考虑所有训练样本的投票。训练误差可能会增加，因为模型更加保守。测试误差可能会减少，因为模型使用了更多的信息。

KNN is a type of lazy learning(实际上并没有train的过程)

Lazy: learning only occurs when you see the test example

Eager: learn a model prior to seeing the test example

KNN的好处与坏处（评价）：

Advantages	Disadvantages
Easy to understand	Hard to find a good distance measure
Very flexible decision boundaries	Irrelevant features and noise reduce performance significantly
No learning required	Cannot handle more than a few dozen attributes (Curse of Dimensionality)
	Computationally expensive

overfitting与underfitting

	overfitting	underfitting
特征	Model could fit any data(trainset上结果很好), also fit the noise(testset上的结果很烂)	Model is too inflexible to match the data, too few learnable parameters, wrong learnable parameters(在trainset和testset上表现都很烂，根本没训练好)
解决方法solution	Less complex model, More data, Regularisation (soft reduction in parameters)	Better data/pre-processing, More complex model, Less regularisation

Validation set

在机器学习中，我们通常将数据集划分为训练集（Training Set）、验证集和测试集（Test Set）。验证集（Validation Set）是用于模型选择和超参数调优的一个独立数据集。

训练集: 用于模型的参数估计和训练过程

测试集:用于评估最终模型的性能

验证集:用于选择不同模型和调整超参数的过程

Cross validation: 其中的Leave-one-out cross-validation。将数据集分为多个子集，拿一个作为validation。

Naive Bayes

一些背景知识：

Bayes的limitations：需要有data，无data不工作（基于数据的统计，生成概率）。

Bayes公式：，: 表述的是在x事件发生的背景下y事件发生的概率。

极大似然估计：

Naive Bayes：将Bayes Theorem应用于有着先验概率或者可能性的数据集中。

Naive的来由：数据属性之间的相关性，我们假设是独立的（assume independence）。

Multinomial Naïve Bayes：

$P(A\&B)=P(A)\times P(B)\\ P(Y|A\&B)=\frac{P(Y)*P(A|Y)*P(B|Y)}{P(X)}$

将极大似然估计(Maximum likelihood)应用在Naive Bayes中：

而这样不好计算，于是变成对数函数：

: 其中代表其中的参数。

Application

Bag of words(词袋)

X：words
Y：是否是垃圾邮件(spam)—二分类问题

举例：“the most interesting film this summer”

数学原理：是我们想要得到的最终结果，也就是是否是垃圾邮件的判定。以下公式的推导是基于朴素贝叶斯假设，通过计算每个类别c的后验概率来选择最有可能的类别，实现对文档进行分类。

$P(d|c)&=\prod_{i\in positions}P(w_i|c)\\ \hat c&=\mathop{\arg\max}\limits_{c}P(c|d)\\ &=\mathop{\arg\max}\limits_{c}P(d|c)P(c)\\ &=\mathop{\arg\max}\limits_{c}\prod_{i\in positions}P(w_i|c)P(c)$

P(d|c)：这表示在给定类别c的条件下，文档d出现的概率。它是通过假设文档中的每个词（）在给定类别c的条件下是独立的，并且通过对这些条件概率的乘积来计算的。具体而言，这个公式使用了一个位置集合（positions），表示文档中被考虑的词的位置，然后计算每个位置上词在给定类别c的条件下的概率，并将它们相乘得到。

：这表示通过贝叶斯决策规则（Bayes’ decision rule）选择的最有可能的类别。在这个公式中，我们通过计算每个类别c的后验概率来选择最有可能的类别。根据贝叶斯定理，可以通过计算得到。

：这表示类别c在整个数据集中出现的概率，也称为类别的先验概率。它可以通过计算在训练数据中类别c的频率来估计。

接下来要如何计算呢？还是一样的，不好处理，我们转换为log+。

$\prod_{i\in positions}P(w_i|c)P(c)&=\sum_{i\in positions}log(P(w_i|c))+log(P(c))\\ &=\sum_{k\in|V|}n_klog(P(w_i|c))+log(P(c))$

: The number of occurrences of the k th word in the vocabulary. It can be zero or non-zero. 词汇表中的计数，代表文档document中其中的某个word。

: The list of vocabulary, 词汇表的大小

Decision Tree

一个决策的过程，我们需要确定：

Which variable to test
What to test it against
What function to apply to a variable prior to a test

决策树模拟了一个树形结构，其中每个内部节点表示对一个特征的测试，每个分支代表测试的结果，每个叶节点代表一个类别或回归值。它的构建过程从根节点开始，选择最佳的特征进行测试。选择的依据可以是信息增益（Information Gain：）、基尼系数（Gini Impurity）或其他度量。通过测试特征将数据集划分为不同的子集，然后递归地在每个子集上重复这个过程，直到达到停止条件，如达到最大深度、节点包含的样本数小于某个阈值或节点中的样本属于同一个类别。

分类问题中：决策树的叶节点表示一个类别，通过在树上从根节点到叶节点的路径，决策树可以对新的样本进行分类。
回归问题中：叶节点表示一个预测值，通过在树上从根节点到叶节点的路径，决策树可以对新的样本进行回归预测。

Ch3 Regression

对regression与classification的区分：

classification中Y是离散变量，regression中Y是连续变量。

Application

房价预测
图像处理(Image processing)
人数统计(Crowd counting)
生成声音(Generate sounding)

Linear Regression

线性回归：给定一些样本点，找出一条可以拟合的曲线。

$y=&f(x)\\ y=&w^Tx=\sum^D_{k=1}w_kx_k\\ err=&(f(x)-\hat y)^2\\ \mathcal{L}=&\sum^N_{i=1}(f(x_i)-\hat y_i)^2\\ =&\sum^N_{i=1}(w^Tx_i-\hat y_i)^2\\ =&||w^TX-\hat y||^2$

计算求其对w的导数=0,：

$\frac{\partial\mathcal{L}}{\partial w}&=0\\ \frac{\partial{||w^T X-\hat y||^2}}{\partial w}&=0\\ w&=(XX^T)^{-1}X\hat y^T$

上述的代表最终的结果是一个值scalar，但如果最终的结果是一个向量，则写成。

$w=(XX^T)^{-1}X\hat Y^T$

不能用Regression去Classification的理由：

Limitation:

– Put unnecessary requirements to the predicted output

– May increase the fitting difficulty and lead to bad training result
But why is it commonly used in practice?

– Close-form solution, less storage for low

-dimensional data

– Quick update for incremental learning, distributed learning

Regularized Regression

即添加了正则化惩罚项regularization的Linear Regression。

regularization的用途？

避免过拟合 avoid overfitting
Enforce certain property of solution

其形式：, 为正则化惩罚项。

对p-范式的介绍(p-Norm)：
$||x||_p=(\sum^d_{j=1}|x^j|^p)^{\frac 1 p}$
当p=1的时候：Lasso Regression

当p=2的时候：Ridge Regression

Ridge Regression

其Loss function的形式是：

$\mathcal{L}=||w^TX-\hat y||^2_2+\lambda||w||^2_2$

对其求解：

$w=(XX^T+\lambda I)^{-1}X\hat y^T$

如果不是可逆的，可能就不能写成，这意味着此处有多种方法去处理（而不是求逆）。如果加上一个正则化，那么它总是可逆的。It essentially provides a criterion for choosing the optimal solution among multiple equivalent solutions of the first-term。

Lasso Regression

其Loss Function的形式是：

$\mathcal{L}=||w^TX-\hat y||^2_2+\lambda||w||_1$

L1 norm encourages sparse solution. This could be useful for understanding the impact of various factors, e.g. perform feature selection.
Sometimes it can lead to an improved performance since it can suppressing noisy factors. 其原因是：计算出来。使得某些特征的可以取0。
Unfortunately, it does not have a close-form solution.

Support Vector Regression

Key idea: if the fitting error is already small enough, do not make it smaller.

SVR通过在二维空间中找到一个最优超平面来实现对回归过程的建模。由于这个最优超平面仅考虑到了在训练集周围边缘的点，使得模型对数据点的过拟合现象进行有效地避免。同时，根据再投影误差作为惩罚项的复杂度控制参数可以很好地调节回归模型的灵活性。

硬间隔（Hard-margin）:

$&\min \frac 1 2 ||w||^2\\ s.t. &y_i-wx_i-b\le\epsilon\\ &w_ix_i+b-y_i\le\epsilon$

软间隔（Soft-margin）：

$\min \frac 1 2 ||w||^2+C\sum^m_{i=1}(\xi_i+\xi_i^*)\\ s.t.\begin{cases} &y_i-wx_i-b\le\epsilon+\xi_i\\ &wx_i+b-y_i\le\epsilon+\xi_i^*\\ &\xi_i,\xi_i^*\ge0,i=1,\dots,m \end{cases}$

SVR:

$\min_{w,b}\frac 1 2 ||w||^2+\sum_{i}\max(1-y_i(w^Tx_i+b),0)$

表示在与0之间取大的那个。

如果是binary的：那么假设对于positive class是正数，对于negative class 是负数。当decision value不够大（或者是不够小）的时候，我们就会“激活”这个惩罚项。

Dual form of SVR

原始问题：