Schwertlilien
As a recoder: notes and ideas.

Machine Learning - Ch7-10

Main topics: K-means, GMM, kernel methods, ensemble learning, CNN

Contents

[TOC]

Ch7 Clustering

Unsupervised learning

Learning without supervision

  • Find the distribution of data
  • Learning to generate samples
  • Clustering
  • Anomaly detection
  • Feature learning

Clustering

  • One of the most important unsupervised learning tasks
  • Clustering is the process of identifying groups, or clusters, of data points in a (usually) multidimensional space.
  • Connection to distribution modeling: related to mixture model

Clustering: discover groups such that samples within a group are more similar to each other than samples across groups.

Applications of clustering:

  • Segmentation
  • Superpixel
  • Vector quantization in Bag-of-feature model

Ingredients of clustering:

  • A dissimilarity function between samples
  • A loss function to evaluate clusters
  • Algorithm that optimizes this loss function

A common dissimilarity function is the Euclidean distance: \[ D(x_i,x_{i'})=\sqrt{\sum^p_{j=1}(x_{ij}-x_{i'j})^2} \] \(x_{ij}\): the j-th feature of \(x_i\), j=1,2,\(\dots\),p

Clustering is invariant to rotation and translation of features, but not to scaling.

K-means Clustering

Basic idea:

  • Each sample can only fall into one of the k groups
  • From each group, we can calculate the mean vectors, that is, the average of samples falling into a group
  • Any sample should be closer to the mean vector of its own group than to the mean vectors of the other groups

K-means takes an iterative approach to solving this problem.

Algorithm:

  • E-step: fix the current mean vectors of the groups. If a sample is closest to the i-th mean vector, assign it to the i-th group
  • M-step: fix the assignments and recalculate the mean vector of each group
  • Iterate the E-step and M-step until convergence

Questions:

Why do these two steps lead to a converged result?

By repeatedly iterating the E-step and M-step, K-means gradually improves the cluster assignments and the centroid positions until convergence. At convergence the cluster assignments no longer change and the centroids are stable, which means the algorithm has reached a (locally) stable clustering result.

Why do we always calculate mean vectors?

To update the cluster centers: in each iteration the distances between data points and centroids are recomputed from these means, and the points are then reassigned to clusters.

What is the objective function of the k-means clustering algorithm?

– The objective function gives a measure of the goodness of the clustering results

\[ \mathcal{J} = \sum_{i=1}^{n} \sum_{k=1}^K r_{ik} ||x_i - \mu_k||^2_2 \]

\(r_{ik}\): Each point is assigned to one of the prototypes; the distance between the point and its prototype is the cost of the assignment. We seek the optimal assignment that minimizes this cost.

\(\mu_k\): We are also seeking the optimal prototypes that minimize the total cost.
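
As a concrete illustration of the E-step/M-step iteration and the objective \(\mathcal{J}\) above, here is a minimal NumPy sketch (function and variable names are my own, not from the lecture):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: X is (n, p); returns assignments, cluster means, and J."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initial mean vectors
    for _ in range(n_iter):
        # E-step: assign each sample to its closest mean vector
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, K)
        r = dists.argmin(axis=1)
        # M-step: recompute the mean vector of each group
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # converged: centroids stable
            break
        mu = new_mu
    J = ((X - mu[r]) ** 2).sum()      # objective J = sum_i ||x_i - mu_{r_i}||^2
    return r, mu, J
```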

How to choose K? Cross-validation.

Limitations of k-means

  1. With hard assignments of data points to clusters, a small perturbation to a data point can flip it to another cluster
  2. Assumes spherical clusters and equal probabilities for each cluster
  3. These limitations can be addressed by GMM-based clustering

Gaussian Mixture Models (GMM)

Gaussian mixture model: a GMM can characterize a distribution that contains K clusters. GMM assumes the observed data points are generated from a mixture of several Gaussian distributions, each with a weight \(\pi_k\) giving the probability of choosing that component. \[ \text{Multivariate Gaussian: } P(x)=\mathcal{N}(x|\mu,\Sigma)=\frac{\exp(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu))}{\sqrt{(2\pi)^d|\Sigma|}}\\ \text{Gaussian mixture model: } P(x)=\sum^K_{k=1}\pi_k\mathcal{N}(x|\mu_k,\Sigma_k),\quad\sum^K_{k=1}\pi_k=1,\ 0\le\pi_k\le1 \] where d is the dimension of x. Generative process:

  1. Randomly select the k-th Gaussian distribution according to \(\pi_k\)

  2. Randomly sample x from the selected Gaussian distribution

\[ P(\gamma=k)=\pi_k\\ P(x|\gamma=k)=\mathcal{N}(x|\mu_k,\Sigma_k)\\ P(x)=\sum^K_{k=1}\pi_k\mathcal{N}(x|\mu_k,\Sigma_k) \]

\(\gamma\): a latent variable used to select one particular Gaussian from the mixture. \(\gamma\) chooses which Gaussian generates the data point according to the component probabilities \(\pi_k\).
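
A short sketch of this generative process (pick a component according to \(\pi_k\), then sample from that Gaussian); the parameter values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-component GMM in 2-D (illustrative parameters, not from the notes)
pi = np.array([0.3, 0.7])                            # mixture weights, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # component means
covs = [np.eye(2), 0.5 * np.eye(2)]                  # component covariances

def sample_gmm(n):
    """Generative process: gamma ~ Categorical(pi), then x ~ N(mu_gamma, Sigma_gamma)."""
    gammas = rng.choice(len(pi), size=n, p=pi)        # latent variable gamma
    xs = np.array([rng.multivariate_normal(mus[g], covs[g]) for g in gammas])
    return xs, gammas

X, gamma = sample_gmm(500)
```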

Latent Variable

  1. Intermediate results inside a generative process
  2. Each sample is associated with a latent variable
  3. We do not know its exact value
  4. But we can infer how likely it can be

We cannot directly observe or measure a latent variable, but it affects the final outcome; we can only infer how likely its values are.

For example, a student's learning ability is a latent variable: it influences exam results, but we cannot measure it exactly.

Parameter estimation for GMM: use MLE (maximum likelihood estimation), then set the partial derivatives to zero, \(\frac{\partial \mathcal{L}}{\partial \theta}=0\).

EM algorithm solution to GMM

  • E-step: calculate the posterior probability of the latent variable (for each \(x_i\), compute the probability \(r'_{ik}\) for each cluster; this corresponds to assigning \(x_i\) to a cluster in k-means) \[ r'_{ik}=P(r_{ik}=1|x_i)=\frac{\mathcal{N}(x_i|\mu_k,\Sigma_k)\pi_k}{\sum^K_{j=1}\mathcal{N}(x_i|\mu_j,\Sigma_j)\pi_j} \]

  • M-step: estimate the parameters based on the expectation of the latent variables (re-estimate the GMM using the E-step results; in k-means this corresponds to recomputing the centroids) \[ \pi_k=\frac{\sum_i r'_{ik}}{N}\\ \mu_k=\frac{\sum_ir'_{ik}x_i}{\sum_ir'_{ik}}\\ \Sigma_k=\frac{\sum_ir'_{ik}(x_i-\mu_k)(x_i-\mu_k)^T}{\sum_ir'_{ik}} \]
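
A compact NumPy/SciPy sketch of these two steps (the initialization and the small regularizer added to the covariances are my own choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    """EM for a GMM: X is (N, d); returns weights pi, means mu, covariances Sigma."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities r'_{ik} = pi_k N(x_i | mu_k, Sigma_k) / sum_j ...
        r = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                             for k in range(K)])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi_k, mu_k, Sigma_k from the responsibilities
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma
```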

EM algorithm (More general case)

  1. The above solution represents a special case of a more general method called EM-algorithm
  2. It is widely used for learning the probabilistic models with latent variable inside its generative process.
  3. It iterates between an E-step and an M-step.
  4. We can theoretically prove that each step leads to a non-increase of the loss function
  5. The iterative process will converge to a local minimum

Its general form is as follows:

  • E-step: compute the posterior probability of the latent variable, \(P(z|x,\theta^t)\), where \(\theta^t\) denotes the current model parameters

  • M-step: update the model parameters based on the expected log-likelihood \[ \theta^{t+1}=\mathop{\arg\max}\limits_{\theta}E_{z|x,\theta^t}\log P(z,x|\theta)=\mathop{\arg\max}\limits_{\theta}\int P(z|x,\theta^t)\log(P(x|\theta,z)P(z|\theta))dz \]

Connection with K-means

  • E-step in GMM is a soft version of K-means. \(r_{ik}\in[0,1]\), instead of \(\{0,1\}\).
  • M-step in GMM estimates the probabilities and the covariance matrix of each cluster in addition to the means.
  • If all \(\pi_k\) are equal and \(\Sigma_k=\sigma^2I\), then as \(\sigma^2\rightarrow0\), \(r'_{ik}\rightarrow\{0,1\}\) and the two methods coincide.

Ch8 Kernel Method

Properties of the kernel function (with proof):

  • Positive semi-definite: \(c^TKc\ge0\) for any vector \(c\), where \(K\) is the kernel (Gram) matrix

Polynomial Kernel

The polynomial kernel is commonly used. Its form is: \[ K(x_i,x_j)=(x_i\cdot x_j+c)^d \]

Radial Basis Function Kernel (Gaussian kernel)

An even more commonly used kernel. Its form is: \[ K(x_i,x_j)=e^{-\gamma||x_i-x_j||^2}=e^{-\frac{||x_i-x_j||^2}{2\sigma^2}} \] where \(\gamma=\frac{1}{2\sigma^2}\).
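
Both kernels are one-liners; a small sketch (the constants c, d, and gamma are free hyperparameters chosen here for illustration):

```python
import numpy as np

def polynomial_kernel(xi, xj, c=1.0, d=3):
    """K(xi, xj) = (xi . xj + c)^d"""
    return (np.dot(xi, xj) + c) ** d

def rbf_kernel(xi, xj, gamma=0.5):
    """K(xi, xj) = exp(-gamma * ||xi - xj||^2), with gamma = 1 / (2 sigma^2)"""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def gram_matrix(X, kernel):
    """Kernel (Gram) matrix K with K[i, j] = kernel(x_i, x_j)."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
```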

Kernel in SVM

The original dual problem is: \[ \mathcal{L}=\sum_i\alpha_i-\frac 1 2 \sum_i\sum_j\alpha_i\alpha_jy_iy_j(x_i\cdot x_j) \] The new dual problem simply replaces \(x\) with \(\phi(x)\) (a mapping to a higher-dimensional space): \[ \mathcal{L}=\sum_i\alpha_i-\frac 1 2 \sum_i\sum_j\alpha_i\alpha_jy_iy_j\phi(x_i)\cdot\phi(x_j) \] We then choose a kernel function: the explicit mapping may be hard to compute, but the kernel can be evaluated directly \[ k(x_i,x_j)=\phi(x_i)\cdot\phi(x_j) \]

  • We stop caring about what the transform represents
  • We start caring about how well it classifies our data

Other detailed derivations are not given here; the focus is mainly on the application in CVX.

Kernel in K-means

Map the data space into some kernel space.

Hope this new space better allows the data to be separated!

The K-means procedure with kernels is as follows:

1. Assigning Points to Clusters

(original) \[ assign(x_i)=\mathop{\arg\min}\limits_{k}||x_i-\mu_k||^2,\quad ||x_i-\mu_k||^2=(x_i-\mu_k)^T(x_i-\mu_k) \] (with \(x_i\) replaced by \(\phi(x_i)\)) \[ assign(x_i)=\mathop{\arg\min}\limits_{k}||\phi(x_i)-\mu_k||^2,\quad ||\phi(x_i)-\mu_k||^2=(\phi(x_i)-\mu_k)^T(\phi(x_i)-\mu_k) \]

2. Defining Cluster Mid-Points

(original) \[ \mu_k=\frac 1 {N_k}\sum^N_{i=1}z_{ik}x_i \] \(z_{ik}=1\) if sample \(x_i\) belongs to cluster k, and \(z_{ik}=0\) otherwise

(with \(x_i\) replaced by \(\phi(x_i)\)) \[ \mu_k=\frac {\sum^N_{i=1}z_{ik}\phi(x_i)}{\sum^N_{i=1}z_{ik}} \]
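
Substituting this \(\mu_k\) into the assignment rule and expanding the squared norm shows that only kernel evaluations are needed; as a worked expansion (the \(k(x_i,x_i)\) term can even be dropped, since it does not depend on the cluster index k): \[ ||\phi(x_i)-\mu_k||^2=k(x_i,x_i)-\frac{2}{N_k}\sum_jz_{jk}k(x_i,x_j)+\frac{1}{N_k^2}\sum_{j,l}z_{jk}z_{lk}k(x_j,x_l),\quad N_k=\sum_jz_{jk} \]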

Kernel in Linear Regression

Original objective: \[ \sum_i||w^Tx_i-y_i||^2 \] With the feature map: \[ \sum_i||w^T\phi(x_i)-y_i||^2 \] Let \(w=\sum_j\alpha_j\phi(x_j)+o\) and substitute into the objective: \[ \sum_i||(\sum_j\alpha_j\phi(x_j)+o)^T\phi(x_i)-y_i||^2_2\\ =\sum_i||\sum_j\alpha_j\phi(x_i)\cdot \phi(x_j)+\phi(x_i)\cdot o-y_i||^2_2\\ =\sum_i||\sum_j\alpha_jK(x_i,x_j)+K(x_i,o)-y_i||^2_2\\ =\sum_i||\sum_j\alpha_jK(x_i,x_j)-y_i||^2_2 \]

o: an extra bias term that the feature map φ(x) may introduce

\(K(x_i,o)=0\):

  1. o does not correspond to any actual sample point; it only represents a possible extra bias term introduced by the kernel trick.
  2. Therefore the distance \(||x_i - o||\) between a sample \(x_i\) and o is undefined, and the kernel value cannot be computed directly.
  3. So it is simply defined to be 0.
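
Given the Gram matrix K, the coefficients \(\alpha\) can be computed by minimizing the objective above; a minimal sketch (I add a tiny ridge term for numerical stability, which is not in the notes):

```python
import numpy as np

def fit_kernel_regression(K, y, ridge=1e-8):
    """Minimise sum_i (sum_j alpha_j K(x_i, x_j) - y_i)^2 = ||K alpha - y||^2.
    The gradient condition 2 K (K alpha - y) = 0 is satisfied by K alpha = y
    (K is symmetric PSD); the tiny ridge keeps the solve well-conditioned."""
    n = len(y)
    return np.linalg.solve(K + ridge * np.eye(n), y)

def predict(K_new_train, alpha):
    """Prediction for a new point x: sum_j alpha_j K(x, x_j)."""
    return K_new_train @ alpha
```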

Kernel in PCA

Tasks:

  • Replace x with \(\phi(x)\)
  • Turn everything into dot products
  • Use \(K(x_i , x_j )\) a lot
  • Never use \(\phi(x)\)

Problems:

  • Centralising the data
  • Calculating the covariance matrix
  • Calculating the projection matrix

\[ Cov=X^TX,\quad Cov(k,m)=\sum_i^nX_i^kX^m_i \]

Positive semi-definiteness of the kernel matrix (cf. the kernel property above): \[ c^TKc=\sum^n_{i,j=1}c_ic_jk(x_i,x_j)\ge0 \]

Computing the covariance matrix: \[ \bar X\bar X^Tv=\lambda v \]

\(\bar X\): centralised data matrix (d x n)

d: number of feature dimensions; n: number of samples

\(\bar X\bar X^T\): the (d x d) covariance matrix

\(v\): an eigenvector of the covariance matrix

Via the kernel (Gram) matrix \(\bar X^T\bar X\): \[ \bar X^T\bar Xv'=\lambda v'\\ v=\bar Xv' \] Centring the kernel: \[ \bar k(x_i,x_j)=(\varphi(x_i)-\mu)^T(\varphi(x_j)-\mu)\\ =(\varphi(x_i)-\frac 1 N\sum_k\varphi(x_k))^T(\varphi(x_j)-\frac 1 N\sum_k\varphi(x_k))\\ =\varphi(x_i)\cdot\varphi(x_j)-\frac 1 N\sum_k\varphi(x_i)\cdot\varphi(x_k)-\frac 1 N\sum_k\varphi(x_j)\cdot\varphi(x_k)+\frac 1 {N^2}\sum_{k,l}\varphi(x_k)\cdot\varphi(x_l)\\ =k(x_i,x_j)-\frac 1 N \sum_kk(x_i,x_k)-\frac 1 N \sum_kk(x_j,x_k)+\frac 1 {N^2} \sum_{k,l}k(x_k,x_l)\\ K'=K-1_NK-K1_N+1_NK1_N \]

\(1_N\) here represents an N x N matrix filled with 1/N

Computing the projection matrix: originally \(y=P^T(x-\mu)\); now it becomes \[ y_k=p_k^T(\varphi(x_t)-\mu) \] Substituting \(v=\varphi(\bar X)v'\) (the centralised feature matrix times \(v'\)): \[ y_k=v'^T_k\varphi(\bar X)^T(\varphi(x_t)-\frac 1 N\sum_i\varphi(x_i))\\ =v'^T_k(\varphi(X)-\frac 1 N\sum_i\varphi(x_i))^T(\varphi(x_t)-\frac 1 N\sum_i\varphi(x_i)) \]
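
The centring formula \(K'=K-1_NK-K1_N+1_NK1_N\) and the eigenproblem on the kernel matrix translate directly into code; a minimal sketch that projects the training points only (function names are mine):

```python
import numpy as np

def kernel_pca(K, n_components):
    """Kernel PCA from a precomputed (uncentred) kernel matrix K of shape (N, N)."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)          # the matrix "1_N" filled with 1/N
    K_c = K - one_N @ K - K @ one_N + one_N @ K @ one_N   # centred kernel K'
    # Eigen-decomposition of the centred kernel matrix (eigh: symmetric matrices)
    eigvals, eigvecs = np.linalg.eigh(K_c)
    idx = np.argsort(eigvals)[::-1][:n_components]        # largest eigenvalues first
    eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]
    # Normalise v' so the implicit feature-space eigenvectors v have unit norm
    alphas = eigvecs / np.sqrt(np.maximum(eigvals, 1e-12))
    # Projections of the training samples: y = K' alpha
    return K_c @ alphas
```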

Ch9 Ensemble Method

When to use?

  • Individual decisions are cheap, so making many of them is easy
  • Making an expert is expensive or impossible
  • Your amateurs are independent of each other

Decision Tree

Entropy = \(\sum_i-p_i\log_2p_i\), where \(p_i\) is the probability of belonging to class i; the Gini function, \(1-\sum_i p_i^2\), is an alternative impurity measure.

The Gini coefficient of purity uses Gini impurity to measure how pure a sample set is in a classification problem. It is an indicator of the quality of the splits (it measures the purity of each node of the decision tree):

  • It reflects purity by measuring how concentrated the class distribution of the sample set is.
  • If all samples belong to a single class, the Gini value is 0, which corresponds to 100% purity.
  • Otherwise, when samples from different classes are mixed, a higher Gini value means lower purity.

The Flaw of the Decision Tree:

  • We have a series of decisions, but some of these decisions can only be conceived of as overfitting. (The tree recursively builds data-dependent decision boundaries, splitting the training set ever more finely.)
  • There are numerous places we could place decision boundaries. (The decision boundary can become overly complex: good results on the training set, poor results on the test set.)
  • How can we avoid these overfit decision boundaries? For example, by pruning or by limiting the maximum tree depth.

Random Forest

Definition: A random forest is a series of decision trees which are randomized. You then apply your new measurement to each decision tree, picking the aggregate decision by voting.

Common problems:

  • Some potential decision boundaries represent overfitting.
  • Some features are not representative of generalisation.

We use the bootstrapping method (random sampling with replacement) to obtain the sub-datasets (subsets).

Questions worth considering:

How many trees?

  • Too few trees:
    • Might interpret random variation as signal.
    • Biased towards individual trees.
  • Too many trees: Slow?

How many dimensions per tree? In practice, usually \(\sqrt{D}\) or \(\log(D)\).

Bagging doesn’t choose a random subset of dimensions

  • To reduce overfitting, bagging samples several parallel sub-datasets from the original training data and trains one base learner on each (the sampling uses bootstrapping).
  • However, when sampling the sub-datasets, bagging samples instances, not feature dimensions.
  • That is, each sub-dataset contains a subset of the samples with all of the original feature dimensions, rather than a randomly selected subset of the features; random forests add that feature-level randomness on top (see the sketch below).
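
A minimal sketch of the two sources of randomness, bootstrapped samples plus a random subset of roughly \(\sqrt{D}\) features per tree (using scikit-learn trees for brevity; that choice, and per-tree rather than per-split feature sampling, are my own simplifications):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, n_trees=50, seed=0):
    """Each tree: bootstrap the rows (bagging) and keep ~sqrt(D) random columns."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    m = max(1, int(np.sqrt(D)))                      # features per tree
    forest = []
    for _ in range(n_trees):
        rows = rng.choice(n, size=n, replace=True)   # bootstrap sample (with replacement)
        cols = rng.choice(D, size=m, replace=False)  # random feature subset
        tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, X):
    """Aggregate decision by majority vote (assumes non-negative integer class labels)."""
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest])
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```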

Boosting

Idea: we build a classifier, but it classifies some samples incorrectly; we can then train further so the model performs better on those incorrectly labelled samples (learning from mistakes).

  • We make decisions based on a weighted decision of minimal ‘weak classifiers’ (stumps)
  • After we have built our ‘one stump’ we can judge how good it is, by the number of instances it gets wrong.

\[ \alpha=\frac1 2\ln(\frac{1-TotalError}{TotalError}) \]

\(\alpha\) represents how much each decision tree (stump) contributes to the final decision. We use it to update the sample weights: \[ w_{new}=w_{old}\times e^{\alpha} \] for the misclassified samples (and \(w_{new}=w_{old}\times e^{-\alpha}\) for the correctly classified ones).

Adaboost

Objective function: \[ \min_{\{\alpha_m,\theta_m\}}\sum_ie^{-y_ih_m(x_i)}\\ h_m(x)=\alpha_1h(x;\theta_1)+\dots+\alpha_mh(x;\theta_m) \] Loss function: \[ e^{-yh_m(x)} \]
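
A minimal AdaBoost sketch with decision stumps as the weak learners \(h(x;\theta_m)\); labels are assumed to be in \(\{-1,+1\}\), and the use of scikit-learn stumps is my own choice:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=20):
    """AdaBoost: y must be in {-1, +1}. Returns the stumps and their alphas."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()      # weighted total error
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        # Boost the weights of misclassified samples, shrink the others
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict_adaboost(stumps, alphas, X):
    """Sign of the weighted vote h_m(x) = sum_m alpha_m h(x; theta_m)."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```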

There are other boosting methods as well, such as XGBoost and Gradient Boosting.

What do we have in common?

  • We are weighting our bad decisions higher (i.e., boosting)
  • We are using ensemble learning

Ch10 Deep Learning

About the loss:

  1. Loss is some measure of the error between the predicted and target values
  2. A metric for determining how well a model has trained
  3. Lower loss/cost/error => better model => better learned parameters

The most commonly used loss is MSE (Mean Squared Error): \[ MSE(X,h_{\theta})=\frac1 m\sum^m_{i=1}(\theta^Tx_i-y_i)^2 \]

Global and local minima: the global minimum is the smallest error out of all possible parameter configurations; a local minimum is the smallest error in some region of the parameter space.

Parameters are usually updated with gradient descent: \[ \theta_{n+1}=\theta_n-\eta\nabla_{\theta}MSE(\theta_n) \]
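
A minimal sketch of batch gradient descent on the MSE for a linear model \(\theta^Tx\) (the learning rate and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def gradient_descent_mse(X, y, eta=0.1, n_steps=1000):
    """Minimise MSE(theta) = (1/m) sum_i (theta^T x_i - y_i)^2 by gradient descent."""
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_steps):
        grad = (2.0 / m) * X.T @ (X @ theta - y)   # gradient of the MSE
        theta = theta - eta * grad                  # theta_{n+1} = theta_n - eta * grad
    return theta
```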

Learning Rate(\(\eta\))

\(\eta\) is a Hyperparameter!

Choice of learning rate is very important:

  • Too small… takes forever to train, can easily get stuck in tiny local minima
  • Too big… can overstep or become unstable

Activation functions: commonly used ones are sigmoid, ReLU, and tanh.

Softmax: turns the scores into a probability distribution \[ softmax:\hat p_k=\frac{\exp(s_k(x))}{\sum^K_{i=1}\exp(s_i(x))}\\ cost\ function\ (cross\ entropy):\mathcal{J}(\theta)=-\frac 1 m \sum^m_{i=1}\sum^K_{k=1}y_k^i\log(\hat p_k^i) \]
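
The two formulas in NumPy (the max-subtraction is the usual numerical-stability trick, not in the notes, and changes nothing mathematically):

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: scores is (m, K); returns probabilities p_hat of shape (m, K)."""
    z = scores - scores.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(p_hat, y_onehot):
    """J(theta) = -(1/m) sum_i sum_k y_k^i log(p_hat_k^i)."""
    m = p_hat.shape[0]
    return -np.sum(y_onehot * np.log(p_hat + 1e-12)) / m
```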

Neural Networks

Terms:

  • Iterations (one per 'batch')
  • Epochs (one per ‘all samples’)
  • Batches (for images, a batch can amount to megapixels of data)

CNN

The steps are as follows:

1. Convolutional Layer (padding is used to fill in the borders)

  • Uses convolution operations to extract features from the image, such as edges and colours.
  • The parameters are the filters (kernels) and biases.

2. Pooling Layer

  • Downsamples regions of the feature map, reducing the spatial dimensions, adding invariance to small translations, and helping control overfitting.
  • Max pooling and average pooling are the most common.

3. Activation Layer

  • Adds a nonlinearity after the convolutional and fully connected layers, e.g. ReLU, sigmoid, tanh.

4. Fully Connected Layer

  • Flattens the features of the previous layer into a one-dimensional vector and multiplies them by this layer's weights to perform classification or regression.

5. Softmax Layer

  • Applies softmax normalisation to the output of the fully connected layer, producing probability scores for classification.
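
A small PyTorch sketch that stacks these layers in the order just described, for 28x28 single-channel images and 10 classes (the exact layer sizes are illustrative, not from the notes):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Conv -> ReLU -> Pool, twice, then a fully connected layer and softmax."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution with padding
            nn.ReLU(),                                    # activation
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(16 * 7 * 7, n_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)              # flatten to a 1-D feature vector per sample
        logits = self.classifier(x)
        return torch.softmax(logits, dim=1)  # probability scores per class

# usage: probs = SmallCNN()(torch.randn(4, 1, 28, 28))  # a batch of 4 images
```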