Fisher, SVM, K-Means and their optimization
Implement Fisher, SVM and K-Means algorithms and optimize them.
Fisher algorithm and its optimization
Implement the Fisher algorithm and validate it with randomly generated data.
Generating data
from sklearn.datasets import make_multilabel_classification
Parameters of `make_multilabel_classification`:
- `n_samples`: the number of samples.
- `n_features`: the number of features.
- `n_labels`: the average number of labels per sample.
- `n_classes`: the number of classes.
- `random_state`: sets the random seed so the same data is generated every time.
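A minimal usage sketch; the parameter values here are assumptions, not necessarily the original notebook's settings:

```python
from sklearn.datasets import make_multilabel_classification

# Generate a small two-feature, two-class data set (values are assumptions).
X, y = make_multilabel_classification(n_samples=100, n_features=2,
                                      n_labels=1, n_classes=2,
                                      random_state=0)
print(X.shape, y.shape)  # (100, 2) (100, 2)
```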
`enumerate()` combines an iterable (such as a list, tuple, or string) into an indexed sequence, yielding each element together with its index; it is typically used in a `for` loop.
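For example:

```python
seasons = ['Spring', 'Summer', 'Fall', 'Winter']
for i, season in enumerate(seasons):
    print(i, season)  # 0 Spring / 1 Summer / 2 Fall / 3 Winter
```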
The implementation of Fisher
def cal_cov_and_avg(samples):
- `np.mean`: computes the average along the specified axis.
- `np.zeros`: creates an array of the given shape, filled with zeros.
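The cell above is truncated; a minimal sketch of a plausible completion, using only `np.mean` and `np.zeros` as described (the `fisher` helper and its variable names are assumptions):

```python
import numpy as np

def cal_cov_and_avg(samples):
    """Return the scatter (covariance) matrix and mean vector of one class."""
    u = np.mean(samples, axis=0)                     # class mean
    cov_m = np.zeros((samples.shape[1], samples.shape[1]))
    for s in samples:
        t = (s - u).reshape(-1, 1)                   # deviation as column vector
        cov_m += t @ t.T                             # accumulate outer products
    return cov_m, u

def fisher(c_1, c_2):
    """Fisher direction w = S_w^{-1} (u_1 - u_2)."""
    cov_1, u1 = cal_cov_and_avg(c_1)
    cov_2, u2 = cal_cov_and_avg(c_2)
    s_w = cov_1 + cov_2                              # within-class scatter
    return np.linalg.solve(s_w, u1 - u2)             # avoid explicit inverse
```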
Determine the category
def judge(sample, w, c_1, c_2):
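This cell is also truncated; a plausible sketch (assuming `w` comes from `fisher` above, with `np` imported) projects the sample onto `w` and assigns it to the class whose projected mean is nearer:

```python
def judge(sample, w, c_1, c_2):
    """Return True if the sample is closer to class 1's projected center."""
    center_1 = w @ np.mean(c_1, axis=0)   # projected mean of class 1
    center_2 = w @ np.mean(c_2, axis=0)   # projected mean of class 2
    pos = w @ sample                      # projection of the sample onto w
    return abs(pos - center_1) < abs(pos - center_2)
```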
Plot
import matplotlib.pyplot as plt
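The plotting cell was truncated; a minimal sketch, assuming `c_1`, `c_2`, and `w` from the cells above:

```python
import matplotlib.pyplot as plt
import numpy as np

plt.scatter(c_1[:, 0], c_1[:, 1], label='class 1')
plt.scatter(c_2[:, 0], c_2[:, 1], label='class 2')
u = np.mean(np.vstack([c_1, c_2]), axis=0)          # overall mean
t = np.linspace(-3, 3, 2)
plt.plot(u[0] + t * w[0], u[1] + t * w[1], 'k--', label='w direction')
plt.legend()
plt.show()
```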
The detailed derivation of the SVM optimization dual problem
Reference: https://zhuanlan.zhihu.com/p/49331510
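In brief (standard hard-margin SVM; notation is ours, see the reference for the full derivation): the primal problem is

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\ i = 1,\dots,n.$$

Introducing multipliers $\alpha_i \ge 0$ gives the Lagrangian

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i} \alpha_i\bigl[y_i(w^\top x_i + b) - 1\bigr].$$

Setting $\partial L/\partial w = 0$ and $\partial L/\partial b = 0$ yields $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$; substituting back gives the dual problem

$$\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_i \alpha_i y_i = 0.$$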
The implementation of SVM
Reference: https://www.jb51.net/article/131580.htm
from sklearn import svm
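The cell is truncated; a minimal reconstruction consistent with the printed output below. The data-generation step is an assumption; `a` is the slope of the separating line, `-w[0] / w[1]`:

```python
import numpy as np
from sklearn import svm

# Hypothetical training data: two linearly separable point clouds.
rng = np.random.RandomState(0)
X = np.r_[rng.randn(20, 2) - [2, 2], rng.randn(20, 2) + [2, 2]]
y = [0] * 20 + [1] * 20

clf = svm.SVC(kernel='linear')
clf.fit(X, y)

w = clf.coef_[0]
a = -w[0] / w[1]              # slope of the separating line
print('W:', w)
print('a:', a)
print('support_vectors_:', clf.support_vectors_)
print('clf.coef_:', clf.coef_)
```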
Output:

```
W: [0.95070185 1.15607502]
a: -0.8223530762163854
support_vectors_: [[-0.51174781 -0.10411082]
 [ 0.16323595 -0.66347205]
 [ 2.39904635 -0.77259276]
 [ 0.66574153  0.65328249]
 [-0.25556423  0.97749316]]
clf.coef_: [[0.95070185 1.15607502]]
```
The implementation of k-means
Reference1: https://cloud.tencent.com/developer/article/1465020
Reference2: https://blog.csdn.net/weixin_42029738/article/details/81978038
'''K-Means clustering algorithms'''
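The implementation cell was truncated; a minimal sketch of plain k-means, alternating nearest-center assignment and mean updates (function and variable names are assumptions):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means: alternate nearest-center assignment and mean update."""
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # random initial centers
    for _ in range(max_iter):
        # assign each point to its nearest center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # recompute each center as the mean of its cluster (keep empty clusters)
        new_centers = np.array([X[labels == i].mean(axis=0)
                                if np.any(labels == i) else centers[i]
                                for i in range(k)])
        if np.allclose(new_centers, centers):            # converged
            break
        centers = new_centers
    return centers, labels
```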
The optimization of k-means
1. `k-means++` (changes the way the initial center points are chosen)
2. `elkan K-Means` (reduces unnecessary distance calculations)
3. `ISODATA` algorithm (adjusts the number of cluster centers K during the run, according to the actual situation)
4. `Mini Batch k-means` algorithm (trains on sampled subsets, trading a little accuracy for much faster convergence)
k-means++
Reference1: https://blog.csdn.net/github_39261590/article/details/76910689
Reference2: https://www.cnblogs.com/yszd/p/9672885.html
The basic idea behind selecting the initial seeds is that the initial cluster centers should be as far away from each other as possible.
Algorithm steps:
1. Randomly select one point from the input data set as the first cluster center.
2. For each point x in the data set, compute the distance D(x) between it and the nearest already-selected cluster center.
3. Select a new data point as the next cluster center, with points having larger D(x) given a higher probability of being chosen.
4. Repeat steps 2 and 3 until K cluster centers have been selected.
5. Run the standard k-means algorithm with these K initial cluster centers.
import numpy as np
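The cell is truncated; a minimal sketch of the seeding step described above (the function name is an assumption; the canonical k-means++ weighting uses D(x)^2):

```python
import numpy as np

def kpp_centers(X, k, seed=0):
    """k-means++ seeding: next center drawn with probability proportional to D(x)^2."""
    rng = np.random.RandomState(seed)
    centers = [X[rng.randint(len(X))]]        # step 1: first center at random
    while len(centers) < k:
        # step 2: squared distance to the nearest already-chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # step 3: larger D(x) -> higher selection probability
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)                  # feed these to standard k-means
```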
elkan k-means
Reference: https://blog.csdn.net/u014465639/article/details/71342072
Elkan k-means uses the triangle inequality, that the sum of two sides is greater than or equal to the third side and the difference of two sides is less than or equal to the third side, to reduce distance calculations.

The first rule concerns a sample point x and two centroids j1 and j2. If we pre-compute the distance D(j1, j2) between the two centroids, then whenever 2D(x, j1) ≤ D(j1, j2) we immediately know that D(x, j1) ≤ D(x, j2), so D(x, j2) never needs to be computed, saving one calculation.

The second rule, also for a sample point x and two centroids, gives the bound D(x, j2) ≥ max{0, D(x, j1) - D(j1, j2)}, which likewise follows from the triangle property.

Using these two rules, elkan k-means iterates much faster than traditional k-means. However, if the sample features are sparse or contain missing values, some distances cannot be computed, so this method cannot be used.
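A small sketch of how rule 1 prunes work inside the assignment step (names are assumptions; `center_dist[i, j]` holds the pre-computed distance between centers i and j):

```python
import numpy as np

def nearest_center_elkan(x, centers, center_dist):
    """Nearest-center search that skips distances ruled out by rule 1."""
    best = 0
    best_d = np.linalg.norm(x - centers[0])
    for j in range(1, len(centers)):
        # rule 1: if 2*D(x, best) <= D(best, j), then D(x, j) >= D(x, best)
        if 2 * best_d <= center_dist[best, j]:
            continue                          # j cannot be closer; skip it
        d = np.linalg.norm(x - centers[j])
        if d < best_d:
            best, best_d = j, d
    return best, best_d
```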
ISODATA
Reference1: https://blog.csdn.net/houston11235/article/details/8511379
Reference2: https://www.cnblogs.com/huadongw/p/4101422.html
One disadvantage of k-means is that the number of clusters must be specified in advance, which sometimes does not work very well. ISODATA removes this restriction by allowing the number of clusters to change: by setting conditions for splitting and merging categories, the number of clusters can be automatically increased or decreased during clustering. The difficulty, of course, is that these conditions are not always easy to specify. Even so, ISODATA still gives fairly reliable results in many cases.
Mini Batch k-means
Reference1: https://cloud.tencent.com/developer/article/1465020
The Mini Batch KMeans algorithm is a clustering model that preserves clustering accuracy as far as possible while greatly reducing computation time. It optimizes the same objective function but trains on small, randomly selected subsets of the data: each training step uses a randomly drawn data subset, which greatly reduces the computation and convergence time of KMeans, at the cost of results slightly worse than the standard algorithm. When clustering more than about ten thousand samples, it is worth considering Mini Batch KMeans.
'''Comparison of the K-Means and MiniBatchKMeans clustering algorithms'''
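The comparison cell was truncated; a minimal sketch in the spirit of the referenced comparison (the dataset and parameters are assumptions):

```python
import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Hypothetical data set: three Gaussian blobs.
X, _ = make_blobs(n_samples=30000, centers=3, random_state=0)

t0 = time.time()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
t_km = time.time() - t0

t0 = time.time()
mbk = MiniBatchKMeans(n_clusters=3, batch_size=1000, n_init=10,
                      random_state=0).fit(X)
t_mbk = time.time() - t0

# Mini-batch training is typically much faster, with slightly higher inertia.
print('KMeans:          %.3fs  inertia=%.0f' % (t_km, km.inertia_))
print('MiniBatchKMeans: %.3fs  inertia=%.0f' % (t_mbk, mbk.inertia_))
```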