From the results, we can see that our kNN implementation is consistent with the kNN provided by sklearn.
We could further improve accuracy by tuning the value of k and by changing the strategy for finding similar samples (for example, replacing the Euclidean distance with the simple matching coefficient or the Jaccard coefficient). A quick tuning sketch follows.
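As a rough illustration of such tuning, here is a minimal sketch (not the report's code) that grids over k and the distance metric with sklearn's KNeighborsClassifier, assuming the same X_train, X_test, y_train, y_test split used earlier; the matching and Jaccard coefficients would additionally require binarized features:

# Hypothetical tuning sketch: assumes X_train, X_test, y_train, y_test
# already exist from the earlier split.
from sklearn.neighbors import KNeighborsClassifier

best = (0.0, None, None)
for k in (1, 3, 5, 7, 9):
    for metric in ('euclidean', 'manhattan'):
        clf = KNeighborsClassifier(n_neighbors=k, metric=metric)
        clf.fit(X_train, y_train)
        acc = clf.score(X_test, y_test)
        if acc > best[0]:
            best = (acc, k, metric)

print(best)  # (accuracy, k, metric) of the best combination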
Naive_Bayes
Import Dataset
# load datasets
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
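The loading step itself is not shown above; the following is a minimal sketch of what it presumably looks like, assuming the standard breast cancer dataset from sklearn (which has the 30 continuous attributes and benign/malignant labels discussed below) and a conventional 70/30 split:

# Hypothetical loading step: the dataset choice and split ratio are assumptions.
cancer = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=0)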
Since all 30 attributes are continuous, we need to discretize each attribute's range into intervals when using Naive Bayes and estimate the probability of an instance falling into each interval. Here I split each attribute into two intervals at its mean value.
Divide continuous attributes into intervals
def distribution(X_train, y_train):
    '''Partition each attribute's range first, then count the instances in each interval.'''
    # =============== interval partitioning =============== #
    attributes_max_min_mean = []  # record the max, min, and mean of every attribute
    for i in range(len(X_train[0])):  # loop over attributes
        # section = [max, min, mean]
        section = [X_train[0][i], X_train[0][i], 0]
        for instance in X_train:  # loop over training instances
            if instance[i] > section[0]:
                section[0] = instance[i]
            if instance[i] < section[1]:
                section[1] = instance[i]
            section[2] += instance[i]
        section[2] /= len(X_train)
        attributes_max_min_mean.append(section)
    # ===== count how many instances of each class fall in each interval ===== #
    instance_distribution = []
    for i in range(len(X_train[0])):  # loop over attributes
        smaller_benign = 0
        larger_benign = 0
        smaller_malignant = 0
        larger_malignant = 0
        for j in range(len(X_train)):  # loop over training instances
            if X_train[j][i] > attributes_max_min_mean[i][2]:
                if y_train[j] == 1:
                    larger_benign += 1
                else:
                    larger_malignant += 1
            elif y_train[j] == 1:
                smaller_benign += 1
            else:
                smaller_malignant += 1
        instance_distribution.append([smaller_benign, larger_benign,
                                      smaller_malignant, larger_malignant])
    return instance_distribution, attributes_max_min_mean
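The function above only produces the interval counts. The following hypothetical sketch (the predict function and the n_benign/n_malignant names are mine, not the report's) shows how those counts can be turned into a prediction by multiplying the per-attribute conditional probabilities under the usual independence assumption, with Laplace smoothing to avoid zero probabilities:

def predict(x, instance_distribution, attributes_max_min_mean,
            n_benign, n_malignant):
    # Start from the class priors.
    p_benign = n_benign / (n_benign + n_malignant)
    p_malignant = n_malignant / (n_benign + n_malignant)
    for i in range(len(x)):
        smaller_b, larger_b, smaller_m, larger_m = instance_distribution[i]
        if x[i] > attributes_max_min_mean[i][2]:  # falls in the "larger" interval
            # Laplace smoothing: +1 per count, +2 per class (two intervals)
            p_benign *= (larger_b + 1) / (n_benign + 2)
            p_malignant *= (larger_m + 1) / (n_malignant + 2)
        else:
            p_benign *= (smaller_b + 1) / (n_benign + 2)
            p_malignant *= (smaller_m + 1) / (n_malignant + 2)
    return 1 if p_benign >= p_malignant else 0

Usage, assuming the split from earlier:

instance_distribution, sections = distribution(X_train, y_train)
n_benign = sum(1 for y in y_train if y == 1)
n_malignant = len(y_train) - n_benign
y_pred = [predict(x, instance_distribution, sections, n_benign, n_malignant)
          for x in X_test]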
The experimental results show that the Naive Bayes we implemented performs better than the Naive Bayes provided by sklearn.
We could further improve accuracy by trying different interval partitions for each attribute. A likely reason the Naive Bayes provided by sklearn does not perform as well here is that the interval partitioning used to convert the continuous values into discrete ones is not well matched to the data.
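One low-effort way to experiment with different partitions (a hypothetical sketch, not part of the report) is to combine sklearn's KBinsDiscretizer with CategoricalNB and vary the number of bins per attribute:

# Hypothetical experiment: vary the number of bins and watch the accuracy.
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer

for n_bins in (2, 3, 5, 8):
    disc = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='quantile')
    Xd_train = disc.fit_transform(X_train)
    Xd_test = disc.transform(X_test)
    clf = CategoricalNB().fit(Xd_train, y_train)
    print(n_bins, clf.score(Xd_test, y_test))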