Study Notes | Life Is Ordinary Yet Full of Surprises

Feature Selection for K-means

2021.09.09

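Clustering methods can run into trouble on high-dimensional datasets. Feature selection is a well-known technique in supervised learning, but it is used far less often with unsupervised methods such as clustering. Here we use a relatively simple greedy algorithm to perform variable selection on the Europe dataset from Kaggle.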

Algorithm steps

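  1. Make sure the variables are numeric and scaled, for example with StandardScaler() and its fit_transform() method.
  2. Choose the maximum number of variables to keep (maxvars) and the minimum and maximum number of clusters (kmin and kmax), and create an empty list: selected_variables.
  3. Loop from kmin to kmax. For each number of clusters, try each variable in turn and use K-means to record the silhouette value of every combination of variable and number of clusters.
  4. Pick the variable that gives the highest silhouette value, add it to selected_variables and remove it from the list of variables still to be tested.
  5. Repeat steps 3 and 4, starting from the selected_variables list and adding each remaining variable in turn, until a stopping criterion is met (here, the number of variables to keep, maxvars).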
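
The greedy steps above can be sketched independently of K-means. The following is a minimal illustration rather than the code used in this post: the function name greedy_forward_selection, the toy scorer and the feature names are all made up for the example, and the score callable stands in for "fit K-means on these columns and return the silhouette value".

```python
from typing import Callable, List, Sequence


def greedy_forward_selection(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    maxvars: int,
) -> List[str]:
    """Greedily add the feature that maximises score(selected + [feature])."""
    remaining = list(candidates)  # copy, so the caller's list is not mutated
    selected: List[str] = []
    while len(selected) < maxvars and remaining:
        # Score every remaining feature when added to the current selection
        results = [score(selected + [col]) for col in remaining]
        # Index of the first maximum, like np.argmax
        best_index = max(range(len(results)), key=results.__getitem__)
        selected.append(remaining.pop(best_index))
    return selected


# Hypothetical scorer: pretends each feature contributes a fixed amount.
# In the real procedure this would be the silhouette value of a K-means fit.
toy_weights = {"population": 0.5, "gdp": 0.3, "crime": 0.4, "area": 0.1}


def toy_score(features: List[str]) -> float:
    return sum(toy_weights[f] for f in features)


chosen = greedy_forward_selection(list(toy_weights), toy_score, maxvars=3)
print(chosen)
```

Because the function works on a copy of the candidate list, it can be called once per value of k without the candidates shrinking from one call to the next.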

Implementation

Define and initialize some variables

maxvars=3
kmin=2
kmax=8
kmeans_kwargs = {"init": "random","n_init": 20,"max_iter": 1000,"random_state": 1984}
cut_off=0.5
# We also define a cols variable containing a list of all features:
cols=list(df.columns)
# We also set a list and a dictionary to store the silhouette values
# for each number of clusters tested so we can choose the k value
# maximising the silhouette score, with its corresponding features
results_for_each_k=[]
vars_for_each_k={}

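Next, create three nested loops. The outer loop iterates over the values of k (the number of clusters). Inside it, a while loop checks whether the number of retained variables is still below the threshold set by maxvars. The selected_variables list holds the names of the retained features, and the results list holds the silhouette value for each candidate variable.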

for k in range(kmin,kmax+1):
    print(k)
    # Reset the candidate list for each k, because the selected
    # features are removed from cols during the search
    cols=list(df.columns)
    selected_variables=[]
    while(len(selected_variables)<maxvars):
        results=[]

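The inner loop goes through all the features one by one, adds each to the variables already selected (if any) and evaluates the silhouette value. It then picks the variable with the highest value and adds it to the selected_variables list.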

        for col in cols:
            scols=[]
            scols.extend(selected_variables)
            scols.append(col)
            kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
            kmeans.fit(df[scols])
            results.append(silhouette_score(df[scols], kmeans.predict(df[scols])))
        # We identify the best variable, add it to our list and remove it
        # from the list of variables to be tested on the next iteration
        selected_var=cols[np.argmax(results)]
        selected_variables.append(selected_var)
        cols.remove(selected_var)
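
To make the selection criterion concrete, here is a small pure-Python sketch of the mean silhouette value with Euclidean distance. It is only an illustration of the definition, not a replacement for scikit-learn's silhouette_score: the function name silhouette and the toy points are made up for the example, and it assumes at least two clusters.

```python
from math import dist
from statistics import mean


def silhouette(points, labels):
    """Mean silhouette coefficient (Euclidean distance, at least 2 clusters)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a: mean distance to the other points of the same cluster
        same = [dist(p, q) for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:
            scores.append(0.0)  # convention for single-point clusters
            continue
        a = mean(same)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            mean([dist(p, q) for q, lab in zip(points, labels) if lab == other])
            for other in set(labels) if other != own
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette(points, [0, 0, 1, 1])  # two tight, well separated groups
bad = silhouette(points, [0, 1, 0, 1])   # groups that mix near and far points
print(good, bad)
```

Values close to 1 indicate compact, well separated clusters, while negative values suggest points sit closer to another cluster than to their own. That is why the loop keeps the variable with the highest value.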

The variable list and score for that particular value of k can then be updated inside the outer loop.

    results_for_each_k.append(max(results))
    vars_for_each_k[k]=selected_variables

Finally, once the three loops have run, we can identify the best combination of k and variables, fit the model and plot it.

best_k=np.argmax(results_for_each_k)+kmin
selected_variables=vars_for_each_k[best_k]
kmeans = KMeans(n_clusters=best_k, **kmeans_kwargs)
kmeans.fit(df[selected_variables])
clusters=kmeans.predict(df[selected_variables])
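
The + kmin offset is needed because results_for_each_k is indexed from 0 while k starts at kmin. A tiny sketch with made-up scores (the numbers below are hypothetical, not results from the Europe dataset) shows the mapping in plain Python:

```python
kmin, kmax = 2, 8

# Hypothetical silhouette scores, one per k in range(kmin, kmax + 1)
results_for_each_k = [0.41, 0.56, 0.48, 0.39, 0.35, 0.33, 0.30]

# Index of the first maximum, equivalent to np.argmax on this list
best_index = max(range(len(results_for_each_k)), key=results_for_each_k.__getitem__)

# Index 0 corresponds to k = kmin, so shift by kmin to recover k
best_k = best_index + kmin
print(best_k)
```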

Results

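The final result with 2 groups of countries, separated by population, wealth and crime rate:
[Figure: image-1690612819161]
Choosing 3 clusters leads to a different selection:
[Figure: image-1690612885468]
Some example countries from each group: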

Cluster 1: Iceland, Switzerland, Belgium, Germany, Luxembourg, Netherlands, Austria and United Kingdom

Cluster 2: Greece, Spain, France, Croatia, Italy, Cyprus, Latvia, Lithuania, Hungary, Malta, Poland, Portugal

Cluster 3: Norway, Denmark, Finland and Sweden

Full code

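import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: df is a DataFrame of numeric features loaded beforehand.
# Keep the unscaled data as df_ (used for plotting below) and store the
# scaled copy in df as a DataFrame, so df.columns and df[scols] keep working
# (fit_transform alone returns a NumPy array without column names)
df_=df.copy()
scaler = StandardScaler()
df=pd.DataFrame(scaler.fit_transform(df_), columns=df_.columns, index=df_.index)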

kmeans_kwargs = {"init": "random","n_init": 20,"max_iter": 1000,"random_state": 1984}
cut_off=0.5
maxvars=3
kmin=2
kmax=8

cols=list(df.columns)
results_for_each_k=[]
vars_for_each_k={}

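for k in range(kmin,kmax+1):
    # Reset the candidate list for each k, because the selected
    # features are removed from cols during the search
    cols=list(df.columns)
    selected_variables=[]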
    while(len(selected_variables)<maxvars):
        results=[]
        for col in cols:
            scols=[]
            scols.extend(selected_variables)
            scols.append(col) 
            kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
            kmeans.fit(df[scols])
            results.append(silhouette_score(df[scols], kmeans.predict(df[scols])))
        selected_var=cols[np.argmax(results)]
        selected_variables.append(selected_var)
        cols.remove(selected_var)
    results_for_each_k.append(max(results))
    vars_for_each_k[k]=selected_variables


best_k=np.argmax(results_for_each_k)+kmin
#you can also force a value for k
#best_k=3
selected_variables=vars_for_each_k[best_k]
kmeans = KMeans(n_clusters=best_k, **kmeans_kwargs)
kmeans.fit(df[selected_variables])
clusters=kmeans.predict(df[selected_variables])

%matplotlib inline

fig = plt.figure(figsize=(15,15))
#plt.rcParams['font.size'] = 22
ax = plt.axes(projection="3d")
# Match each axis to the label set below: x -> [0], y -> [1], z -> [2]
x_points = df_[selected_variables[0]]
y_points = df_[selected_variables[1]]
z_points = df_[selected_variables[2]]
f1=ax.scatter3D(x_points, y_points, z_points, c=clusters,cmap='Accent',s=300);

ax.set_xlabel(selected_variables[0],fontsize = 20)
ax.set_ylabel(selected_variables[1],fontsize = 20)
ax.set_zlabel(selected_variables[2],fontsize = 20)
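# Build one legend entry per cluster from the scatter's colour mapping
ax.legend(*f1.legend_elements(), title="Cluster")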

plt.title('KMeans used on the Europe Datasets',fontsize = 24)
plt.show()