
## Unsupervised Learning: Cluster Analysis ③

### Determining the Optimal Number of Clusters

• #### The NbClust package

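NbClust is a package used in *R in Action*. It defines dozens of evaluation indices and iterates the number of clusters over a user-specified range (2 to 15 here). Each index reports the number of clusters at which it reaches its optimum, and the number of clusters supported by the most indices is taken as the best one.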

``````
library(gclus)

data(wine)

head(wine)

dataset <- wine[, -1] # drop the class label
dataset <- scale(dataset)

library(NbClust)

set.seed(1234) # method is kmeans, so results may differ between runs without a fixed seed
nb_clust <- NbClust(dataset, distance = "euclidean",
                    min.nc = 2, max.nc = 15, method = "kmeans",
                    index = "alllong", alphaBeale = 0.1)

barplot(table(nb_clust$Best.nc[1, ]),
        xlab = "Number of clusters", ylab = "Number of supporting indices")
``````
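The bar plot shows the vote visually, but the winning value can also be read off in code. The sketch below illustrates the majority vote described above. `majority_k` is a hypothetical helper, not part of NbClust, and `toy_best_nc` is made-up data that only mimics the shape of `nb_clust$Best.nc` (one column per index, first row holding the proposed number of clusters). Dropping entries of 0 or NA assumes that this is how indices without a numeric recommendation are reported.

```r
# Hypothetical helper: return the cluster count backed by the most indices.
# best_nc: matrix shaped like nb_clust$Best.nc (first row = proposed k per index).
majority_k <- function(best_nc) {
  proposed <- best_nc[1, ]
  # Assumption: 0 or NA means the index gave no numeric recommendation.
  proposed <- proposed[!is.na(proposed) & proposed > 0]
  votes <- table(proposed)
  # which.max returns the first maximum, so ties go to the smaller k.
  as.integer(names(votes)[which.max(votes)])
}

# Made-up votes for illustration only (not real NbClust output).
toy_best_nc <- rbind(
  Number_clusters = c(KL = 3, CH = 3, Hubert = 0, Silhouette = 2, Gap = 3),
  Value_Index     = rep(NA_real_, 5)
)

majority_k(toy_best_nc)  # 3 for this toy matrix
```

With the object created in the block above, `majority_k(nb_clust$Best.nc)` would give the corresponding value for the wine data.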

• #### SSE (within-cluster sum of squared errors)

``````
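wssplot <- function(data, nc = 15, seed = 1234) {
  # SSE for a single cluster: total sum of squares around the column means
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    # SSE for i clusters: sum of the within-cluster sums of squares
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}

wssplot(dataset)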
``````
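The first element of `wss` in `wssplot` may look unrelated to k-means, but it is the SSE for a single cluster: with one cluster the centroid is the vector of column means, so the SSE equals `(n - 1)` times the sum of the column variances. The sketch below checks this identity. It uses the built-in `iris` measurements as a stand-in for the wine data, which is an assumption made so that the example runs without extra packages.

```r
# Stand-in data (assumption): the four numeric iris columns, standardized
# the same way as the wine data in the text.
x <- scale(iris[, 1:4])

# Value used by wssplot() for k = 1.
wss_k1 <- (nrow(x) - 1) * sum(apply(x, 2, var))

# Same quantity from the definition: squared distances to the single centroid.
centroid <- colMeans(x)
sse_manual <- sum(sweep(x, 2, centroid)^2)

set.seed(1234)
km <- kmeans(x, centers = 3)

all.equal(wss_k1, sse_manual)                  # TRUE
all.equal(wss_k1, km$totss)                    # TRUE: kmeans reports the same total
all.equal(sum(km$withinss), km$tot.withinss)   # TRUE: SSE for k = 3
```

The SSE can only shrink as clusters are added, which is why the plot is read for the "elbow" where the decrease levels off rather than for its minimum.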

• #### The factoextra package

``````
library(factoextra)
library(ggplot2)
set.seed(1234)
fviz_nbclust(dataset, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2) # dashed reference line at k = 3
``````

``````
set.seed(1234) # fix the seed so the k-means result is reproducible
km.res <- kmeans(dataset, 3)
fviz_cluster(km.res, data = dataset)
``````
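Because the class label was removed before clustering, one way to sanity-check a three-cluster solution is to cross-tabulate the cluster assignments against the held-out labels. The sketch below does this on the built-in `iris` data, used here as a stand-in (an assumption, so that it runs without extra packages). For the wine data the analogous call would be `table(wine[, 1], km.res$cluster)`. The `nstart = 25` argument is an added choice in this sketch and is not taken from the original code.

```r
# Stand-in data (assumption): iris measurements, with Species as the held-out label.
x <- scale(iris[, 1:4])
labels <- iris$Species

set.seed(1234)
# nstart = 25 reruns k-means from 25 random starts and keeps the best fit.
km <- kmeans(x, centers = 3, nstart = 25)

km$size                                       # observations per cluster
confusion <- table(label = labels, cluster = km$cluster)
confusion                                     # rows: true label, columns: cluster id
```

Cluster ids are arbitrary, so the table is read by checking whether each row concentrates in a single column, not by matching id numbers to label order.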
