Why a better performance (silhouette score) in Python 2.7 than 3.6?
I have read a lot on Stack Overflow about the difference in speed between Python 2.7 and 3.6, but my question is about the quality of the results in the two versions.
For document clustering I used TF-IDF + KMeans, with the silhouette score to evaluate the homogeneity of my clusters.
By switching from Python 3.6 to Python 2.7, my silhouette score increased by about 0.20!
**Would someone have an explanation?** Thanks!
Code:
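from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

tfidf = TfidfVectorizer(
    stop_words=my_stopwords_str,
    max_df=0.95,
    min_df=5,
    token_pattern=r'\w{3,}',
    max_features=20)
tfidf.fit(data_final.all_text)
data_vect = tfidf.transform(data_final.all_text)

# data_vect_lsa is not defined in this snippet (presumably an LSA-reduced data_vect)
num_clusters = 15
kmeans = KMeans(n_clusters=num_clusters, init='k-means++',
                max_iter=300).fit(data_vect_lsa)
# A second, independently initialized KMeans produces the labels that are scored
kmeans_predict = KMeans(n_clusters=num_clusters, init='k-means++', max_iter=300).fit_predict(data_vect_lsa)
silhouette_score(data_vect, labels=kmeans_predict, metric='euclidean')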
The output for Python 2.7 is:
0.58234789374593758
The output for Python 3.6 is:
0.37524101598378656
Try again. A single sample is not enough.
K-means begins with a random initialization and may only find a local optimum.
It's fairly common to see different results when running it multiple times.
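One way to check this is to repeat the clustering with several seeds and look at the spread of silhouette scores before comparing interpreter versions. The following is only a minimal sketch: the `make_blobs` data is a synthetic stand-in for the vectorized documents (the original `data_final` is not available), and `silhouette_for_seed` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the vectorized documents (an assumption; the
# data from the question is not available here).
X, _ = make_blobs(n_samples=300, centers=15, n_features=20, random_state=0)


def silhouette_for_seed(X, seed, n_clusters=15):
    """Fit K-means once with a fixed seed and score the labels on the same X."""
    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=1,
                max_iter=300, random_state=seed)
    labels = km.fit_predict(X)
    return silhouette_score(X, labels, metric="euclidean")


# Ten runs that differ only in the random initialization.
scores = np.array([silhouette_for_seed(X, seed) for seed in range(10)])
print("min=%.3f max=%.3f mean=%.3f std=%.3f"
      % (scores.min(), scores.max(), scores.mean(), scores.std()))
```

If the spread across seeds on your own data is comparable to the 0.20 gap, the difference is probably explained by initialization and not by the Python version. Passing the same `random_state` in both environments makes a single run repeatable, although identical results across different scikit-learn or NumPy versions are still not guaranteed.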