Re-use of clustering algorithm in Python
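I have built a clustering model that segments my data well. I used a two-step process (KMeans, then hierarchical clustering) to avoid the memory problems I ran into when I tried hierarchical clustering directly (see https://www.dummies.com/programming/big-data/data-science/data-科學執行分層聚類與 python/ ).
My question is about how to use this process to score new data. I am trying to keep my code structured, and I would like to "export" and "import" the relevant code, but I don't know how to export the two models. Here is my code: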
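import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import normalize

# col_final_df, km_seg and plot_dendrogram are defined elsewhere (not shown).
data_scaled = normalize(col_final_df)
data_scaled = pd.DataFrame(data_scaled, columns=col_final_df.columns)

# Step 1: KMeans reduces the rows to km_seg centroids.
clustering = KMeans(n_clusters=km_seg, n_init=10, random_state=1)
clustering.fit(data_scaled)
post_clust_centres = clustering.cluster_centers_
post_clust_data_mapping = {case: cluster for case, cluster in enumerate(clustering.labels_)}
print('KMeans analysis complete. Composing hierarchical segmentation of KMeans presently...')

# Step 2: hierarchical clustering of the KMeans centroids.
# Note: scikit-learn 1.2 deprecates `affinity` in favour of `metric`.
Hclustering = AgglomerativeClustering(n_clusters=29, affinity="cosine", linkage="complete")
Hclustering.fit(post_clust_centres)
print('Hierarchical segmentation complete. Composing dendrogram...')

plt.title('Hierarchical Clustering Dendrogram')
plot_dendrogram(Hclustering, labels=Hclustering.labels_)
plt.show()

# Compose the two mappings: row -> KMeans cluster -> hierarchical segment.
H_mapping = {case: cluster for case, cluster in enumerate(Hclustering.labels_)}
final_mapping = {case: H_mapping[post_clust_data_mapping[case]]
                 for case in post_clust_data_mapping}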
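So pickling turned out to be easy, because I can save each fitted object whole and load it again in a new function when needed. I realise this costs a fair amount of I/O, so I will make sure I only do it once.
To pickle the models, I added the following code at the end of the clustering algorithm.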
with open(Config.PATH + '/kmeans.pickle', 'wb') as handle:
    pickle.dump(clustering, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open(Config.PATH + '/hclust.pickle', 'wb') as handle:
    pickle.dump(Hclustering, handle, protocol=pickle.HIGHEST_PROTOCOL)
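On the I/O concern: one way to make sure each pickle is read only once per process is to cache the loader. The following is a minimal sketch, not part of my actual code. `load_model` is a hypothetical helper, the temporary directory stands in for `Config.PATH`, and a plain dict stands in for a fitted model so the example runs without scikit-learn.

```python
import pickle
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_model(path):
    """Unpickle the object at `path`; repeated calls reuse the cached result."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for a fitted model (assumption: any picklable object behaves the same).
demo_dir = Path(tempfile.mkdtemp())
demo_path = str(demo_dir / "kmeans.pickle")
with open(demo_path, "wb") as handle:
    pickle.dump({"name": "kmeans-stand-in"}, handle, protocol=pickle.HIGHEST_PROTOCOL)

first = load_model(demo_path)
second = load_model(demo_path)  # served from the cache, no second disk read
```

The cache does not notice when the file on disk is replaced, so after retraining it would need `load_model.cache_clear()`. Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.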
Then this is the scoring code I use to derive the segment for each vector of new data:
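import pickle

def score_data(data):
    # Load the two fitted models saved at the end of training.
    with open(Config.PATH + "/kmeans.pickle", 'rb') as handle:
        clustering = pickle.load(handle)
    with open(Config.PATH + "/hclust.pickle", 'rb') as handle:
        Hclustering = pickle.load(handle)

    # Apply the same row normalisation that was used for training.
    data_scaled = normalize(data)
    data_scaled = pd.DataFrame(data_scaled, columns=data.columns)

    # Assign each new row to its nearest KMeans centroid; a local variable
    # avoids overwriting the fitted model's labels_ attribute.
    kmeans_labels = clustering.predict(data_scaled)
    post_clust_data_mapping = {case: cluster for case, cluster in enumerate(kmeans_labels)}

    # Hclustering.labels_[i] is the hierarchical segment of KMeans centroid i.
    H_mapping = {case: cluster for case, cluster in enumerate(Hclustering.labels_)}
    final_mapping = {case: H_mapping[post_clust_data_mapping[case]]
                     for case in post_clust_data_mapping}
    final_mapping_ls = list(final_mapping.values())
    return [x + 1 for x in final_mapping_ls]  # 1-based segment numbers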
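Because `Hclustering.labels_[i]` is the hierarchical segment of KMeans centroid `i`, the dictionary lookups above can also be written as a single NumPy indexing step. This is a minimal sketch under assumptions: the data is synthetic, the cluster counts (6 and 3) are made up, and it uses the default Euclidean metric instead of cosine so that it does not depend on the `affinity`/`metric` parameter name of a particular scikit-learn version.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
train = normalize(rng.random((200, 4)))     # synthetic stand-in for col_final_df
new_data = normalize(rng.random((10, 4)))   # synthetic stand-in for unseen rows

# Step 1: KMeans on the rows; step 2: hierarchical clustering of the centroids.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=1).fit(train)
hier = AgglomerativeClustering(n_clusters=3, linkage="complete").fit(kmeans.cluster_centers_)

# hier.labels_[i] is the segment of centroid i, so indexing it with the
# predicted KMeans labels maps every new row straight to its final segment.
kmeans_labels = kmeans.predict(new_data)
segments = hier.labels_[kmeans_labels] + 1  # 1-based, like score_data
```

Routing new rows through `KMeans.predict` is what makes scoring possible here, since `AgglomerativeClustering` has no `predict` method of its own.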