How to optimize the construction of the following dictionary?

Question

Hello, I have two arrays, as follows:

print("list of clusters",y_pred[:10])
print("list of comments",listComments[:10])

Output:

list of comments ['hello This', 'Fabiola hello', 'I am using',  ...

list of clusters [ 2 11  2  2 11  2  2  2  2  2]

The list of clusters was built by applying kmeans to each comment in the list of comments, so the two lists have the same length:

y_pred = kmeans.predict(tfidf)
print("length list comments",len(listComments))
print("length list clusters",len(y_pred))

Output:

length list comments 17223
length list clusters 17223

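Next, I would like to group all comments that belong to a given cluster number. That is, I want to create a dictionary with the cluster number as the key and the list of all comments belonging to that cluster as the value, as follows: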

myDict = {2: ['hello This', 'I am using',...], 11: ['Fabiola hello', ...], ... }

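In this example, the first comment is assigned to key 2 because the first cluster label is 2. The next comment is assigned to cluster 11 because the next label is 11. The label after that is 2 again, so the comment 'I am using' is added to the list for cluster 2.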

I tried to do this as follows:

dict_clusters2 = {}
for i in range(0,len(y_pred)):
    #print(kmeans.labels_[i])
    #print(listComments[i])
    if not y_pred[i] in dict_clusters2:
        dict_clusters2[y_pred[i]] = []
    dict_clusters2[y_pred[i]].append(listComments[i])
print("dictionary constructed")

However, this approach takes a long time to compute, so I would appreciate any suggestions for optimizing it. Thank you for your attention and support.

The Python version I am using is:

3.5.2 |Anaconda 4.2.0 (64-bit)| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
3.5.2

Answer 1

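You are indexing both lists repeatedly inside an O(N) loop, on top of several other operations per iteration. Instead, you can use the zip function to create a lazy iterator of (number, comment) pairs, then use collections.defaultdict() to build the dictionary you want (your case is exactly what it is meant for):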

from collections import defaultdict
pairs = zip(y_pred, listComments)

dict_clusters2 = defaultdict(list)

for num, comment in pairs:
    dict_clusters2[num].append(comment)
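
To see the result on a small input, here is a minimal sketch of the same approach run on the first three labels and comments from the question. The function name group_by_cluster and the sample_* variables are hypothetical, introduced only for illustration. Plain Python lists stand in for y_pred and listComments:

```python
from collections import defaultdict


def group_by_cluster(labels, comments):
    """Group comments by their cluster label in a single pass."""
    groups = defaultdict(list)
    # zip pairs each label with its comment, so no index lookups are needed
    for label, comment in zip(labels, comments):
        groups[label].append(comment)
    # Convert to a plain dict so that missing keys raise KeyError again
    return dict(groups)


# Hypothetical sample data taken from the first three items in the question
sample_labels = [2, 11, 2]
sample_comments = ['hello This', 'Fabiola hello', 'I am using']

grouped = group_by_cluster(sample_labels, sample_comments)
print(grouped)
```

If the labels come from a NumPy array, as kmeans.predict returns, converting them first with y_pred.tolist() yields plain Python ints as keys. This may also speed up the loop, though that is an assumption worth measuring on the real data.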
