
Updating nested list takes too long in Python

I am trying to implement the Brown clustering algorithm in Python.

I have a data structure of clusters = List[List].

At any given time the outer list's length will be at most 40 or 41.

But the inner lists contain English words such as 'the', 'hello', etc.

So I have 8000 words in total (the vocabulary), and initially the first 40 words are put into clusters.

I iterate over my vocabulary from 41 to 8000:

for i in range(41, 8000):
    clusters.append(vocabulary[i])
    # do some computation; this takes very little time
    c1 = computation 1
    c2 = computation 2
    # merge 2 items of the list and delete one item from it
    # ex: if c1 and c2 are items of clusters, then
    clusters[c1] = clusters[c1] + clusters[c2]
    del clusters[c2]

But the time taken by the line clusters[c1] = clusters[c1] + clusters[c2] grows gradually as I iterate over my vocabulary. Initially, for items 41-50, it is 1 second, but for every 20 items in the vocabulary the time grows by 1 second.

On commenting out just clusters[c1] = clusters[c1] + clusters[c2] in my code, I observe that all iterations take constant time. I am not sure how I can speed up this process.

for i in range(41, 8000):
    clusters.append(vocabulary[i])
    c1 = computation 1
    c2 = computation 2
    #clusters[c1] = clusters[c1] + clusters[c2]
    del clusters[c2]

I am new to Stack Overflow, so please excuse any incorrect formatting here.

Thanks

The problem you're running into is that list concatenation is a linear-time operation. Thus, your entire loop is O(n^2) (and that's prohibitively slow for n much larger than 1000). This is ignoring how copying such large lists can be bad for cache performance, etc.
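
To see that effect in isolation, here is a small, hypothetical timing sketch (not from your code; the 500000-element chunk size is arbitrary). Each concatenation of the form merged = merged + chunk builds a brand-new list and copies every element of the accumulated result, so successive merges keep getting slower:

import time

merged = []
for step in range(1, 6):
    chunk = list(range(500000))
    start = time.perf_counter()
    merged = merged + chunk  # builds a new list, copying all of merged plus chunk
    elapsed = time.perf_counter() - start
    print("step {}: {} items, concat took {:.4f}s".format(step, len(merged), elapsed))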

Disjoint Set data structure

The solution I recommend is to use a disjoint-set data structure. This is a tree-based data structure that "self-flattens" as you perform queries, resulting in very fast runtimes for "merging" clusters.

The basic idea is that each word starts off as its own "singleton" tree, and merging clusters consists of making the root of one tree the child of another. This repeats (with some care for balancing) until you have as many clusters as desired.

I've written an example implementation (GitHub link) that assumes the elements of each set are numbers. As long as you have a mapping from vocabulary terms to integers, it should work just fine for your purposes. (Note: I've done some preliminary testing, but I wrote it in 5 minutes just now, so I'd recommend checking my work. ;) )
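
For reference, here is a minimal sketch of what such a structure might look like, assuming union by rank plus path compression; the class and method names (DisjointSet, join, query) match the usage snippet below, but the body is only an illustrative sketch, not the linked implementation:

class DisjointSet:
    """Union-find over elements 0..n-1, using union by rank and path compression."""

    def __init__(self, n):
        self.parent = list(range(n))  # each element starts as the root of its own singleton tree
        self.rank = [0] * n

    def find(self, x):
        # Walk up to the root, flattening the path as we go (path halving).
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def join(self, a, b):
        # Merge the clusters containing a and b; attach the shallower tree under the deeper one.
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

    def query(self, a, b):
        # True if a and b are currently in the same cluster.
        return self.find(a) == self.find(b)

With both optimizations, each join or query runs in effectively constant (inverse-Ackermann) amortized time, which is what keeps the merging loop fast.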

To use this in your code, I would do something like the following:

clusters = DisjointSet(8000)
# some code to merge the first 40 words into clusters
for i in range(41, 8000):
    c1 = some_computation() # assuming c1 is a number
    c2 = some_computation() # assuming c2 is a number
    clusters.join(c1, c2)

# Now, if you want to determine if some word with number k is 
# in the same cluster as a word with number j:
print("{} and {} are in the same cluster? {}".format(j, k, clusters.query(j, k))

Regarding Sets vs Lists

While sets provide faster access times than lists, they actually have a worse runtime when copying. This makes sense in theory, because a set object has to allocate and assign more memory space than a list to maintain an appropriate load factor. Also, it is likely that inserting so many items could result in a "rehash" of the entire hash table, which is a quadratic-time operation in the worst case.

However, practice is what we're concerned with now, so I ran a quick experiment to determine exactly how much worse off sets were than lists.

[Plot: average time to concatenate two lists vs. union two sets, as a function of N]

Code for performing this test, in case anyone is interested, is below. I'm using the Intel distribution of Python, so my performance may be slightly faster than on your machine.

import time
import random
import numpy as np
import matplotlib.pyplot as plt

data = []
for trial in range(5):
    trial_data = []

    for N in range(0, 20000, 50):
        l1 = random.sample(range(1000000), N)
        l2 = random.sample(range(1000000), N)
        s1 = set(l1)
        s2 = set(l2)

        # Time to concatenate two lists of length N
        start_lst = time.perf_counter()
        l3 = l1 + l2
        stop_lst = time.perf_counter()

        # Time to union two sets of length N
        start_set = time.perf_counter()
        s3 = s1 | s2
        stop_set = time.perf_counter()

        trial_data.append([N, stop_lst - start_lst, stop_set - start_set])
    data.append(trial_data)

# average the trials and plot
data_array = np.array(data)
avg_data = np.average(data_array, 0)

fig = plt.figure()
ax = plt.gca()
ax.plot(avg_data[:,0], avg_data[:,1], label='Lists')
ax.plot(avg_data[:,0], avg_data[:,2], label='Sets')
ax.set_xlabel('Length of set or list (N)')
ax.set_ylabel('Seconds to union or concat (s)')
plt.legend(loc=2)
plt.show()
