
fast way to find occurrences in a list in python

I have a set of unique words called h_unique. I also have a 2D list of documents called h_tokenized_doc which has a structure like:

[ ['hello', 'world', 'i', 'am'], 
  ['hello', 'stackoverflow', 'i', 'am'], 
  ['hello', 'world', 'i', 'am', 'mr'], 
  ['hello', 'stackoverflow', 'i', 'am', 'pycahrm'] ]

and h_unique as:

('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')

What I want is to find the occurrences of the unique words in the tokenized documents list.
So far I came up with this code, but it seems to be VERY slow. Is there any efficient way to do this?

term_id = []
for term in h_unique:
    print term
    for doc_id, doc in enumerate(h_tokenized_doc):
        term_id.append([doc_id for t in doc if t == term])

In my case I have a document list of 7000 documents, structured like:

[ [doc1], [doc2], [doc3], ..... ]

It'll be slow because you're running through your entire document list once for every unique word. Why not try storing the unique words in a dictionary and appending to it for each word found?

unique_dict = {term: [] for term in h_unique}
for doc_id, doc in enumerate(h_tokenized_doc):
    for term_id, term in enumerate(doc):
        try:
            # Not sure what structure you want to keep it in here...
            # This stores a tuple of the doc, and position in that doc
            unique_dict[term].append((doc_id, term_id))
        except KeyError:
            # If the term isn't in h_unique, don't do anything
            pass

This runs through all the documents only once.

From your above example, unique_dict would be:

{'pycharm': [], 'i': [(0, 2), (1, 2), (2, 2), (3, 2)], 'stackoverflow': [(1, 1), (3, 1)], 'am': [(0, 3), (1, 3), (2, 3), (3, 3)], 'mr': [(2, 4)], 'world': [(0, 1), (2, 1)], 'hello': [(0, 0), (1, 0), (2, 0), (3, 0)]}

(Of course assuming the typo 'pycahrm' in your example was deliberate)

term_id.append([doc_id for t in doc if t == term])

This will not append one doc_id for each matching term; it will append an entire list of potentially many identical values of doc_id. Surely you did not mean to do this.

Based on your sample code, term_id ends up as this:

[[0], [1], [2], [3], [0], [], [2], [], [0], [1], [2], [3], [0], [1], [2], [3], [], [1], [], [3], [], [], [2], [], [], [], [], []]

Is this really what you intended?
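If the goal is one doc_id per document that contains the term, a minimal corrected sketch of the original loop (keeping its per-term outer structure, so still scanning every document once per term) could look like this:

```python
# Hypothetical fix for the original loop: collect each doc_id once per
# document that contains the term, instead of one entry per matching token.
h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]
h_unique = ('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')

term_id = {term: [doc_id for doc_id, doc in enumerate(h_tokenized_doc)
                  if term in doc]
           for term in h_unique}
```

With the sample data this gives, for example, [0, 1, 2, 3] for 'hello' and [0, 2] for 'world', and an empty list for 'pycharm'.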

If I understood correctly, and based on your comment to the question where you say

yes, because a single term may appear in multiple docs; like in the above case, for hello the result is [0, 1, 2, 3] and for world it is [0, 2]

it looks like what you want to do is: for each of the words in the h_unique list (which, as mentioned, should be a set, or the keys of a dict, both of which have O(1) search access), go through all the lists contained in the h_tokenized_doc variable and find the indexes of the lists in which the word appears.

IF that's actually what you want to do, you could do something like the following:

#!/usr/bin/env python

h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]

h_unique = ['hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm']

# Initialize a dict with empty lists as the value and the items 
# in h_unique the keys
results = {k: [] for k in h_unique}

for i, line in enumerate(h_tokenized_doc):
    for k in results:
        if k in line:
            results[k].append(i)
print results

Which outputs:

{'pycharm': [], 'i': [0, 1, 2, 3], 'stackoverflow': [1, 3], 'am': [0, 1, 2, 3], 'mr': [2], 'world': [0, 2], 'hello': [0, 1, 2, 3]}

The idea is using the h_unique list as keys in a dictionary (the results = {k: [] for k in h_unique} part).

Keys in dictionaries have the advantage of constant lookup time, which is great for the if k in line: part (if k were looked up in a list instead, that in would take O(n)): we check whether the word (the key k) appears in the list, and if it does, append the index of that list within the matrix to the dictionary of results.
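To make the lookup-cost point concrete, here is a small illustrative sketch (not part of the answer's code) contrasting a membership test on a list, which scans element by element, with one on a set, which hashes the key once:

```python
# Illustrative only: the same membership test against a list and a set.
vocabulary = ['hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm']
vocabulary_set = set(vocabulary)

found_in_list = 'pycharm' in vocabulary      # linear scan, O(n)
found_in_set = 'pycharm' in vocabulary_set   # average-case O(1) hash lookup
```

Both tests return the same result; the difference only shows up as the collection grows, which matters when the check runs once per token across 7000 documents.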

Although I'm not certain this is what you want to achieve.

You can optimize your code to do the trick by:

  1. Using just a single for loop
  2. Using generators and dictionaries for constant lookup time, as suggested previously. Generators are faster than for loops because they produce values on the fly.

     In [75]: h_tokenized_doc = [['hello', 'world', 'i', 'am'],
         ...:                    ['hello', 'stackoverflow', 'i', 'am'],
         ...:                    ['hello', 'world', 'i', 'am', 'mr'],
         ...:                    ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]

     In [76]: h_unique = ('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')

     In [77]: term_id = {k: [] for k in h_unique}

     In [78]: for term in h_unique:
         ...:     term_id[term].extend(i for i in range(len(h_tokenized_doc))
         ...:                          if term in h_tokenized_doc[i])

    which yields the output

     {'am': [0, 1, 2, 3], 'hello': [0, 1, 2, 3], 'i': [0, 1, 2, 3], 'mr': [2], 'pycharm': [], 'stackoverflow': [1, 3], 'world': [0, 2]} 

A more descriptive solution would be

In [79]: for term in h_unique:
    ...:     term_id[term].extend([(i,h_tokenized_doc[i].index(term)) for i in range(len(h_tokenized_doc)) if term in h_tokenized_doc[i]])


In [80]: term_id
Out[80]: 
{'am': [(0, 3), (1, 3), (2, 3), (3, 3)],
 'hello': [(0, 0), (1, 0), (2, 0), (3, 0)],
 'i': [(0, 2), (1, 2), (2, 2), (3, 2)],
 'mr': [(2, 4)],
 'pycharm': [],
 'stackoverflow': [(1, 1), (3, 1)],
 'world': [(0, 1), (2, 1)]}
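As one further variant (not taken from either answer above), the same (doc_id, position) index can be built in a single pass over the tokens with collections.defaultdict; unlike the .index(term) version, this records every occurrence of a term within a document, not just the first:

```python
from collections import defaultdict

h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]
h_unique = ('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')

# Single pass over every token: map term -> list of (doc_id, position).
index = defaultdict(list)
for doc_id, doc in enumerate(h_tokenized_doc):
    for pos, term in enumerate(doc):
        index[term].append((doc_id, pos))

# Keep only the terms we were asked about; missing terms get [].
term_id = {term: index.get(term, []) for term in h_unique}
```

This touches each token exactly once, so the cost is proportional to the total number of tokens rather than (number of terms) × (number of documents).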
