
fast way to find occurrences in a list in python

I have a set of unique words called h_unique. I also have a 2D list of documents called h_tokenized_doc which has a structure like:

[ ['hello', 'world', 'i', 'am'], 
  ['hello', 'stackoverflow', 'i', 'am'], 
  ['hello', 'world', 'i', 'am', 'mr'], 
  ['hello', 'stackoverflow', 'i', 'am', 'pycahrm'] ]

and h_unique as:

('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')

What I want is to find the occurrences of the unique words in the tokenized documents list.
So far I came up with this code, but it seems to be VERY slow. Is there any efficient way to do this?

term_id = []
for term in h_unique:
    print term
    for doc_id, doc in enumerate(h_tokenized_doc):
        term_id.append([doc_id for t in doc if t == term])

In my case I have a document list of 7000 documents, structured like:

[ [doc1], [doc2], [doc3], ..... ]

It'll be slow because you're running through your entire document list once for every unique word. Why not try storing the unique words in a dictionary and appending to it for each word found?

unique_dict = {term: [] for term in h_unique}
for doc_id, doc in enumerate(h_tokenized_doc):
    for term_id, term in enumerate(doc):
        try:
            # Not sure what structure you want to keep it in here...
            # This stores a tuple of the doc, and position in that doc
            unique_dict[term].append((doc_id, term_id))
        except KeyError:
            # If the term isn't in h_unique, don't do anything
            pass

This runs through all the documents only once.

From your above example, unique_dict would be:

{'pycharm': [], 'i': [(0, 2), (1, 2), (2, 2), (3, 2)], 'stackoverflow': [(1, 1), (3, 1)], 'am': [(0, 3), (1, 3), (2, 3), (3, 3)], 'mr': [(2, 4)], 'world': [(0, 1), (2, 1)], 'hello': [(0, 0), (1, 0), (2, 0), (3, 0)]}

(Of course assuming the typo 'pycahrm' in your example was deliberate)

term_id.append([doc_id for t in doc if t == term])

This will not append one doc_id for each matching term; it will append an entire list of potentially many identical values of doc_id. Surely you did not mean to do this.

Based on your sample code, term_id ends up as this:

[[0], [1], [2], [3], [0], [], [2], [], [0], [1], [2], [3], [0], [1], [2], [3], [], [1], [], [3], [], [], [2], [], [], [], [], []]

Is this really what you intended?
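If the goal is one doc_id per document that contains the term, a minimal corrected sketch of the original loop (keeping its per-term outer structure, so still scanning every document once per term) could look like this:

```python
# Hypothetical fix for the original loop: collect each doc_id once per
# document that contains the term, instead of one entry per matching token.
h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]
h_unique = ('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')

term_id = {term: [doc_id for doc_id, doc in enumerate(h_tokenized_doc)
                  if term in doc]
           for term in h_unique}
```

With the sample data this gives, for example, [0, 1, 2, 3] for 'hello' and [0, 2] for 'world', and an empty list for 'pycharm'.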

If I understood correctly, and based on your comment to the question where you say

yes, because a single term may appear in multiple docs; like in the above case, for hello the result is [0, 1, 2, 3] and for world it is [0, 2]

it looks like what you want to do is: for each of the words in the h_unique list (which, as mentioned, should be a set, or the keys of a dict, both of which have O(1) search access), go through all the lists contained in the h_tokenized_doc variable and find the indexes of the lists in which the word appears.

IF that's actually what you want to do, you could do something like the following:

#!/usr/bin/env python

h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]

h_unique = ['hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm']

# Initialize a dict with empty lists as the value and the items 
# in h_unique the keys
results = {k: [] for k in h_unique}

for i, line in enumerate(h_tokenized_doc):
    for k in results:
        if k in line:
            results[k].append(i)
print results

Which outputs:

{'pycharm': [], 'i': [0, 1, 2, 3], 'stackoverflow': [1, 3], 'am': [0, 1, 2, 3], 'mr': [2], 'world': [0, 2], 'hello': [0, 1, 2, 3]}

The idea is using the h_unique list as keys in a dictionary (the results = {k: [] for k in h_unique} part).

Keys in dictionaries have the advantage of constant lookup time, which is great for the if k in line: part (if k were looked up in a list instead, that in would take O(n)): we check whether the word (the key k) appears in the list, and if it does, append the index of that list within the matrix to the dictionary of results.
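To make the lookup-cost point concrete, here is a small illustrative sketch (not part of the answer's code) contrasting a membership test on a list, which scans element by element, with one on a set, which hashes the key once:

```python
# Illustrative only: the same membership test against a list and a set.
vocabulary = ['hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm']
vocabulary_set = set(vocabulary)

found_in_list = 'pycharm' in vocabulary      # linear scan, O(n)
found_in_set = 'pycharm' in vocabulary_set   # average-case O(1) hash lookup
```

Both tests return the same result; the difference only shows up as the collection grows, which matters when the check runs once per token across 7000 documents.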

Although I'm not certain this is what you want to achieve.

You can optimize your code to do the trick by:

  1. Using just a single for loop
  2. Using generators and dictionaries for constant lookup time, as suggested previously. Generators are faster than for loops because they produce values on the fly.

     In [75]: h_tokenized_doc = [['hello', 'world', 'i', 'am'],
         ...:                    ['hello', 'stackoverflow', 'i', 'am'],
         ...:                    ['hello', 'world', 'i', 'am', 'mr'],
         ...:                    ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]

     In [76]: h_unique = ('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')

     In [77]: term_id = {k: [] for k in h_unique}

     In [78]: for term in h_unique:
         ...:     term_id[term].extend(i for i in range(len(h_tokenized_doc))
         ...:                          if term in h_tokenized_doc[i])

    which yields the output

     {'am': [0, 1, 2, 3], 'hello': [0, 1, 2, 3], 'i': [0, 1, 2, 3], 'mr': [2], 'pycharm': [], 'stackoverflow': [1, 3], 'world': [0, 2]} 

A more descriptive solution would be

In [79]: for term in h_unique:
    ...:     term_id[term].extend([(i,h_tokenized_doc[i].index(term)) for i in range(len(h_tokenized_doc)) if term in h_tokenized_doc[i]])


In [80]: term_id
Out[80]: 
{'am': [(0, 3), (1, 3), (2, 3), (3, 3)],
 'hello': [(0, 0), (1, 0), (2, 0), (3, 0)],
 'i': [(0, 2), (1, 2), (2, 2), (3, 2)],
 'mr': [(2, 4)],
 'pycharm': [],
 'stackoverflow': [(1, 1), (3, 1)],
 'world': [(0, 1), (2, 1)]}
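As one further variant (not taken from either answer above), the same (doc_id, position) index can be built in a single pass over the tokens with collections.defaultdict; unlike the .index(term) version, this records every occurrence of a term within a document, not just the first:

```python
from collections import defaultdict

h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]
h_unique = ('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')

# Single pass over every token: map term -> list of (doc_id, position).
index = defaultdict(list)
for doc_id, doc in enumerate(h_tokenized_doc):
    for pos, term in enumerate(doc):
        index[term].append((doc_id, pos))

# Keep only the terms we were asked about; missing terms get [].
term_id = {term: index.get(term, []) for term in h_unique}
```

This touches each token exactly once, so the cost is proportional to the total number of tokens rather than (number of terms) × (number of documents).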
