Fast way to find occurrences in a Python list

Question

I have a collection of unique words called h_unique. I also have a 2D list of documents called h_tokenized_doc, structured like this:

[ ['hello', 'world', 'i', 'am'], 
  ['hello', 'stackoverflow', 'i', 'am'], 
  ['hello', 'world', 'i', 'am', 'mr'], 
  ['hello', 'stackoverflow', 'i', 'am', 'pycahrm'] ]

and h_unique is:

('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')

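What I want is to find the occurrences of each unique word in the list of tokenized documents.
So far I have come up with this code, but it seems to be very slow. Is there a more efficient way to do this?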

term_id = []
for term in h_unique:
    print(term)
    for doc_id, doc in enumerate(h_tokenized_doc):
        term_id.append([doc_id for t in doc if t == term])

In my case the document list contains 7000 documents, structured like this:

[ [doc1], [doc2], [doc3], ..... ]

Answer 1

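This is slow because you traverse the entire document list once for every unique word. Why not store the unique words in a dictionary and append to a word's entry each time that word is found?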

unique_dict = {term: [] for term in h_unique}
for doc_id, doc in enumerate(h_tokenized_doc):
    for term_id, term in enumerate(doc):
        try:
            # Not sure what structure you want to keep it in here...
            # This stores a tuple of the doc, and position in that doc
            unique_dict[term].append((doc_id, term_id))
        except KeyError:
            # If the term isn't in h_unique, don't do anything
            pass

This makes only a single pass over all the documents.

For your example above, unique_dict would be:

{'pycharm': [], 'i': [(0, 2), (1, 2), (2, 2), (3, 2)], 'stackoverflow': [(1, 1), (3, 1)], 'am': [(0, 3), (1, 3), (2, 3), (3, 3)], 'mr': [(2, 4)], 'world': [(0, 1), (2, 1)], 'hello': [(0, 0), (1, 0), (2, 0), (3, 0)]}

(This assumes the typo 'pycahrm' in your example, of course, which is why 'pycharm' has no matches.)
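The same single-pass idea can be sketched in Python 3 with collections.defaultdict instead of try/except. The function name build_positions is hypothetical and not part of the original answer; this is one possible arrangement, not the only one.

```python
from collections import defaultdict


def build_positions(unique_terms, tokenized_docs):
    """Map each term to a list of (doc_id, position) pairs in one pass."""
    vocabulary = set(unique_terms)  # set membership is O(1) on average
    found = defaultdict(list)
    for doc_id, doc in enumerate(tokenized_docs):
        for pos, term in enumerate(doc):
            if term in vocabulary:  # skip words that are not in h_unique
                found[term].append((doc_id, pos))
    # Plain dict, with an empty list for terms that never occurred
    return {term: found.get(term, []) for term in unique_terms}


h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]
h_unique = ('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')

positions = build_positions(h_unique, h_tokenized_doc)
```

The work here is proportional to the total number of tokens, rather than the number of tokens multiplied by the number of unique words.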

Answer 2

term_id.append([doc_id for t in doc if t == term])

does not append one doc_id per match; it appends a whole list, which may contain the same doc_id value many times. Surely that is not what you intended.

With your sample code, term_id ends up looking like this:

[[0], [1], [2], [3], [0], [], [2], [], [0], [1], [2], [3], [0], [1], [2], [3], [], [1], [], [3], [], [], [2], [], [], [], [], []]

Is this really what you want?
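To make the difference concrete, here is a small sketch on made-up data (the names docs, per_doc_lists and flat_doc_ids are illustrative only). It contrasts the nested lists the original comprehension builds with a flat list holding one doc_id per document that contains the term, which may or may not be what the asker is after.

```python
docs = [['hello', 'world'], ['hello', 'hello', 'there']]
term = 'hello'

# Same comprehension as in the question: one list per document, with
# doc_id repeated once for every matching token in that document.
per_doc_lists = [[doc_id for t in doc if t == term]
                 for doc_id, doc in enumerate(docs)]
# per_doc_lists == [[0], [1, 1]]

# One doc_id per document that contains the term at least once.
flat_doc_ids = [doc_id for doc_id, doc in enumerate(docs) if term in doc]
# flat_doc_ids == [0, 1]
```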

Answer 3

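If I understand correctly, and based on your comment on the question:

"Yes, because a single term may appear in multiple documents; in the case above, for example, the result for hello is [0, 1, 2, 3] and for world it is [0, 2]"

It looks like what you want to do is this: for each word in the h_unique list (which, as mentioned, should be a set or the keys of a dict, both of which offer O(1) lookup), go through all the lists contained in the h_tokenized_doc variable and find the indices of the lists in which the word appears.

If that is indeed what you want, you could do the following: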

#!/usr/bin/env python

h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]

h_unique = ['hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm']

# Initialize a dict with empty lists as the value and the items 
# in h_unique the keys
results = {k: [] for k in h_unique}

for i, line in enumerate(h_tokenized_doc):
    for k in results:
        if k in line:
            results[k].append(i)
print(results)

Which outputs:

{'pycharm': [], 'i': [0, 1, 2, 3], 'stackoverflow': [1, 3], 'am': [0, 1, 2, 3], 'mr': [2], 'world': [0, 2], 'hello': [0, 1, 2, 3]}

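The idea is to use the h_unique list as the keys of a dictionary (the results = {k: [] for k in h_unique} part).

Dictionary keys have the advantage of constant-time lookup, so reaching results[k] is cheap. Note that the if k in line: test itself still scans line, and for a list that in takes O(n); only a set or a dict gives O(1) membership checks. The loop checks whether the word (the key k) appears in the list and, if so, appends the index of that list within the matrix to the results dictionary.

That said, I am not sure this is what you are trying to achieve.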
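One possible variation, assuming Python 3 (where dict key views support set operations): intersect each document's words with the dictionary keys, so each document is turned into a set once instead of being scanned once per key. This is a sketch of that idea, not the code from the answer above.

```python
h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]
h_unique = ['hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm']

results = {k: [] for k in h_unique}

for i, line in enumerate(h_tokenized_doc):
    # Words that occur in this document AND are keys of results
    for word in results.keys() & set(line):
        results[word].append(i)
```

Because documents are visited in order, each list of indices comes out sorted; the order of the words within a single intersection is arbitrary but does not affect the result.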

Answer 4

You can optimize your code to solve the problem by:

Using only a single explicit for loop

Using a dictionary for constant-time lookup, as suggested earlier, and filling it with a generator expression. A generator expression yields its values on the fly, so no intermediate list is built.

In [75]: h_tokenized_doc = [ ['hello', 'world', 'i', 'am'],
    ...:   ['hello', 'stackoverflow', 'i', 'am'],
    ...:   ['hello', 'world', 'i', 'am', 'mr'],
    ...:   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm'] ]

In [76]: h_unique = ('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')

In [77]: term_id = {k: [] for k in h_unique}

In [78]: for term in h_unique:
    ...:     term_id[term].extend(i for i in range(len(h_tokenized_doc)) if term in h_tokenized_doc[i])

Which produces the output

 {'am': [0, 1, 2, 3], 'hello': [0, 1, 2, 3], 'i': [0, 1, 2, 3], 'mr': [2], 'pycharm': [], 'stackoverflow': [1, 3], 'world': [0, 2]}

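A more descriptive solution, which also records the position of the term within each document (starting again from an empty term_id), is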

In [79]: for term in h_unique:
    ...:     term_id[term].extend([(i,h_tokenized_doc[i].index(term)) for i in range(len(h_tokenized_doc)) if term in h_tokenized_doc[i]])


In [80]: term_id
Out[80]: 
{'am': [(0, 3), (1, 3), (2, 3), (3, 3)],
 'hello': [(0, 0), (1, 0), (2, 0), (3, 0)],
 'i': [(0, 2), (1, 2), (2, 2), (3, 2)],
 'mr': [(2, 4)],
 'pycharm': [],
 'stackoverflow': [(1, 1), (3, 1)],
 'world': [(0, 1), (2, 1)]}
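Note that list.index returns only the first position of a term, so a word repeated within one document is recorded once. If every occurrence matters, a sketch along the following lines may help. The data and the name all_positions are made up for illustration, and the approach still scans every token once per term.

```python
docs = [['a', 'b', 'a'], ['b', 'c']]
terms = ('a', 'b', 'c')

# For each term, collect (doc index, token index) for every occurrence,
# using enumerate instead of list.index so repeats are not lost.
all_positions = {
    term: [(i, j)
           for i, doc in enumerate(docs)
           for j, token in enumerate(doc)
           if token == term]
    for term in terms
}
# all_positions['a'] == [(0, 0), (0, 2)]
```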

4 answers

Answer 1
2, accepted, 2016-10-22 22:43:22

Answer 2
1, 2016-10-22 22:23:04

Answer 3
1, 2016-10-22 22:40:03

Answer 4
1, 2016-10-22 22:41:46
