Fast way to find occurrences in a list in Python
I have a set of unique words called h_unique. I also have a 2D list of documents called h_tokenized_doc, which has a structure like:
[ ['hello', 'world', 'i', 'am'],
  ['hello', 'stackoverflow', 'i', 'am'],
  ['hello', 'world', 'i', 'am', 'mr'],
  ['hello', 'stackoverflow', 'i', 'am', 'pycahrm'] ]
and h_unique as:
('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')
What I want is to find the occurrences of the unique words in the tokenized documents list.
So far I came up with this code, but it seems to be VERY slow. Is there any efficient way to do this?
term_id = []
for term in h_unique:
    print term
    for doc_id, doc in enumerate(h_tokenized_doc):
        term_id.append([doc_id for t in doc if t == term])
In my case I have a list of 7000 documents, structured like:
[ [doc1], [doc2], [doc3], ..... ]
It'll be slow because you're running through your entire document list once for every unique word. Why not try storing the unique words in a dictionary and appending to it for each word found?
unique_dict = {term: [] for term in h_unique}
for doc_id, doc in enumerate(h_tokenized_doc):
    for term_id, term in enumerate(doc):
        try:
            # Not sure what structure you want to keep it in here...
            # This stores a tuple of the doc, and position in that doc
            unique_dict[term].append((doc_id, term_id))
        except KeyError:
            # If the term isn't in h_unique, don't do anything
            pass
This runs through all the documents only once.
From your example above, unique_dict would be:
{'pycharm': [], 'i': [(0, 2), (1, 2), (2, 2), (3, 2)], 'stackoverflow': [(1, 1), (3, 1)], 'am': [(0, 3), (1, 3), (2, 3), (3, 3)], 'mr': [(2, 4)], 'world': [(0, 1), (2, 1)], 'hello': [(0, 0), (1, 0), (2, 0), (3, 0)]}
(Of course, assuming the typo 'pycahrm' in your example was deliberate.)
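If you only need the document ids (as in your comment on the question) rather than the positions, the positional tuples above can be collapsed afterwards. A small sketch using the same sample data:

```python
h_unique = ('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')
h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]

# Build the (doc_id, term_id) positions exactly as above
unique_dict = {term: [] for term in h_unique}
for doc_id, doc in enumerate(h_tokenized_doc):
    for term_id, term in enumerate(doc):
        if term in unique_dict:  # skip words that aren't in h_unique
            unique_dict[term].append((doc_id, term_id))

# Collapse to sorted, de-duplicated document ids per term
doc_ids = {term: sorted({d for d, _ in pos})
           for term, pos in unique_dict.items()}
```

The set comprehension also de-duplicates, so a term occurring twice in the same document still contributes that document's id only once.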
term_id.append([doc_id for t in doc if t == term])
This will not append one doc_id for each matching term; it will append an entire list of potentially many identical doc_id values. Surely you did not mean to do this.
Based on your sample code, term_id ends up as this:
[[0], [1], [2], [3], [0], [], [2], [], [0], [1], [2], [3], [0], [1], [2], [3], [], [1], [], [3], [], [], [2], [], [], [], [], []]
Is this really what you intended?
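If what you actually meant was one doc_id per document that contains the term, the inner comprehension can be replaced by a simple membership test. A minimal sketch of that fix:

```python
h_unique = ('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')
h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]

# One entry per term: the ids of the documents that contain it
term_id = {term: [doc_id for doc_id, doc in enumerate(h_tokenized_doc)
                  if term in doc]
           for term in h_unique}
```

This still scans every document once per term, so it fixes the output shape but not the complexity; the answers below address the speed.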
If I understood correctly, based on your comment on the question where you say:

yes because a single term may appear in multiple docs like in the above case for hello the result is [0, 1, 2, 3] and for world it is [0, 2]
it looks like what you want to do is: for each of the words in the h_unique list (which, as mentioned, should be a set, or the keys of a dict, both of which have O(1) lookup), go through all the lists contained in the h_tokenized_doc variable and find the indexes of the lists in which the word appears.
IF that's actually what you want to do, you could do something like the following:
#!/usr/bin/env python

h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]

h_unique = ['hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm']

# Initialize a dict with the items in h_unique as the keys
# and empty lists as the values
results = {k: [] for k in h_unique}

for i, line in enumerate(h_tokenized_doc):
    for k in results:
        if k in line:
            results[k].append(i)

print results
Which outputs:
{'pycharm': [], 'i': [0, 1, 2, 3], 'stackoverflow': [1, 3], 'am': [0, 1, 2, 3], 'mr': [2], 'world': [0, 2], 'hello': [0, 1, 2, 3]}
The idea is to use the h_unique list as the keys of a dictionary (the results = {k: [] for k in h_unique} part).
Keys in dictionaries have the advantage of constant lookup time, which is great for membership tests (on a list, in takes O(n)). We then check whether the word (the key k) appears in the document list, and if it does, append the index of that list within the matrix to the dictionary of results.
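Note that `line` itself is still a list, so `if k in line:` is a linear scan of each document. For long documents you can convert each one to a set once up front, making the membership test O(1) on average. A sketch of that variant, assuming the same sample data:

```python
h_unique = ['hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm']
h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]

# Convert each document to a set once, up front
doc_sets = [set(doc) for doc in h_tokenized_doc]

# Now every membership test is O(1) on average
results = {k: [i for i, s in enumerate(doc_sets) if k in s]
           for k in h_unique}
```

The one-time conversion costs O(total tokens), which is quickly amortized when there are many unique words to look up.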
Although I'm not certain this is what you want to achieve, though.
You can optimize your code to do the trick with dictionaries for constant lookup time, as suggested previously, together with generator expressions, which produce values on the fly instead of building intermediate lists.
In [75]: h_tokenized_doc = [['hello', 'world', 'i', 'am'],
    ...:                    ['hello', 'stackoverflow', 'i', 'am'],
    ...:                    ['hello', 'world', 'i', 'am', 'mr'],
    ...:                    ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]

In [76]: h_unique = ('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')

In [77]: term_id = {k: [] for k in h_unique}

In [78]: for term in h_unique:
    ...:     term_id[term].extend(i for i in range(len(h_tokenized_doc))
    ...:                          if term in h_tokenized_doc[i])
which yields the output:
{'am': [0, 1, 2, 3], 'hello': [0, 1, 2, 3], 'i': [0, 1, 2, 3], 'mr': [2], 'pycharm': [], 'stackoverflow': [1, 3], 'world': [0, 2]}
A more descriptive solution would be:
In [79]: for term in h_unique:
...: term_id[term].extend([(i,h_tokenized_doc[i].index(term)) for i in range(len(h_tokenized_doc)) if term in h_tokenized_doc[i]])
In [80]: term_id
Out[80]:
{'am': [(0, 3), (1, 3), (2, 3), (3, 3)],
'hello': [(0, 0), (1, 0), (2, 0), (3, 0)],
'i': [(0, 2), (1, 2), (2, 2), (3, 2)],
'mr': [(2, 4)],
'pycharm': [],
'stackoverflow': [(1, 1), (3, 1)],
'world': [(0, 1), (2, 1)]}
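The same positional result can also be built in a single pass over the documents with `collections.defaultdict`, avoiding one full scan per unique word. A sketch, using the sample data from above:

```python
from collections import defaultdict

h_unique = ('hello', 'world', 'i', 'am', 'stackoverflow', 'mr', 'pycharm')
h_tokenized_doc = [['hello', 'world', 'i', 'am'],
                   ['hello', 'stackoverflow', 'i', 'am'],
                   ['hello', 'world', 'i', 'am', 'mr'],
                   ['hello', 'stackoverflow', 'i', 'am', 'pycahrm']]

# Record every (doc_id, position) pair in one pass over all tokens
positions = defaultdict(list)
for doc_id, doc in enumerate(h_tokenized_doc):
    for pos, token in enumerate(doc):
        positions[token].append((doc_id, pos))

# Keep only the words in h_unique; absent words map to []
term_id = {term: positions.get(term, []) for term in h_unique}
```

This is O(total tokens) rather than O(unique words x documents), which matters for a corpus of 7000 documents.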