Count occurrences of a couple of specific words
I have a list of words, let's say: ["foo", "bar", "baz"], and a large string in which these words may occur.
I now use the "string".count("word") method for every word in the list. This works OK, but seems rather inefficient: for every extra word added to the list, the entire string must be iterated over an extra time.
Is there any better method to do this, or should I implement a custom method that iterates over the large string a single time, checking each character to see whether one of the words in the list has been reached?
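For reference, the approach the question describes might look like the sketch below (names are illustrative). Note that str.count matches substrings, so it would also count "foo" inside "foobar"; that is a separate caveat the answers below address by splitting on word boundaries.

```python
# Naive approach: one full pass over the string per word in the list.
words = ["foo", "bar", "baz"]
text = "foo bar baz bar quux foo bla bla"

counts = {w: text.count(w) for w in words}
print(counts)  # {'foo': 2, 'bar': 2, 'baz': 1}
```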
To be clear:
Make a dict-typed frequency table for your words, then iterate over the words in your string.
import re

vocab = ["foo", "bar", "baz"]
s = "foo bar baz bar quux foo bla bla"
wordcount = dict((x, 0) for x in vocab)
for w in re.findall(r"\w+", s):
    if w in wordcount:
        wordcount[w] += 1
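As a quick check (my addition, not part of the original answer), running that frequency-table approach on the sample string gives the expected per-word tallies:

```python
import re

vocab = ["foo", "bar", "baz"]
s = "foo bar baz bar quux foo bla bla"

# Pre-seed the table so only vocabulary words are counted.
wordcount = dict((x, 0) for x in vocab)
for w in re.findall(r"\w+", s):
    if w in wordcount:
        wordcount[w] += 1

print(wordcount)  # {'foo': 2, 'bar': 2, 'baz': 1}
```

Words outside the vocabulary ("quux", "bla") are skipped rather than counted, which is the point of seeding the dict first.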
Edit: if the "words" in your list contain whitespace, you can instead build an RE out of them:
import re
from collections import Counter

vocab = ["foo bar", "baz"]
r = re.compile("|".join(r"\b%s\b" % w for w in vocab))
wordcount = Counter(re.findall(r, s))
Explanation: this builds the RE r'\bfoo bar\b|\bbaz\b' from the vocabulary. findall then finds the list ['baz', 'foo bar'], and the Counter (Python 2.7+) counts the occurrence of each distinct element in it. Watch out that your list of words should not contain characters that are special to REs, such as ()[]\.
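If the vocabulary might contain RE metacharacters, one workaround (my addition, not part of the original answer) is to pass each word through re.escape before joining, so every word is matched literally:

```python
import re
from collections import Counter

vocab = ["foo bar", "baz", "a+b"]  # "a+b" contains the RE metacharacter "+"
s = "foo bar baz a+b baz foo bar"

# re.escape neutralizes special characters, so "+" is matched literally
# instead of being treated as a repetition operator.
r = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in vocab))
wordcount = Counter(re.findall(r, s))
print(wordcount)  # Counter({'foo bar': 2, 'baz': 2, 'a+b': 1})
```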
Presuming the words need to be found separately (that is, you want to count words as produced by str.split()):
Edit: as suggested in the comments, a Counter is a good option here:
from collections import Counter

def count_many(needles, haystack):
    count = Counter(haystack.split())
    return {key: count[key] for key in count if key in needles}
Which runs like so:
count_many(["foo", "bar", "baz"], "testing somefoothing foo bar baz bax foo foo foo bar bar test bar test")
{'baz': 1, 'foo': 4, 'bar': 4}
Note that in Python <= 2.6(?) you will need to use return dict((key, count[key]) for key in count if key in needles) due to the lack of dict comprehensions.
Of course, another option is to simply return the whole Counter object and only get the values you need when you need them, as it may not be a problem to have the extra values, depending on the situation.
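Returning the whole Counter is convenient partly because Counter yields 0 for absent keys instead of raising KeyError; a brief sketch (my addition, illustrating the option above):

```python
from collections import Counter

def count_all(haystack):
    # Count every word in one pass; callers filter to what they need.
    return Counter(haystack.split())

counts = count_all("foo bar baz bax foo foo bar")
print(counts["foo"])   # 3
print(counts["quux"])  # 0 -- Counter returns 0 for missing keys
```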
Old answer:
from collections import defaultdict

def count_many(needles, haystack):
    count = defaultdict(int)
    for word in haystack.split():
        if word in needles:
            count[word] += 1
    return count
Which results in:
count_many(["foo", "bar", "baz"], "testing somefoothing foo bar baz bax foo foo foo bar bar test bar test")
defaultdict(<class 'int'>, {'baz': 1, 'foo': 4, 'bar': 4})
If you greatly object to getting a defaultdict back (which you shouldn't, as it functions exactly the same as a dict when accessing), then you can do return dict(count) instead to get a normal dictionary.
The Counter method doesn't work well for large vocabularies. In the example below, CountVectorizer is many times faster:
import random
import re
import time
from collections import Counter

from numpy import array
from sklearn.feature_extraction.text import CountVectorizer

longstring = ["foo", "bar", "baz", "qux", "thud"] * 100000
random.shuffle(longstring)
longstring = " ".join(longstring)
vocab = ["foo bar", "baz"] + ["nothing" + str(i) for i in range(100000)]

tic = time.time()
r = re.compile("|".join(r"\b%s\b" % w for w in vocab))
wordcount = Counter(re.findall(r, longstring))
print(time.time() - tic)

tic = time.time()
# vocabulary entries contain 1 to 2 words, hence ngram_range=(1, 2)
vectorized = CountVectorizer(vocabulary=vocab, ngram_range=(1, 2)).fit([longstring])
counts = vectorized.transform([longstring])
counts = array(counts.sum(axis=0))[0]
wordcount = {vocab[i]: counts[i] for i in range(len(vocab))}
print(time.time() - tic)
How long is your string? And am I right that it does not change constantly, unlike your list of words? A good idea is to iterate over the words in the string once, keep a dictionary of words, and increment the count for each word. With this in place, you can then look up each word from your list in the dictionary and output its value, which is the number of occurrences.
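The idea described above (one pass to build a dictionary, then cheap lookups) can be sketched as follows (names are illustrative, not from the original answer):

```python
def build_index(text):
    # Single pass over the string: tally every word that appears.
    freq = {}
    for word in text.split():
        freq[word] = freq.get(word, 0) + 1
    return freq

index = build_index("foo bar baz bar quux foo bla bla")

# Look up only the words of interest; absent words count as 0.
for w in ["foo", "bar", "baz"]:
    print(w, index.get(w, 0))
```

Since the large string rarely changes, the index can be built once and reused for every query list.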