
Count occurrences of a couple of specific words

I have a list of words, let's say: ["foo", "bar", "baz"] and a large string in which these words may occur.

For every word in the list I currently call the "string".count("word") method. This works OK, but seems rather inefficient: for every extra word added to the list, the entire string must be iterated over one more time.

Is there any better method to do this, or should I implement a custom method which iterates over the large string a single time, checking at each character whether one of the words in the list has been reached?

To be clear:

  • I want the number of occurrences per word in the list.
  • The string to search in is different each time and consists of about 10000 chars
  • The list of words is constant
  • The words in the list of words can contain whitespace
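
For reference, the baseline described in the question can be sketched roughly as below. The function name count_each is only illustrative and not from the original post. Note that str.count counts non-overlapping substring matches, so "foo" is also counted inside "somefoothing".

```python
def count_each(words, text):
    # One full scan of `text` per word; str.count matches substrings,
    # not whole words, and never counts overlapping matches.
    return {word: text.count(word) for word in words}

counts = count_each(["foo", "bar", "baz"], "foo bar somefoothing foo bar")
print(counts)
```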

Make a dict-typed frequency table for your words, then iterate over the words in your string.

import re

vocab = ["foo", "bar", "baz"]
s = "foo bar baz bar quux foo bla bla"

# Start every vocabulary word at zero, then tally matching tokens in one pass.
wordcount = dict((x, 0) for x in vocab)
for w in re.findall(r"\w+", s):
    if w in wordcount:
        wordcount[w] += 1

Edit: if the "words" in your list contain whitespace, you can instead build an RE out of them:

import re
from collections import Counter

vocab = ["foo bar", "baz"]
r = re.compile("|".join(r"\b%s\b" % w for w in vocab))
wordcount = Counter(re.findall(r, s))

Explanation: this builds the RE r'\bfoo bar\b|\bbaz\b' from the vocabulary. For the string s above, findall then finds the list ['foo bar', 'baz'], and the Counter (Python 2.7+) counts the occurrences of each distinct element in it. Make sure your list of words does not contain characters that are special to REs, such as ()[]\ (or escape them with re.escape first).
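
If the vocabulary may contain RE metacharacters, one possible variant is to pass each entry through re.escape before joining. This is only a sketch: the helper name count_phrases and the sample vocabulary are made up for illustration, and the \b anchors only behave as whole-word anchors when an entry starts and ends with a word character.

```python
import re
from collections import Counter

def count_phrases(vocab, text):
    # re.escape makes characters such as "+" and "." match literally.
    pattern = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in vocab))
    return Counter(pattern.findall(text))

# Without escaping, "a+b" would be read as "one or more a, then b" and match "aab".
counts = count_phrases(["a+b", "3.14"], "a+b aab 3.14 3x14 a+b")
print(counts)
```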

Presuming the words need to be found separately (that is, you want to count words as produced by str.split()):

Edit: as suggested in the comments, a Counter is a good option here:

from collections import Counter

def count_many(needles, haystack):
    count = Counter(haystack.split())
    return {key: count[key] for key in count if key in needles}

Which runs as so:

count_many(["foo", "bar", "baz"], "testing somefoothing foo bar baz bax foo foo foo bar bar test bar test")
{'baz': 1, 'foo': 4, 'bar': 4}

Note that in Python 2.6 and earlier you will need to use return dict((key, count[key]) for key in count if key in needles) instead, because those versions lack dict comprehensions.

Of course, another option is to simply return the whole Counter object and only get the values you need when you need them, as it may not be a problem to have the extra values, depending on the situation.
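
A minimal sketch of that last option, assuming whitespace-separated words (the name count_all is hypothetical). A Counter returns 0 for keys it has never seen, so lookups for absent words need no special handling.

```python
from collections import Counter

def count_all(haystack):
    # Count every whitespace-separated word; callers pick out what they need.
    return Counter(haystack.split())

counts = count_all("testing somefoothing foo bar baz bax foo foo foo bar bar test bar test")
foo_count = counts["foo"]   # word that occurs in the string
quux_count = counts["quux"] # absent word: Counter yields 0 instead of raising KeyError
print(foo_count, quux_count)
```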

Old answer:

from collections import defaultdict

def count_many(needles, haystack):
    count = defaultdict(int)
    for word in haystack.split():
        if word in needles:
            count[word] += 1
    return count

Which results in:

count_many(["foo", "bar", "baz"], "testing somefoothing foo bar baz bax foo foo foo bar bar test bar test")
defaultdict(<class 'int'>, {'baz': 1, 'foo': 4, 'bar': 4})

If you greatly object to getting a defaultdict back (which you shouldn't, as it functions exactly the same as a dict when accessing), then you can do return dict(count) instead to get a normal dictionary.
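
As a sketch of that variant (count_many_plain is a made-up name), the version below returns a plain dict and also converts needles to a set, which may help when the word list is long. One difference worth knowing: a plain dict raises KeyError for a word that never occurred, whereas a defaultdict would return 0 and insert the key, so .get is used for the lookup here.

```python
from collections import defaultdict

def count_many_plain(needles, haystack):
    wanted = set(needles)  # set membership tests are O(1) on average
    count = defaultdict(int)
    for word in haystack.split():
        if word in wanted:
            count[word] += 1
    return dict(count)     # plain dict: only words that actually occurred

result = count_many_plain(["foo", "bar", "baz"], "foo bar foo test")
baz_count = result.get("baz", 0)  # "baz" never occurred, so fall back to 0
print(result, baz_count)
```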

The Counter method doesn't work well for large vocabularies. In the example below, CountVectorizer is many times faster:

import time
import random

longstring = ["foo", "bar", "baz", "qux", "thud"] * 100000
random.shuffle(longstring)
longstring = " ".join(longstring)
vocab = ["foo bar", "baz"] + ["nothing"+str(i) for i in range(100000)]

Testing:

import re
from collections import Counter

tic = time.time()
r = re.compile("|".join(r"\b%s\b" % w for w in vocab))
wordcount = Counter(re.findall(r, longstring))
print(time.time() - tic)

870 seconds

from sklearn.feature_extraction.text import CountVectorizer
from numpy import array

tic = time.time()
vectorized = CountVectorizer(vocabulary=vocab, ngram_range=(1, 2)).fit([longstring])  # vocabulary entries contain 1 to 2 words
counts = vectorized.transform([longstring])
counts = array(counts.sum(axis=0))[0]
wordcount = {vocab[i]: counts[i] for i in range(len(vocab))}
print(time.time() - tic)

1.17 seconds
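
If scikit-learn is not available, a similar idea can be approximated in plain Python: generate the 1- and 2-word n-grams of the string and test each against a set built from the vocabulary, so the cost per n-gram does not grow with the vocabulary size. This is a rough sketch and not what CountVectorizer does internally; count_ngrams and max_words are hypothetical names, tokens are split on whitespace only, and overlapping phrases are all counted.

```python
from collections import Counter

def count_ngrams(vocab, text, max_words=2):
    wanted = set(vocab)  # one hash lookup per n-gram, regardless of vocabulary size
    tokens = text.split()
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in wanted:
                counts[gram] += 1
    return counts

counts = count_ngrams(["foo bar", "baz"], "foo bar baz bar quux foo bla bla")
print(counts)
```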

How long is your string? And do I understand correctly that it is not constantly changing, as your list of strings is?

A good idea is to iterate over the words in the string, keep a dictionary keyed by word, and increment the count for each word as you go. With this in place, you can then look up each word from the list in the dictionary and output its value, which is the number of occurrences.
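
The idea above might look something like this, assuming the words are separated by whitespace (build_table and lookup are illustrative names only):

```python
def build_table(text):
    # Single pass over the string: one dictionary entry per distinct word.
    table = {}
    for word in text.split():
        table[word] = table.get(word, 0) + 1
    return table

def lookup(words, table):
    # Words from the list that never occurred get a count of 0.
    return {word: table.get(word, 0) for word in words}

table = build_table("foo bar baz bar quux foo bla bla")
result = lookup(["foo", "bar", "baz", "nope"], table)
print(result)
```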
