
Pythonic way to count occurrences from a list in a string

What's the best way to count the occurrences of strings from a list in a target string? Specifically, I have a list:

string_list = [
    "foo",
    "bar",
    "baz"
]

target_string = "foo bar baz bar"

# Trying to write this function!
count = occurrence_counter(target_string) # should return 4

I'd like to optimize for speed and memory usage, if that makes a difference. In terms of size, I expect that string_list may end up containing several hundred substrings.

Another way, using collections.Counter:

from collections import Counter
word_counts = Counter(target_string.split(' '))
total = sum(word_counts.get(w, 0) for w in string_list)

This works!

def occurrence_counter(target_string):
    return sum(map(lambda x: x in string_list, target_string.split(' ')))

The string is split into tokens, then each token is mapped to True (which counts as 1) if it is in the list and to False (0) otherwise. Finally, the sum function adds up those values.
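As a small illustration of the step described above, the sketch below prints the intermediate booleans before summing them. The sample target string containing "qux" is made up for this example and is not from the question.

```python
string_list = ["foo", "bar", "baz"]
target_string = "foo qux bar bar"  # hypothetical input with one non-matching token

# One boolean per token: True when the token is in the list.
flags = [token in string_list for token in target_string.split(' ')]
print(flags)       # [True, False, True, True]

# bool is a subclass of int, so True adds 1 and False adds 0.
print(sum(flags))  # 3
```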

EDIT: alternatively:

def occurrence_counter(target_string):
    return len(list(filter(lambda x: x in string_list, target_string.split(' '))))

This Python 3 version should work:

In [4]: string_list = [
   ...:     "foo",
   ...:     "bar",
   ...:     "baz"
   ...: ]
   ...: 
   ...: set_of_counted_word = set(string_list)
   ...: 
   ...: def occurrence_counter(target_str, words_to_count=set_of_counted_word):
   ...:     return sum(1 for word in target_str.strip().split()
   ...:                if word in words_to_count)
   ...: 
   ...: 
   ...: for target_string in ("foo bar baz bar", " bip foo bap foo dib baz   "):
   ...:     print("Input: %r -> Count: %i" % (target_string, occurrence_counter(target_string)))
   ...: 
   ...: 
Input: 'foo bar baz bar' -> Count: 4
Input: ' bip foo bap foo dib baz   ' -> Count: 3

You could use a variable to store a running count as you iterate through the words, like so:

def occurrence_counter(target_string, string_list):
    count = 0
    for word in target_string.split():
        # Only count words that appear in the list
        if word in string_list:
            count += 1
    return count

Another solution:

def occurrence_counter(target_string, string_list):
    target_list = target_string.split(' ')
    return len([w for w in target_list if w in string_list])

A combination of sum and str.count:

def counter(s, lst):
    return sum(s.count(sub) for sub in lst)

Note that str.count matches substrings anywhere in the text, not only whole words, and it will not count overlapping occurrences of the same pattern.
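If overlapping occurrences do need to be counted, one possible workaround is a zero-width regex lookahead, which matches at every start position without consuming text. The helper name count_overlapping is made up for this sketch, and it assumes every substring is non-empty.

```python
import re

def count_overlapping(s, lst):
    """Count occurrences of each substring in s, including overlapping ones."""
    total = 0
    for sub in lst:
        # (?=...) is a lookahead: it matches at each position where sub starts,
        # but consumes nothing, so later overlapping starts are still found.
        total += len(re.findall("(?=" + re.escape(sub) + ")", s))
    return total

print("aaaa".count("aa"))                 # 2 (non-overlapping)
print(count_overlapping("aaaa", ["aa"]))  # 3 (overlapping)
```

This scans the target once per substring, so it has the same scaling as the sum/str.count version.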

You could use a Trie to convert your substrings to a regex pattern (e.g. (?:ba[rz]|foo)) and scan your target_string with it:

import re
from trie import Trie

trie = Trie()

substrings = [
    "foo",
    "bar",
    "baz"
]
for substring in substrings:
    trie.add(substring)
print(trie.pattern())
# (?:ba[rz]|foo)

target_string = "foo bar baz bar"
print(len(re.findall(trie.pattern(), target_string)))
# 4

The required library is here: trie.py

It should be much faster than scanning the whole target_string once for each substring, but it might not return the desired result for overlapping substrings. It returns 2 for ["foo", "bar", "foobar"] and "foobar".
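For readers without the trie module, a plain regex alternation built with the standard library is a simpler stand-in. It is not the trie-optimized pattern from this answer, only a sketch of the same single-pass idea. The function name count_with_alternation is hypothetical, and the substring list is assumed to be non-empty.

```python
import re

def count_with_alternation(target_string, substrings):
    # Escape each substring so regex metacharacters are matched literally,
    # then join them into one alternation such as "foo|bar|baz".
    pattern = re.compile("|".join(re.escape(s) for s in substrings))
    # findall returns non-overlapping matches found in a single left-to-right scan.
    return len(pattern.findall(target_string))

print(count_with_alternation("foo bar baz bar", ["foo", "bar", "baz"]))  # 4
print(count_with_alternation("foobar", ["foo", "bar", "foobar"]))        # 2
```

Alternatives are tried in list order, so in the second call "foo" and then "bar" are matched and "foobar" is never counted as a whole.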

A related question was "Speed up millions of regex replacements in Python 3": here's an answer with sets and one with a trie regex.

I am not sure this is the most Pythonic way, but you can try it:

string_list_B = target_string.split(" ")
# Equivalent to set(string_list) & set(string_list_B): the distinct words present in both
commonalities = set(string_list) - (set(string_list) - set(string_list_B))
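The set expression above yields the distinct words shared by both sides, not the number of occurrences. One way to get from there to a count, sketched here with collections.Counter (an addition for illustration, not part of this answer), is:

```python
from collections import Counter

string_list = ["foo", "bar", "baz"]
target_string = "foo bar baz bar"

tokens = target_string.split(" ")
# Distinct words that appear both in the list and in the target string.
common = set(string_list) & set(tokens)
# Tally how often each token occurs, then add up the tallies of the common words.
counts = Counter(tokens)
total = sum(counts[word] for word in common)
print(total)  # 4
```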

