
Find number of occurrences of a list in a string using Python

I have a list containing several thousand short strings and a .csv file containing several hundred thousand short strings. All list elements are unique. For each string in the .csv file, I need to check whether it contains more than one list element.

For example, I have a string:

example_string = "mermaids have braids and tails"

And a list:

example_list = ["me", "ve", "az"]

Clearly the example string contains more than one list item: me and ve. My code needs to indicate this. However, if the list were

example_list = ["ai", "az", "nr"]

then only one list element (ai) is contained.

I think that the following code will check whether each line in my .csv file contains at least one list element. However, that doesn't tell me if it contains more than one different list element.

with open("my_file_of_strings.csv", "r") as data:
    for line in data:
        # True if at least one list element occurs somewhere in the line
        if any(item in line for item in my_list):
            pass  # Do something

with open("my_file_of_strings.csv", "r") as data:
    for line in data:
        if any(item in i for i in line.split() for item in my_list):
            ...

If you need to count them, use sum():

with open("my_file_of_strings.csv", "r") as data:
    for line in data:
        # Counts every (word, item) pair where the item occurs in the word
        result = sum(item in i for i in line.split() for item in my_list)

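
One thing to watch with the sum() version: because it iterates over words, the same list element is counted once for every word it appears in, so the total can exceed the number of different list elements present. A minimal sketch of the difference follows; the helper names per_word_hits and distinct_hits are made up for illustration and are not part of the answer above.

```python
def per_word_hits(line, substrings):
    """Count (word, substring) pairs, as the sum() expression above does."""
    return sum(item in word for word in line.split() for item in substrings)


def distinct_hits(line, substrings):
    """Count how many different substrings occur anywhere in the line."""
    return sum(1 for item in substrings if item in line)


line = "mermaids meet"
items = ["me", "az"]
print(per_word_hits(line, items))  # "me" is counted once per word
print(distinct_hits(line, items))  # "me" is counted once in total
```

If the goal is "more than one different list element", comparing distinct_hits(line, my_list) > 1 is probably closer to what the question asks.
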
def contains_multiple(string, substrings):
    count = 0

    for substring in substrings:
        if substring in string:
            count += 1
            if count > 1:
                return True

    return False

for line in data:
    if contains_multiple(line, my_list):
        ...

Not short, but it will exit early as soon as it finds the 2nd match. That may or may not be an important optimization.
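
For what it's worth, the same early exit can be written more compactly with a generator and itertools.islice. This is only a sketch of an alternative, not a replacement for the function above; the name contains_at_least and the parameter n are assumptions introduced here.

```python
from itertools import islice


def contains_at_least(string, substrings, n=2):
    """Return True once n different substrings have been found in string."""
    # The generator is lazy, so checking stops as soon as islice has n hits.
    hits = (s for s in substrings if s in string)
    return len(list(islice(hits, n))) == n


example_string = "mermaids have braids and tails"
print(contains_at_least(example_string, ["me", "ve", "az"]))
print(contains_at_least(example_string, ["ai", "az", "nr"]))
```

With the strings from the question, the first call finds me and ve and stops there; the second only ever finds ai.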

Something like:

with open("my_file_of_strings.csv", "r") as data:
    for line in data:
        # Collect the list elements found in this line and count them
        if len(set(item for item in my_list if item in line)) > 1:
            pass  # Do something

I think the other solutions are better for your purpose, but in case you want to keep track of the number of hits and which elements they were, you could try this:

In [14]: from collections import defaultdict

In [15]: example_list = ["me", "ve", "az"]

In [16]: example_string = "mermaids have braids and tails"

In [17]: d = defaultdict(int)

In [18]: for i in example_list:
   ....:     d[i] += example_string.count(i)
   ....:

In [19]: d
Out[19]: defaultdict(<type 'int'>, {'me': 1, 'az': 0, 've': 1})

And then to get the total number of unique matches:

In [20]: matches = sum(1 for v in d.values() if v)

In [21]: matches
Out[21]: 2
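
To apply the same bookkeeping to every line of the file rather than a single string, something along these lines might work. It is a rough sketch: report_matches and its minimum parameter are hypothetical names, and sample_lines stands in for the opened .csv file.

```python
def report_matches(lines, substrings, minimum=2):
    """Yield (line, hits) for lines containing at least `minimum` different substrings."""
    for line in lines:
        line = line.rstrip("\n")
        # Keep the matching elements themselves, not just their count
        hits = [s for s in substrings if s in line]
        if len(hits) >= minimum:
            yield line, hits


example_list = ["me", "ve", "az"]
sample_lines = ["mermaids have braids and tails\n", "no hits here\n"]
for line, hits in report_matches(sample_lines, example_list):
    print(line, hits)
```

Since an open file object iterates line by line, passing the file handle in place of sample_lines should behave the same way.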

