How to find the number of occurrences of substrings in a string and store them in a Python dictionary?

Question

I have a problem: I don't know how to create a matrix.

I have a dictionary of this type:

dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}

and a file like this:

sp_345_4567 pe_645_4567876  ap_456_45678    pe_645_4556789
sp_345_567  pe_645_45678
pe_645_45678    ap_456_345678
sp_345_56789    ap_456_345
pe_645_45678    ap_456_345678
sp_345_56789    ap_456_345
s45678  f45678  f456789 ap_456_52546135

What I want to do is create a matrix showing where a value from the dictionary is found more than n times in a line.

This is how I want to proceed:

Step 1: create a dictionary that maps each line number to the values found on that line.

Like this:

dictionary = {'1': ['sp_345_4567', 'pe_645_4567876', 'ap_456_45678', 'pe_645_4556789'], '2': ['sp_345_567', 'pe_645_45678'], '3': ['pe_645_45678', 'ap_456_345678'], '4': etc ..

Then I want to compare these values with my first dictionary, dico, and count, for example, how many times the value of the banana key appears on each line (and do the same for all the keys of dico). The problem is that the values in dico are not equal to the values in dictionary, because in the file they are followed by the pattern '_\w+'.

The idea would be to build a final_dict that looks like this, so that I can make a matrix at the end:

final_dict = {'line1': {'Banana': 1, 'Apple': 1, 'Pear': 2}, 'line2': etc ...

Here is my code, which doesn't work:

import pprint
import re
import csv

dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}

dictionary = {}
final_dict = {}
cnt = 0
with open("test.txt") as file :
    reader = csv.reader(file, delimiter ='\t')
    for li in reader:
        grp = li
        number = 1
        for li in reader:
            dictionary[number] = grp
            number += 1
            pprint.pprint(dictionary)
            number_fruit = {}
            for key1, val1 in dico.items():
                for key2, val2 in dictionary.items():
                     if val1 == val2+'_\w+':
                         final_dict[key1] = val2

Thanks for the help

EDIT:

I've tried using a dict comprehension:

import csv
import re

dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}

with open("test.txt") as file :
    reader = csv.reader(file, delimiter ='\t')
    for li in reader:
        pattern = re.search(dico["banana"]+"_\w+", str(li))
        if pattern:
            final_dict = {"line" + str(index + 1):{key:line.count(text) for key, text in dico.items()} for index, line in enumerate(reader)}
        print(final_dict)

But when I print my final dictionary, it only contains 0 for banana and for the other keys:

{'line1': {'banana': 0, 'apple': 0, 'pear': 0}, 'line2': {'banana': 0, 'apple': 0, 'pear': 0}, 'line3': {'banana': 0, 'apple': 0, 'pear': 0}, 'line4': {'banana': 0, 'apple': 0, 'pear': 0}, 'line5': {'banana': 0, 'apple': 0, 'pear': 0}, 'line6': {'banana': 0, 'apple': 0, 'pear': 0}}

So now it looks a bit more like what I wanted, but the occurrence counts don't increase. Maybe my condition should be inside the dict comprehension?

Answer 1

Why it doesn't work

Your test

if val1 == val2+'_\w+':
    ...

doesn't work because == tests plain string equality, and the '_\w+' suffix is not interpreted as a regular expression. One side of the comparison is a string such as "sp_345_4567", while the other is literally "sp_345_\w+", and these two strings are not equal.
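To make the difference concrete, here is a minimal sketch (the names prefix and token are only illustrative, not taken from the question's code). It compares a literal equality test with a real regular-expression match using re.fullmatch:

```python
import re

prefix = "sp_345"       # a value from dico
token = "sp_345_4567"   # a value read from the file

# Plain equality: the right-hand side is just the 10-character string "sp_345_\w+".
print(token == prefix + r"_\w+")                          # False

# A regular expression is only applied when passed to the re module.
print(re.fullmatch(prefix + r"_\w+", token) is not None)  # True
```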

What you could do about it

You might want to test for containment, for example:

if val1 in val2:
    ...

You can check that "sp_345" in "sp_345_4567" returns True.

You might also want to actually count the number of times "sp_345" appears in another string, and you can do this using .count:

"sp_345_567  pe_645_45678".count("sp_345") # returns 1
"sp_345_567  pe_645_45678".count("_") # returns 4

You could also do it using regular expressions, as you've tried to:

import re
pattern = "sp_345_" + "\\w+"

if re.match(pattern, "sp_345_4567"):
    # pattern was found! Do stuff here.
    pass

# alternatively:
print(re.findall(pattern, "sp_345_4567"))
# prints ['sp_345_4567']
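The same idea can be extended to count matches on a whole line. The following is only a sketch: count_with_regex is a hypothetical helper name, and it assumes the entries on a line are separated by whitespace (tabs or spaces). Unlike .count, it requires the prefix to be followed by an underscore and to span a whole entry, so "sp_3456_1" is not counted for the prefix "sp_345".

```python
import re

dico = {"banana": "sp_345", "apple": "ap_456", "pear": "pe_645"}

def count_with_regex(line, prefix):
    """Count whitespace-separated entries of the form <prefix>_<word characters>."""
    # re.escape protects any special characters in the prefix;
    # (?<!\S) and (?!\S) require the match to cover a whole entry.
    pattern = r"(?<!\S)" + re.escape(prefix) + r"_\w+(?!\S)"
    return len(re.findall(pattern, line))

line = "sp_345_4567\tpe_645_4567876\tap_456_45678\tpe_645_4556789"
counts = {fruit: count_with_regex(line, prefix) for fruit, prefix in dico.items()}
print(counts)  # {'banana': 1, 'apple': 1, 'pear': 2}
```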

How you can apply that to build your final_dict

You can rewrite your code in a much simpler way using a dictionary comprehension:

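dico = {
    "banana": "sp_345",
    "apple": "ap_456",
    "pear": "pe_645",
}

with open("test.txt") as file:
    # Iterate over the raw lines (strings) rather than csv.reader rows (lists):
    # list.count() only counts elements equal to the prefix, which gives 0 here,
    # whereas str.count() counts substring occurrences.
    final_dict = {
        "line" + str(index + 1): {key: line.count(text) for key, text in dico.items()}
        for index, line in enumerate(file)
    }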

I'm building an outer dictionary with keys like "line1", "line2", and so on. For each of them, the value is an inner dictionary with keys like "banana" or "apple", and each inner value is the number of times the corresponding prefix appears on that line.

If you want to know how many times banana appears on line 4, you'd use:

print(final_dict["line4"]["banana"])
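If you would rather keep csv.reader, each row is a list of fields, so the fields have to be tested one by one. Below is a minimal sketch under a few assumptions: the file is tab-separated (as the delimiter in the question suggests), io.StringIO stands in for open("test.txt") with the first three sample lines, and count_fields is a hypothetical helper name.

```python
import csv
import io

dico = {"banana": "sp_345", "apple": "ap_456", "pear": "pe_645"}

# Stand-in for open("test.txt"): first three lines of the sample, tab-separated.
sample = io.StringIO(
    "sp_345_4567\tpe_645_4567876\tap_456_45678\tpe_645_4556789\n"
    "sp_345_567\tpe_645_45678\n"
    "pe_645_45678\tap_456_345678\n"
)

def count_fields(fields, prefix):
    """Count the fields that start with the prefix followed by an underscore."""
    return sum(1 for field in fields if field.startswith(prefix + "_"))

reader = csv.reader(sample, delimiter="\t")
final_dict = {
    "line" + str(index): {
        fruit: count_fields(fields, prefix) for fruit, prefix in dico.items()
    }
    for index, fields in enumerate(reader, start=1)
}

print(final_dict["line1"])            # {'banana': 1, 'apple': 1, 'pear': 2}
print(final_dict["line2"]["banana"])  # 1
```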

Note that I would recommend using a list rather than a dictionary to hold the per-line results (list indices start at 0), so that the previous query would become:

print(final_list[3]["banana"])
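As a rough sketch of the list-based variant, and of one possible way to get from there to the matrix mentioned in the question: the lines list stands in for the file contents (first four sample lines, assumed tab-separated), and n = 1 is an arbitrary threshold chosen for illustration.

```python
dico = {"banana": "sp_345", "apple": "ap_456", "pear": "pe_645"}

# Stand-in for the lines of test.txt.
lines = [
    "sp_345_4567\tpe_645_4567876\tap_456_45678\tpe_645_4556789",
    "sp_345_567\tpe_645_45678",
    "pe_645_45678\tap_456_345678",
    "sp_345_56789\tap_456_345",
]

# One dictionary of counts per line; index 0 corresponds to line 1.
final_list = [
    {fruit: line.count(prefix) for fruit, prefix in dico.items()}
    for line in lines
]
print(final_list[3]["banana"])  # 1

# Count matrix: one row per line, one column per fruit (in dico's key order).
fruits = list(dico)
matrix = [[counts[fruit] for fruit in fruits] for counts in final_list]
print(matrix)  # [[1, 1, 2], [1, 0, 1], [0, 1, 1], [1, 1, 0]]

# Boolean matrix marking where a fruit appears more than n times on a line.
n = 1  # illustrative threshold
more_than_n = [[count > n for count in row] for row in matrix]
print(more_than_n[0])  # [False, False, True]
```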

Solution 1 (accepted, 2019-11-04 21:51:01)
