
Python: matching various keywords from a dictionary

I have a complex text in which I am categorizing different keywords stored in a dictionary:

    text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'

    sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}

This finds my keywords and categorizes them, with some limitations:

    import re

    pattern = r'[a-zA-Z0-9]+'

    [cat for cat in sector if any(x in re.findall(pattern,text) for x in sector[cat])]

The limitations that I cannot solve are:

  1. Keywords that contain a space, such as "Drug Delivery", are not recognized and therefore not categorized.

  2. I was not able to make the pattern case insensitive, so words like MEDICINE are not recognized. I tried adding (?i) to the pattern, but it doesn't work.

  3. The categorized keywords go into a pandas DataFrame, but they are printed inside []. I tried looping over the result again to strip the brackets, but they are still there.

Data to pandas DataFrame:

    ind_list = []
    for site in url_list:
        ind = [cat for cat in indication if any(x in re.findall(pattern,soup_string) for x in indication[cat])]
        ind_list.append(ind)

    websites['Indication'] = ind_list

Current output:

Website                                  Sector                              Sub-sector                                 Therapeutical Area Focus URL status
0     url3.com                              [med tech]                                      []                                                 []          []         []
1     www.url1.com                    [med tech, services]                                      []                       [oncology, gastroenterology]          []         []
2     www.url2.com                    [med tech, services]                                      []                                        [orthopedy]          []         []

The output contains [] entries that I'd like to avoid.

Can you help me with these points?

Thanks!

findall is pretty wasteful here, since the string is tokenized again for every keyword.

If you just want to test whether a keyword occurs in the string:

[cat for cat in sector if any(re.search(word, text, re.I) for word in sector[cat])]
# Output: ['med tech']
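
One thing to watch with re.search is that each keyword is treated as a regex, so a keyword containing characters like "+" or "." would change the pattern, and "medicine" would also match inside a longer word. A minimal sketch of a stricter variant follows; build_pattern is a hypothetical helper name, and the word-boundary behaviour is an assumption about what is wanted here:

```python
import re

text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}

def build_pattern(keywords):
    # re.escape keeps regex metacharacters in a keyword literal;
    # \b prevents a keyword from matching inside a longer word.
    alternatives = "|".join(re.escape(word) for word in keywords)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

# Compile one pattern per category, once, instead of one search per keyword.
patterns = {cat: build_pattern(words) for cat, words in sector.items()}
matched = [cat for cat, pat in patterns.items() if pat.search(text)]
print(matched)  # ['med tech']
```

Compiling once per category also avoids rebuilding the pattern for every page when many texts are scanned.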

Here are some hints on the problems that can readily be spotted:

  1. Why can't keywords that contain a space, like "Drug Delivery", be matched? Because the regex pattern r'[a-zA-Z0-9]+' does not match a space. You can change it to r'[a-zA-Z0-9 ]+' (a space added after 9) if you also want to match a space. Note that re.findall then returns whole runs of words, such as ' Drug Delivery and 3D Printing in Medicine', so a test like x in re.findall(...) still fails for a phrase that sits inside a longer run. To support other kinds of whitespace (e.g. \t, \n), the pattern needs further changes.

  2. Why isn't case-insensitive matching supported? Your code fragment any(x in re.findall(pattern,text) for x in sector[cat]) requires x to have the same upper/lower case both in the result of re.findall and in sector[cat]. This constraint cannot be bypassed even by setting flags=re.I in the re.findall() call, because the flag only affects how the pattern matches, not the x in ... comparison. I suggest converting everything to the same case before checking, for example to lower case: any(x.lower() in re.findall(pattern,text.lower()) for x in sector[cat]). Here .lower() is applied to both text and x.

With the above 2 changes, you should be able to capture some of the categorized keywords.
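
To see what the membership test actually compares against, it can help to print the tokens. A small sketch using the text from the question (the variable names words_only and with_spaces are only for illustration):

```python
import re

text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'

# Tokens from the original pattern and from the pattern with a space added.
words_only = re.findall(r'[a-zA-Z0-9]+', text.lower())
with_spaces = re.findall(r'[a-zA-Z0-9 ]+', text.lower())

print(words_only)
print(with_spaces)

# A single-word keyword is found once both sides are lower-cased.
print('medicine' in words_only)        # True

# `in` on a list compares whole elements, so a phrase inside a longer run is missed...
print('drug delivery' in with_spaces)  # False

# ...while a substring test on each run finds it.
print(any('drug delivery' in chunk for chunk in with_spaces))  # True
```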

Actually, for this particular case, you may not need regular expressions and re.findall at all. You can simply check e.g. sector[cat][i].lower() in text.lower(). That is, change the list comprehension as follows:

[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]

Edit

Test run with a 2-word phrase:

text = 'drug delivery'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]

Output:       # The category is found even though the dictionary values use different upper/lower case
['med tech']

text = 'Drug Store fast delivery'
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]

Output:    # Correctly does not match when extra words come in between

[]
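
For the third point in the question (the [] shown in the DataFrame), one possible approach is to join each list into a plain string before it goes into the column, so an empty result becomes an empty string. This is only a sketch under the assumption that a comma-separated string is acceptable; categorize and pages are hypothetical names, and the "services" keywords are made up for the example:

```python
sector = {
    "med tech": ['Drug Delivery', '3D printing', 'medicine'],
    "services": ['consulting', 'outsourcing'],  # invented keywords, for illustration only
}

def categorize(text, categories):
    """Return the matching category names as one comma-separated string."""
    lowered = text.lower()
    hits = [cat for cat, words in categories.items()
            if any(word.lower() in lowered for word in words)]
    # An empty list becomes "", so nothing is displayed as [] later on.
    return ", ".join(hits)

pages = ['Drug Delivery and consulting', 'Nothing relevant here']
ind_list = [categorize(page, sector) for page in pages]
print(ind_list)  # ['med tech, services', '']
```

Assigning ind_list to the column as in the question (websites['Indication'] = ind_list) would then store strings rather than lists.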

Could you try an approach other than regex?
I would suggest difflib when the words you need to match are similar but not identical.
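
A rough sketch of how difflib could be applied here, assuming the goal is to tolerate small misspellings. The sample text, the 0.85 cutoff, and the choice to compare single words and adjacent word pairs are all assumptions made for illustration:

```python
import difflib
import re

keywords = ['drug delivery', '3d printing', 'medicine', 'medical technology', 'bio cell']
text = 'New medecine and drug delivry methods'  # deliberately misspelled sample

tokens = re.findall(r'[a-z0-9]+', text.lower())
# Candidate phrases: every single word plus every pair of adjacent words.
candidates = tokens + [' '.join(pair) for pair in zip(tokens, tokens[1:])]

found = set()
for cand in candidates:
    # get_close_matches returns at most n keywords whose similarity ratio >= cutoff.
    found.update(difflib.get_close_matches(cand, keywords, n=1, cutoff=0.85))

print(sorted(found))  # ['drug delivery', 'medicine']
```

A lower cutoff catches more typos but also increases false matches, so the value is worth tuning on real data.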
