
Python - count word frequency of string from list, number of words from list varies

I am trying to create a program that runs through a list of mental health terms, looks in a research abstract, and counts the number of times each word or phrase appears. I can get this to work with single words, but I'm struggling to do this with multi-word terms. I tried using NLTK ngrams too, but since the number of words in the terms from the mental health list varies (i.e., not all terms from the mental health list are bigrams or trigrams), I couldn't get that to work either.

I want to emphasize that I know splitting on whitespace will only allow single words to be counted; I'm just stuck on how to deal with a varying number of words per term from my list when counting in the abstract.

Thanks!

from collections import Counter

abstracts = ['This is a mental health abstract about anxiety and bipolar disorder as well as other things.',
             'While this abstract is not about ptsd or any trauma-related illnesses, it does have a mental health focus.']

for x2 in abstracts:

    mh_terms = ['bipolar disorder', 'anxiety', 'substance abuse disorder',
                'ptsd', 'schizophrenia', 'mental health']

    c = Counter(s.lower().replace('.', '') for s in x2.split())
    for term in mh_terms:
        term = term.replace(',','')
        term = term.replace('.','')
        xx = (term, c.get(term, 0))

    mh_total_occur = sum(c.get(v, 0) for v in mh_terms)
    print(mh_total_occur)

From my example, both abstracts are getting a count of 1, but I want a count of two.

The problem is that you will never match "mental health", because you are only counting occurrences of single words split by the " " character.
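A quick way to see this with the same Counter approach (a minimal standalone sketch):

from collections import Counter

c = Counter('it does have a mental health focus'.split())
print(c['mental'], c['health'])  # 1 1 -- each token is a separate key
print(c['mental health'])        # 0 -- the two-word key never exists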

I don't know if using a counter is the right solution here. If you did need a highly scalable and indexable solution, then n-grams are probably the way to go, but for small to medium problems it should be pretty quick to use regex pattern matching.

import re

abstracts = [
    'This is a mental health abstract about anxiety and bipolar disorder as well as other things.',
    'While this abstract is not about ptsd or any trauma-related illnesses, it does have a mental health focus.'
]

mh_terms = [
    'bipolar disorder', 'anxiety', 'substance abuse disorder',
    'ptsd', 'schizophrenia', 'mental health'
]

def _regex_word(text):
    """ Wrap text in word-boundary anchors so only whole words match """
    return r'\b{}\b'.format(text)

def _normalize(text):
    """ Lowercase and remove any non-alpha/numeric/space character """
    return re.sub(r'[^a-z0-9 ]', '', text.lower())


normed_terms = [_normalize(term) for term in mh_terms]


for raw_abstract in abstracts:
    print('--------')
    normed_abstract = _normalize(raw_abstract)

    # Search for all occurrences of chosen terms
    found = {}
    for norm_term in normed_terms:
        pattern = _regex_word(norm_term)
        found[norm_term] = len(re.findall(pattern, normed_abstract))
    print('found = {!r}'.format(found))
    mh_total_occur = sum(found.values())
    print('mh_total_occur = {!r}'.format(mh_total_occur))

I tried to add helper functions and comments to make it clear what I was doing.
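Running this against the two example abstracts should print something like the following (the dict ordering shown assumes Python 3.7+ insertion-ordered dicts; note the first abstract actually contains three of the terms):

--------
found = {'bipolar disorder': 1, 'anxiety': 1, 'substance abuse disorder': 0, 'ptsd': 0, 'schizophrenia': 0, 'mental health': 1}
mh_total_occur = 3
--------
found = {'bipolar disorder': 0, 'anxiety': 0, 'substance abuse disorder': 0, 'ptsd': 1, 'schizophrenia': 0, 'mental health': 1}
mh_total_occur = 2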

Using the \b regex control character is important in general use cases because it prevents possible search terms like "miss" from matching words like "dismiss".
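If you do want to revisit the NLTK n-gram route from the question, the varying term length can be handled by generating n-grams for each distinct length found in the term list. A minimal sketch, assuming NLTK is installed (nltk.util.ngrams is the standard helper; the surrounding code is illustrative):

from collections import Counter
from nltk.util import ngrams

mh_terms = ['bipolar disorder', 'anxiety', 'substance abuse disorder',
            'ptsd', 'schizophrenia', 'mental health']
abstract = ('This is a mental health abstract about anxiety and '
            'bipolar disorder as well as other things.')

tokens = abstract.lower().replace('.', '').split()
# One n-gram pass per distinct term length (1, 2 and 3 here), so
# single-word, two-word and three-word terms are all counted.
term_lengths = {len(term.split()) for term in mh_terms}
counts = Counter(' '.join(gram)
                 for n in term_lengths
                 for gram in ngrams(tokens, n))
print(sum(counts[term] for term in mh_terms))  # prints 3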
