Python matching n-grams from a dictionary to a string of text
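I have a dictionary of 2 and 3 word phrases that I want to search for in RSS feeds. I grab the RSS feeds, process them, and they end up as strings in a list called "documents". I want to check the dictionary below, and if any phrase in the dictionary matches part of a string of text, I want to return the value for that key. I am not sure about the best way to approach this problem. Any suggestions would be greatly appreciated.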
ngramList = {"cash outflows":-1, "pull out":-1,"winding down":-1,"most traded":-1,"steep gains":-1,"military strike":-1,
"resumed operations":+1,"state aid":+1,"bail out":-1,"cut costs":-1,"alleged violations":-1,"under perform":-1,"more than expected":+1,
"pay more taxes":-1,"not for sale":+1,"struck a deal":+1,"cash flow problems":-2}
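I'd merge all the strings into a single regular expression and iterate over the matches found in the text. I'm not 100% sure, but I think the regex implementation in Python is smart enough to build a trie out of all the words, which should give you good performance.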
import re

# escape each phrase so that it is matched literally
strings = [re.escape(s) for s in ngramList]
regex = re.compile(r'\b(' + '|'.join(strings) + r')\b', re.IGNORECASE)
score = 0
for text in documents:
    scores = []
    for m in regex.finditer(text):
        # IGNORECASE can match text whose case differs from the key, so normalize before the lookup
        scores.append(ngramList[m.group(1).lower()])
    # process the scores here, e.g. add their sum to a global variable:
    score += sum(scores)
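To make the idea concrete, here is a minimal self-contained sketch of the single-regex approach that returns the matched phrases and a summed weight per document. The `documents` list is a made-up stand-in for the processed RSS strings, `ngram_weights` is a subset of the question's dictionary, and `score_document` is a hypothetical helper name. Sorting the alternatives longest-first is an extra precaution added here, not something the answer above requires.

```python
import re

# Subset of the phrase weights from the question
ngram_weights = {"cash outflows": -1, "bail out": -1, "state aid": +1,
                 "struck a deal": +1, "cash flow problems": -2}

# Hypothetical stand-in for the processed RSS strings
documents = [
    "The airline struck a deal with lenders after receiving State Aid.",
    "Analysts warn of cash flow problems and a possible bail out.",
    "Nothing relevant here.",
]

# Longest phrases first, so a longer phrase is tried before a shorter one
alternatives = sorted(ngram_weights, key=len, reverse=True)
regex = re.compile(r"\b(" + "|".join(map(re.escape, alternatives)) + r")\b",
                   re.IGNORECASE)

def score_document(text):
    """Return (matched phrases, summed weight) for one document."""
    # Lower-case each match so it can be used as a dictionary key
    matched = [m.group(1).lower() for m in regex.finditer(text)]
    return matched, sum(ngram_weights[p] for p in matched)

results = [score_document(doc) for doc in documents]
for matched, total in results:
    print(total, matched)
```

Returning the matched phrases alongside the total keeps it easy to see which keys fired for a given document.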
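I'm assuming the numbers in that dictionary (-2, -1, +1) are weights, so you need a count of each phrase in each document for them to be useful.
So, in outline: for each document, count how many times each phrase occurs.
Here is some simple code that finds the count of each phrase in a document, which seems to be what you are trying to do: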
text = """
I have a dictionary of 2 and 3 word phrases that I want to search in rss feeds for a match.
I grab the rss feeds, process them and they end up as a string IN a list entitled "documents".
I want to check the dictionary below and if any of the phrases in the dictionary match part of a string of text I want to return the values for the key.
I am not sure about the best way to approach this problem. Any suggestions would be greatly appreciated.
"""
ngrams = ["grab the rss", "approach this", "in"]
import re
counts = {}
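for ngram in ngrams:
    # split the phrase into words and allow any run of whitespace between them
    words = ngram.split()
    pattern = re.compile(r"\s+".join(re.escape(w) for w in words),
                         re.IGNORECASE)
    # there are no \b anchors, so "in" is also counted inside words such as "string"
    counts[ngram] = len(pattern.findall(text))
print(counts)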
Output:
{'grab the rss': 1, 'approach this': 1, 'in': 5}
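If the numbers are indeed weights, the per-phrase counts can be folded into a single score by multiplying each count by its weight. The sketch below is one way to do that, not the answerer's own code: `count_phrase` and `weighted_score` are hypothetical helper names, the sample sentence is invented, and the `\b` anchors are an added assumption that only whole-word matches should count.

```python
import re

# A few weights from the question's dictionary
ngram_weights = {"pull out": -1, "state aid": +1, "cash flow problems": -2}

def count_phrase(phrase, text):
    """Count whole-word, case-insensitive occurrences of phrase in text."""
    # Allow any run of whitespace between the words of the phrase
    body = r"\s+".join(re.escape(w) for w in phrase.split())
    return len(re.findall(r"\b" + body + r"\b", text, re.IGNORECASE))

def weighted_score(text, weights):
    """Return the raw counts and the sum of count * weight over all phrases."""
    counts = {p: count_phrase(p, text) for p in weights}
    return counts, sum(counts[p] * w for p, w in weights.items())

# Made-up sample text
sample = ("Investors pull out as cash flow   problems mount; "
          "more cash flow problems may follow despite state aid.")
counts, total = weighted_score(sample, ngram_weights)
print(counts, total)
```

With the word-boundary anchors, a short phrase such as "in" is no longer counted inside "string", so counts can be lower than with the unanchored pattern.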