Python: matching n-grams from a dictionary against a string of text

Question

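I have a dictionary of 2- and 3-word phrases that I want to search for in RSS feeds. I grab the RSS feeds and process them, and they end up as strings in a list called "documents". I want to check the dictionary below, and if any phrase in the dictionary matches part of a string of text, I want to return the value for that key. I am not sure about the best way to approach this problem. Any suggestions would be greatly appreciated.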

ngramList = {"cash outflows": -1, "pull out": -1, "winding down": -1, "most traded": -1, "steep gains": -1, "military strike": -1,
             "resumed operations": +1, "state aid": +1, "bail out": -1, "cut costs": -1, "alleged violations": -1, "under perform": -1, "more than expected": +1,
             "pay more taxes": -1, "not for sale": +1, "struck a deal": +1, "cash flow problems": -2}

Answer 1

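I would combine all the strings into a single regex and iterate over the matches it finds in the text. I am not 100% sure, but I think Python's regex implementation is smart enough to handle all the words together efficiently, which should give you good performance.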

import re

# Build one alternation from all phrases; escape them so they match literally.
strings = [re.escape(s) for s in ngramList]
regex = re.compile(r'\b(' + '|'.join(strings) + r')\b', re.IGNORECASE)

score = 0
for text in documents:
    # Lowercase each match: the search ignores case, but the keys are lowercase.
    scores = [ngramList[m.group(1).lower()] for m in regex.finditer(text)]
    # Process the scores here, e.g. add their sum to a running total:
    score += sum(scores)
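
For a self-contained illustration of the same idea, here is a minimal sketch. The function name `score_document`, the shortened dictionary and the sample sentence are made up for the example. Sorting the phrases longest first is an extra precaution of mine, so that if one phrase is a prefix of another, the longer one is tried first.

```python
import re

# Shortened copy of the dictionary from the question (illustrative subset).
ngram_weights = {"cash outflows": -1, "bail out": -1, "state aid": +1,
                 "struck a deal": +1, "cash flow problems": -2}

# Longest phrases first: an alternation tries its branches left to right.
phrases = sorted(ngram_weights, key=len, reverse=True)
pattern = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
    re.IGNORECASE,
)

def score_document(text):
    """Return (matched phrases, summed weights) for one document."""
    # Lowercase the matches so they can be used as dictionary keys.
    matches = [m.group(1).lower() for m in pattern.finditer(text)]
    return matches, sum(ngram_weights[p] for p in matches)

sample = "The airline struck a deal for State Aid after the bail out."
matched, total = score_document(sample)
print(matched, total)
```

Returning the matched phrases alongside the total makes it easy to see which keys contributed to a document's score.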

Answer 2

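I am assuming the numbers in that dictionary (-2, -1, +1) are weights, so you need to count each phrase in each document for them to be useful.

So the pseudocode for doing this would be:

Split the document into lines, then split each line into a list of words.
Then loop over each word in a line, looking forwards and backwards in the line to generate the various phrases.
As each phrase is generated, record it in a global dictionary that holds the phrase and its number of occurrences.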
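
The steps above can be sketched roughly as follows. This is only one reading of the pseudocode: instead of looking both forwards and backwards from each word, it looks forwards only, which already yields every 2- and 3-word phrase once. The names `count_ngrams` and `sizes` are invented for the example.

```python
from collections import Counter

def count_ngrams(document, sizes=(2, 3)):
    """Count every phrase of the given word lengths, line by line."""
    counts = Counter()
    for line in document.splitlines():
        words = line.lower().split()
        for n in sizes:
            # Slide a window of n words across the line.
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

counts = count_ngrams("Talks resumed.\nThe bank will cut costs and cut costs again.")
print(counts["cut costs"])
```

Punctuation stays attached to the words in this sketch, so you would probably want to strip it before splitting. Phrases never span a line break, as in the pseudocode.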

Here is some simple code that finds the count of each phrase in a document, which seems to be what you are trying to do:

text = """
I have a dictionary of 2 and 3 word phrases that I want to search in rss feeds for a match. 

I grab   the rss feeds, process them and they end up as a string IN a list entitled "documents". 
I want to check the dictionary below and if any of the phrases in the dictionary match part of a string of text I want to return the values for the key. 
I am not sure about the best way to approach this problem. Any suggestions would be greatly appreciated.
"""

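import re

ngrams = ["grab the rss", "approach this", "in"]

counts = {}
for ngram in ngrams:
    # Allow any run of whitespace between the words of a phrase.
    words = [re.escape(w) for w in ngram.split()]
    pattern = re.compile(r"\s+".join(words), re.IGNORECASE)
    # No \b anchors here, so "in" is also counted inside "string".
    counts[ngram] = len(pattern.findall(text))

print(counts)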

Output:

{'grab the rss': 1, 'approach this': 1, 'in': 5}
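
If the numbers in your dictionary are indeed weights, the counts can be turned into a single score per document by multiplying each count by its weight. Here is a minimal sketch under that assumption. The name `weighted_score` and the sample text are made up for illustration, and this version adds `\b` anchors so that phrases only match on word boundaries.

```python
import re

def weighted_score(text, weights):
    """Sum count * weight over all phrases found in text."""
    total = 0
    for phrase, weight in weights.items():
        # Escape each word and allow any run of whitespace between words.
        words = [re.escape(w) for w in phrase.split()]
        pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
        total += weight * len(pattern.findall(text))
    return total

weights = {"cut costs": -1, "state aid": +1, "cash flow problems": -2}
doc = "Cash flow  problems forced the firm to cut costs, then cut costs again."
print(weighted_score(doc, weights))
```

Compiling one pattern per phrase is simple but rescans the text once for each phrase, so for a large dictionary a single combined pattern may be the better choice.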

Answer 1
2 2013-10-06 19:33:27

Answer 2
2 Accepted 2013-10-06 20:08:51
