
Python matching n-grams from a dictionary to a string of text

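I have a dictionary of 2- and 3-word phrases that I want to search for in RSS feeds. I grab the RSS feeds and process them, and they end up as strings in a list called "documents". I want to check the dictionary below, and if any phrase in the dictionary matches part of a string of text, I want to return the value for that key. I am not sure about the best way to approach this problem. Any suggestions would be greatly appreciated.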

ngramList = {"cash outflows":-1, "pull out":-1,"winding down":-1,"most traded":-1,"steep gains":-1,"military strike":-1,
          "resumed operations":+1,"state aid":+1,"bail out":-1,"cut costs":-1,"alleged violations":-1,"under perform":-1,"more than expected":+1,
         "pay more taxes":-1,"not for sale":+1,"struck a deal":+1,"cash flow problems":-2}

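I'd combine all the strings into one single regular expression and iterate over the matches found in the text. I'm not 100% sure, but I think the regex implementation in Python is smart enough to handle the combined alternation efficiently, which should give you good performance.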

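import re

# Longest phrases first, so a longer phrase wins over a shorter one at the same position.
strings = [re.escape(s) for s in sorted(ngramList, key=len, reverse=True)]
regex = re.compile(r'\b(' + '|'.join(strings) + r')\b', re.IGNORECASE)
score = 0
for text in documents:
    scores = []
    for m in regex.finditer(text):
        # With re.IGNORECASE the match may be mixed case; the dictionary keys are lowercase.
        scores.append(ngramList[m.group(1).lower()])
    # process the scores here, e.g. add their sum to a global variable:
    score += sum(scores)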
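
To return the matched phrases and their values per document rather than one running total, the same idea can be wrapped in a small function. This is only a sketch: the `weights` and `documents` below are made-up sample data, and `match_weights` is a hypothetical helper name, not something from the question.

```python
import re

# Hypothetical sample data; the real dictionary and documents come from the question.
weights = {"bail out": -1, "state aid": +1, "cash flow problems": -2}
documents = [
    "The bank asked for State Aid after reporting cash flow problems.",
    "No bail out is planned.",
]

# Longest phrases first, so a longer phrase is preferred over a shorter one
# that starts at the same position.
alternatives = sorted(weights, key=len, reverse=True)
pattern = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in alternatives) + r")\b",
    re.IGNORECASE,
)

def match_weights(text):
    """Return (phrase, weight) pairs for every dictionary phrase found in text."""
    pairs = []
    for m in pattern.finditer(text):
        phrase = m.group(1).lower()  # normalise case to match the dictionary keys
        pairs.append((phrase, weights[phrase]))
    return pairs

results = [match_weights(doc) for doc in documents]
totals = [sum(weight for _, weight in pairs) for pairs in results]
print(results)
print(totals)
```

Keeping the pairs per document makes it easy to see which phrases drove each score before summing them.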

I'm assuming the numbers in that dictionary (-2, -1, +1) are weights, so you need a count of each phrase in each document for them to be useful.

So the pseudocode for doing this would be:

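  1. Split the document into lines, and then split each line into a list of words.
  2. Loop over each word in a line, looking forward and backward within the line to generate the various phrases.
  3. As each phrase is generated, keep a global dictionary containing the phrase and the number of times it occurs.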
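
The steps above can be sketched roughly as follows. It simplifies step 2 to a forward-only sliding window (which still produces every contiguous phrase), tokenises with `\w+` so punctuation is dropped, and uses a made-up sample string; `count_phrases` is a hypothetical name chosen for illustration.

```python
import re
from collections import Counter

def count_phrases(document, sizes=(2, 3)):
    """Count every 2- and 3-word phrase in the document, line by line."""
    counts = Counter()
    for line in document.splitlines():          # step 1: split into lines
        words = re.findall(r"\w+", line.lower())  # step 1: split into words
        for n in sizes:                         # step 2: slide a window of n words
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1  # step 3: tally the phrase
    return counts

# Hypothetical two-line document.
sample = "Talks on state aid resumed.\nState aid was approved."
phrase_counts = count_phrases(sample)
print(phrase_counts["state aid"])
print(phrase_counts["on state aid"])
```

Because phrases are built per line, a phrase that spans a line break is not counted; whether that matters depends on how the feed text is wrapped.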

Here is some simple code that finds the count of each phrase in a document, which seems to be what you are trying to do:

text = """
I have a dictionary of 2 and 3 word phrases that I want to search in rss feeds for a match. 

I grab   the rss feeds, process them and they end up as a string IN a list entitled "documents". 
I want to check the dictionary below and if any of the phrases in the dictionary match part of a string of text I want to return the values for the key. 
I am not sure about the best way to approach this problem. Any suggestions would be greatly appreciated.
"""

ngrams = ["grab the rss", "approach this", "in"]

import re

counts = {}
for ngram in ngrams:
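    words = ngram.split()
    # Join the escaped words with \s+ so any run of whitespace between them matches.
    pattern = re.compile(r"\s+".join(re.escape(w) for w in words),
        re.IGNORECASE)
    # No \b anchors, so "in" is also counted inside words such as "string".
    counts[ngram] = len(pattern.findall(text))

print(counts)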

Output:

{'grab the rss': 1, 'approach this': 1, 'in': 5}
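
If the numbers in the question's dictionary are indeed weights, the counts can be multiplied by them to get a score for one document. The sketch below assumes exactly that; the sample `weights` and `text` are made up, `weighted_score` is a hypothetical helper, and unlike the code above it adds `\b` anchors so phrases only match on word boundaries.

```python
import re

# Hypothetical weights and text, in the style of the question's dictionary.
weights = {"cut costs": -1, "state aid": +1, "cash flow problems": -2}
text = "Cash flow problems forced it to cut costs; cash  flow problems persist."

def weighted_score(text, weights):
    """Sum weight * occurrence count over all phrases in the dictionary."""
    total = 0
    for phrase, weight in weights.items():
        # Escaped words joined by \s+ so extra spaces or newlines still match.
        body = r"\s+".join(re.escape(word) for word in phrase.split())
        pattern = re.compile(r"\b" + body + r"\b", re.IGNORECASE)
        total += weight * len(pattern.findall(text))
    return total

print(weighted_score(text, weights))
```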

