NLP phrase search in Python

Question

I have been going through many libraries like Whoosh and NLTK, and concepts like WordNet.

However, I am unable to solve my problem. I am not sure whether I can find a library for this or whether I have to build it myself using the resources mentioned above.

My scenario is that I have to search for keywords. Say I have keywords like 'Sales Document' / 'Purchase Documents' and have to search for them in a small book of 10-15 pages.

The catch is that they can also be written as 'Sales should be documented' or 'company selling should be written in the text files' (both for the keyword 'Sales Document'). Is there an existing approach for this, or will I have to build something?

The code for the POS tags is as follows. If no library is available, I will have to proceed with this.

from nltk.corpus import wordnet
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize


def tag(x):
    # Tokenize the text and attach a part-of-speech tag to each token.
    return pos_tag(word_tokenize(x))


synonyms = []
antonyms = []

# WordNet has no entry for the whole phrase "Sales document", so
# wordnet.synsets("Sales document") returns nothing. Look up each word instead.
for word in "Sales document".split():
    for syn in wordnet.synsets(word):
        print(syn)
        for l in syn.lemmas():
            print(l)
            synonyms.append(l.name())
            if l.antonyms():
                antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))

# POS-tag every synonym that was collected.
for i in synonyms:
    print(tag(i))

Update: We went ahead and made a Python program. Feel free to fork it (pun intended). The Git Dhund is very untidy right now; we will clean it up once it is complete. It is still in development.

This is the link.

Answer 1

To match occurrences like "Sales should be documented", increase the slop parameter of the Phrase query object in Whoosh.

whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)

slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.

You can also define slop in the query string like this: "Sales should be documented"~5

Matching the second example, "company selling should be written in the text files", requires semantic processing of your texts. Whoosh has a low-level implementation of the WordNet thesaurus that lets you index synonyms, but it only handles one-word synonyms.
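To make the effect of slop concrete, here is a minimal pure-Python sketch of the idea. It is not Whoosh's matcher: the helper names tokenize and phrase_matches are made up for illustration, and the function simply follows the description quoted above, where slop=1 means each phrase word must directly follow the previous one. It compares exact lowercase tokens, so matching the keyword 'Sales Document' against "documented" would additionally need stemming (in Whoosh, for example, through an analyzer such as StemmingAnalyzer).

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def phrase_matches(text, words, slop=1):
    """Return True if `words` occur in order in `text`, each word at most
    `slop` token positions after the previous one (slop=1 means adjacent)."""
    tokens = tokenize(text)
    words = [w.lower() for w in words]
    if not words:
        return False

    def search(word_index, prev_pos):
        # All phrase words have been placed: the phrase matches.
        if word_index == len(words):
            return True
        if prev_pos is None:
            # The first word may start anywhere in the text.
            candidates = range(len(tokens))
        else:
            # Later words must fall within `slop` positions of the previous one.
            last = min(prev_pos + slop, len(tokens) - 1)
            candidates = range(prev_pos + 1, last + 1)
        return any(
            tokens[pos] == words[word_index] and search(word_index + 1, pos)
            for pos in candidates
        )

    return search(0, None)


sentence = "Sales should be documented"
print(phrase_matches(sentence, ["sales", "documented"], slop=1))
print(phrase_matches(sentence, ["sales", "documented"], slop=3))
```

In this sentence "documented" sits three positions after "sales", so the phrase is rejected with the default slop and accepted once slop is at least 3.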
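As a rough illustration of what single-word synonym expansion buys you, the sketch below accepts any word from a small synonym set in place of each keyword term and then applies a slop-style window. Everything in it is an assumption made for the example: the SYNONYMS table is hand-written (a real system might derive the sets from WordNet), and expand and matches_with_synonyms are hypothetical helpers, not part of Whoosh or NLTK.

```python
import re

# Hypothetical hand-written synonym table, for illustration only.
SYNONYMS = {
    "sales": {"sales", "selling"},
    "document": {"document", "documented", "written"},
}


def expand(keyword):
    """Map each word of the keyword to the set of words accepted in its place."""
    return [SYNONYMS.get(w, {w}) for w in keyword.lower().split()]


def matches_with_synonyms(text, keyword, slop=5):
    """True if one accepted word per keyword term occurs in order, each term
    at most `slop` token positions after the previous one."""
    tokens = re.findall(r"[a-z]+", text.lower())
    groups = expand(keyword)

    def search(group_index, prev_pos):
        # Every keyword term has been matched.
        if group_index == len(groups):
            return True
        if prev_pos is None:
            start, stop = 0, len(tokens)
        else:
            start = prev_pos + 1
            stop = min(prev_pos + slop + 1, len(tokens))
        return any(
            tokens[pos] in groups[group_index] and search(group_index + 1, pos)
            for pos in range(start, stop)
        )

    return bool(groups) and search(0, None)


for line in [
    "Sales should be documented",
    "company selling should be written in the text files",
    "The weather is nice today",
]:
    print(line, "->", matches_with_synonyms(line, "Sales Document"))
```

The table only covers one-word substitutions, which mirrors the limitation described above: paraphrases that change the sentence structure would still need real semantic processing.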

Solution 1
2 2018-06-23 11:23:33
