
NLP Phrase Search in Python

I have been going through many libraries such as Whoosh and NLTK, and concepts such as WordNet.

However, I am unable to tackle my problem. I am not sure whether I can find a library for this or whether I have to build it using the resources mentioned above.

Question: My scenario is that I have to search for keywords. Say I have keywords like 'Sales Document' / 'Purchase Documents' and have to search for them in a small book of 10-15 pages.

The catch is that they can also be written as 'Sales should be documented' or 'company selling should be written in the text files' (both for the keyword 'Sales Document'). Is there an existing approach for this, or will I have to build something?

The code for the POS tags is as follows. If no library is available, I will have to proceed with this.

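from nltk.corpus import wordnet
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

# Requires the NLTK data for WordNet, the tokenizer, and the POS tagger.


def tag(x):
    """Tokenize a string and return its (token, POS tag) pairs."""
    return pos_tag(word_tokenize(x))


synonyms = []
antonyms = []

# WordNet is keyed by single lemmas, so look up each word of the keyword
# separately instead of the whole phrase "Sales document".
for word in "Sales document".split():
    for syn in wordnet.synsets(word):
        print(syn)
        for lemma in syn.lemmas():
            print(lemma)
            synonyms.append(lemma.name())
            # Collect the first antonym, if this lemma has any.
            if lemma.antonyms():
                antonyms.append(lemma.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))

# POS-tag every collected synonym.
for i in synonyms:
    print(tag(i))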

Update: We went ahead and made a Python program. Feel free to fork it (pun intended). The Git Dhund is very untidy right now; I will clean it up once it is complete. It is still in the development phase.

This is the link.

To match occurrences like "Sales should be documented", increase the slop parameter of the Phrase query object in Whoosh.

whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)

slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.

You can also define slop in the query string like this: "Sales should be documented"~5
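To make the idea of slop concrete, here is a minimal pure-Python sketch of a proximity match. It is not Whoosh's implementation: `normalize`, `tokens`, and `phrase_match` are hypothetical helpers, the suffix stripping is a crude stand-in for a real stemmer, and the counting only mirrors the documented convention that a slop of 1 means the words are adjacent. A real index may count positions differently, for example when stop words are removed.

```python
import re


def normalize(word):
    """Crude stand-in for a stemmer: lowercase and drop a trailing 'ed' or 's'."""
    word = word.lower()
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokens(text):
    """Split text into normalized alphabetic tokens."""
    return [normalize(w) for w in re.findall(r"[A-Za-z]+", text)]


def phrase_match(text, phrase, slop=1):
    """Return True if the phrase words occur in order, each at most `slop`
    positions after the previous one (slop=1 means directly adjacent)."""
    doc = tokens(text)
    words = tokens(phrase)
    if not words:
        return False

    def extend(pos, idx):
        # All phrase words placed: the phrase matched.
        if idx == len(words):
            return True
        # Try every position within `slop` of the previous matched word.
        last = min(pos + slop, len(doc) - 1)
        for nxt in range(pos + 1, last + 1):
            if doc[nxt] == words[idx] and extend(nxt, idx + 1):
                return True
        return False

    return any(doc[i] == words[0] and extend(i, 1) for i in range(len(doc)))


if __name__ == "__main__":
    sentence = "Sales should be documented"
    for slop in (1, 2, 3):
        print(slop, phrase_match(sentence, "Sales Document", slop=slop))
```

In this simplified model, the keyword 'Sales Document' only matches the sentence once the slop is large enough to span the two intervening words.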


To match the second example, "company selling should be written in the text files", you need semantic processing of your texts. Whoosh has a low-level implementation of the WordNet thesaurus that lets you index synonyms, but it only has one-word synonyms.
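One way to approximate this with one-word synonyms is to expand each word of the keyword into a set of related words and then flag sentences that contain at least one word from every set. The sketch below is only an illustration under stated assumptions: the `SYNONYMS` table is hand-written and hypothetical (in practice it might be built from WordNet lemmas), and `split_sentences`, `matches_keyword`, and `find_sentences` are made-up helper names.

```python
import re

# Hypothetical, hand-written synonym sets (an assumption for this sketch).
SYNONYMS = {
    "sales": {"sales", "sale", "selling", "sold"},
    "document": {"document", "documents", "documented",
                 "written", "recorded", "files"},
}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def matches_keyword(sentence, keyword):
    """True if the sentence has at least one synonym for every keyword word."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return all(
        words & SYNONYMS.get(term, {term})
        for term in keyword.lower().split()
    )


def find_sentences(text, keyword):
    """Return the sentences of `text` that match the expanded keyword."""
    return [s for s in split_sentences(text) if matches_keyword(s, keyword)]


if __name__ == "__main__":
    book = ("Company selling should be written in the text files. "
            "Purchases are handled elsewhere.")
    print(find_sentences(book, "Sales Document"))
```

Matching at sentence level keeps the sketch simple. It ignores word order and distance, so it will produce false positives that a proximity check could filter out.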
