NLP Phrase Search in Python
I have been going through many libraries like whoosh/nltk and concepts like WordNet. However, I am unable to tackle my problem. I am not sure whether I can find a library for this or whether I have to build it using the above-mentioned resources.

Question: My scenario is that I have to search for keywords. Say I have keywords like 'Sales Document' / 'Purchase Documents' and have to search for them in a small book of 10-15 pages. The catch is that they can also be written as 'Sales should be documented' or 'company selling should be written in the text files' (for the keyword 'Sales Document'). Is there an existing approach for this, or will I have to build something?

The code for the POS tags is as follows. If no library is available, I will have to proceed with this.
from nltk.corpus import wordnet
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize


def tag(x):
    # Tokenize the text and attach a part-of-speech tag to every token
    return pos_tag(word_tokenize(x))


synonyms = []
antonyms = []

# Look up the whole keyword; a multi-word phrase may have no WordNet entry
for syn in wordnet.synsets("Sales document"):
    print(syn)
    for lemma in syn.lemmas():
        print(" \n")
        print(lemma)
        synonyms.append(lemma.name())
        # Record the first antonym of the lemma, if it has any
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))

# POS-tag every synonym that was collected
for i in synonyms:
    print(tag(i))
Update: We went ahead and made a Python program. Feel free to fork it (pun intended). Also, the Git Dhund is very untidy right now; we will clean it up once it is completed. Currently it is still in a development phase.
To match occurrences like "Sales should be documented", you can increase the slop parameter of the Phrase query object in Whoosh.
whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
slop: the number of words allowed between each "word" in the phrase; the default of 1 means the phrase must match exactly.
You can also define slop in the query string like this:
"Sales should be documented"~5
To match the second example, "company selling should be written in the text files", you need semantic processing of your texts. Whoosh has a low-level implementation of a WordNet thesaurus that lets you index synonyms, but it only handles one-word synonyms.