How to parse sentences based on lexical content (phrases) with Python-NLTK
Can Python-NLTK recognize an input string and parse it based not only on whitespace but also on its content? Say, "computer system" becomes a single phrase in this situation. Can anyone provide sample code?
Input string: "A survey of user opinion of computer system response time"
Expected output: ["A", "survey", "of", "user", "opinion", "of", "computer system", "response", "time"]
The technique you're looking for goes by several names across different sub-fields (and sub-sub-fields) of linguistics and computing.
I'll give an example using the NE (named entity) chunker in NLTK:
>>> from nltk import word_tokenize, ne_chunk, pos_tag
>>> sent = "A survey of user opinion of computer system response time"
>>> chunked = ne_chunk(pos_tag(word_tokenize(sent)))
>>> for i in chunked:
...     print(i)
...
('A', 'DT')
('survey', 'NN')
('of', 'IN')
('user', 'NN')
('opinion', 'NN')
('of', 'IN')
('computer', 'NN')
('system', 'NN')
('response', 'NN')
('time', 'NN')
With named entities:
>>> sent2 = "Barack Obama meets Michael Jackson in Nihonbashi"
>>> chunked = ne_chunk(pos_tag(word_tokenize(sent2)))
>>> for i in chunked:
...     print(i)
...
(PERSON Barack/NNP)
(ORGANIZATION Obama/NNP)
('meets', 'NNS')
(PERSON Michael/NNP Jackson/NNP)
('in', 'IN')
(GPE Nihonbashi/NNP)
As you can see, it's quite flawed, but something is better than nothing, I guess.
Terminology Extraction
Here are a few tools
Now back to OP's question.
Q: Can NLTK extract "computer system" as a phrase?
A: Not really
As shown above, NLTK has a pre-trained chunker, but it works on named entities, and even then not all named entities are recognized well.
OP could possibly try a more radical idea. Let's assume that a sequence of consecutive nouns always forms a phrase:
>>> from nltk import word_tokenize, pos_tag
>>> sent = "A survey of user opinion of computer system response time"
>>> tagged = pos_tag(word_tokenize(sent))
>>> chunks = []
>>> current_chunk = []
>>> for word, pos in tagged:
...     if pos.startswith('N'):
...         current_chunk.append((word, pos))
...     else:
...         if current_chunk:
...             chunks.append(current_chunk)
...         current_chunk = []
...
>>> if current_chunk:  # flush the run of nouns that ends the sentence
...     chunks.append(current_chunk)
...
>>> chunks
[[('survey', 'NN')], [('user', 'NN'), ('opinion', 'NN')], [('computer', 'NN'), ('system', 'NN'), ('response', 'NN'), ('time', 'NN')]]
>>> for i in chunks:
...     print(i)
...
[('survey', 'NN')]
[('user', 'NN'), ('opinion', 'NN')]
[('computer', 'NN'), ('system', 'NN'), ('response', 'NN'), ('time', 'NN')]
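The same "consecutive nouns form a phrase" idea can be written more compactly. What follows is only a minimal sketch, not part of the session above: it hard-codes the tagged tokens from the first example so it runs without NLTK's tagger models, and `noun_runs` is a hypothetical helper name.

```python
from itertools import groupby

# Tagged tokens copied from the pos_tag output shown earlier, hard-coded
# so the sketch runs without downloading any NLTK models.
tagged = [
    ('A', 'DT'), ('survey', 'NN'), ('of', 'IN'), ('user', 'NN'),
    ('opinion', 'NN'), ('of', 'IN'), ('computer', 'NN'), ('system', 'NN'),
    ('response', 'NN'), ('time', 'NN'),
]


def noun_runs(tagged_tokens):
    """Return each maximal run of consecutive noun-tagged words as one phrase."""
    phrases = []
    # groupby starts a new group whenever the "is a noun" flag changes.
    for is_noun, group in groupby(tagged_tokens, key=lambda t: t[1].startswith('N')):
        if is_noun:
            phrases.append(' '.join(word for word, _ in group))
    return phrases


print(noun_runs(tagged))
# ['survey', 'user opinion', 'computer system response time']
```

Because `groupby` closes the last group when the input ends, a run of nouns at the end of the sentence is not lost.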
So even with that solution, it seems that getting 'computer system' alone is hard. But if you think about it for a bit, 'computer system response time' seems like a more valid phrase than 'computer system'.
Do note that all interpretations of 'computer system response time' seem valid:
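To make the ambiguity concrete, here is a rough sketch that enumerates the purely structural readings of a four-noun compound. It assumes only binary groupings and says nothing about which reading is linguistically preferred; `bracketings` is a hypothetical helper written for illustration, not an NLTK function.

```python
def bracketings(words):
    """Return every binary bracketing of a sequence of words as a string."""
    if len(words) == 1:
        return [words[0]]
    results = []
    # Split the sequence at every position and combine each bracketing
    # of the left part with each bracketing of the right part.
    for i in range(1, len(words)):
        for left in bracketings(words[:i]):
            for right in bracketings(words[i:]):
                results.append('[{} {}]'.format(left, right))
    return results


for b in bracketings(['computer', 'system', 'response', 'time']):
    print(b)
# [computer [system [response time]]]
# [computer [[system response] time]]
# [[computer system] [response time]]
# [[computer [system response]] time]
# [[[computer system] response] time]
```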
And there are many, many more possible interpretations. So you have to ask what you are using the extracted phrase for, and then decide how to cut long phrases like 'computer system response time'.
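If the phrases you care about are known in advance, one way to cut a long noun sequence is to merge only the token runs that appear in your own phrase list. The sketch below assumes such a list exists (here just "computer system", taken from the question) and uses plain whitespace tokenization; `merge_known_phrases` is a hypothetical helper, not an NLTK API.

```python
def merge_known_phrases(tokens, phrases):
    """Greedily merge token runs that match a known multi-word phrase."""
    # Represent phrases as tuples of words and try longer phrases first;
    # empty phrases are skipped so the loop below always advances.
    phrase_tuples = sorted(
        {tuple(p.split()) for p in phrases if p.split()},
        key=len,
        reverse=True,
    )
    merged = []
    i = 0
    while i < len(tokens):
        for phrase in phrase_tuples:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                merged.append(' '.join(phrase))
                i += len(phrase)
                break
        else:
            # No known phrase starts here, so keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


sent = "A survey of user opinion of computer system response time"
print(merge_known_phrases(sent.split(), ["computer system"]))
# ['A', 'survey', 'of', 'user', 'opinion', 'of', 'computer system', 'response', 'time']
```

NLTK also ships `nltk.tokenize.MWETokenizer`, which serves a similar purpose when you can supply the multi-word expressions yourself.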