How to parse sentences based on lexical content (phrases) with Python-NLTK

Can Python-NLTK recognize an input string and parse it based not only on whitespace but also on its content? Say, "computer system" should be treated as a phrase in this situation. Can anyone provide sample code?


Input string: "A survey of user opinion of computer system response time"

Expected output: ["A", "survey", "of", "user", "opinion", "of", "computer system", "response", "time"]

The technology you're looking for goes by several names across several sub-fields (or sub-sub-fields) of linguistics and computing.


I'll give an example of the NE chunker in NLTK:

>>> from nltk import word_tokenize, ne_chunk, pos_tag
>>> sent = "A survey of user opinion of computer system response time"
>>> # Tokenize, POS-tag, then run the pre-trained named-entity chunker.
>>> chunked = ne_chunk(pos_tag(word_tokenize(sent)))
>>> for i in chunked:
...     print(i)
...
('A', 'DT')
('survey', 'NN')
('of', 'IN')
('user', 'NN')
('opinion', 'NN')
('of', 'IN')
('computer', 'NN')
('system', 'NN')
('response', 'NN')
('time', 'NN')

With named entities:

>>> sent2 = "Barack Obama meets Michael Jackson in Nihonbashi"
>>> chunked = ne_chunk(pos_tag(word_tokenize(sent2)))
>>> for i in chunked:
...     print(i)
...
(PERSON Barack/NNP)
(ORGANIZATION Obama/NNP)
('meets', 'NNS')
(PERSON Michael/NNP Jackson/NNP)
('in', 'IN')
(GPE Nihonbashi/NNP)

As you can see, it's pretty flawed, but something is better than nothing, I guess.


  • Multi-Word Expression (MWE) extraction
    • Hot topic in NLP; everyone wants to extract them for one reason or another
    • Most notable work by Ivan Sag: http://lingo.stanford.edu/pubs/WP-2001-03.pdf and a miasma of all sorts of extraction algorithms and extracted usage from ACL papers
    • MWEs are still very mysterious: we don't know how to classify them automatically or extract them properly, so there are no proper tools for this (strangely, the output that MWE researchers want can often be obtained with keyphrase extraction or chunking...)


Now back to OP's question.

Q: Can NLTK extract "computer system" as a phrase?

A: Not really.

As shown above, NLTK has a pre-trained chunker, but it works on named entities, and even so, not all named entities are recognized well.

OP could possibly try a more radical idea. Let's assume that a sequence of consecutive nouns always forms a phrase:

>>> from nltk import word_tokenize, pos_tag
>>> sent = "A survey of user opinion of computer system response time"
>>> tagged = pos_tag(word_tokenize(sent))
>>> chunks = []
>>> current_chunk = []
>>> for word, pos in tagged:
...     if pos.startswith('N'):
...         # Extend the current run of nouns.
...         current_chunk.append((word, pos))
...     else:
...         # A non-noun ends the run; save it if it is non-empty.
...         if current_chunk:
...             chunks.append(current_chunk)
...         current_chunk = []
...
>>> # Flush the final run, since this sentence ends with a noun.
>>> if current_chunk:
...     chunks.append(current_chunk)
...
>>> chunks
[[('survey', 'NN')], [('user', 'NN'), ('opinion', 'NN')], [('computer', 'NN'), ('system', 'NN'), ('response', 'NN'), ('time', 'NN')]]
>>> for i in chunks:
...     print(i)
...
[('survey', 'NN')]
[('user', 'NN'), ('opinion', 'NN')]
[('computer', 'NN'), ('system', 'NN'), ('response', 'NN'), ('time', 'NN')]

So even with that solution, getting 'computer system' alone seems hard. But if you think about it for a bit, 'computer system response time' seems like a more valid phrase than 'computer system'.

Do note that all of these interpretations of 'computer system response time' seem valid:

  • [computer system response time]
  • [computer [system [response [time]]]]
  • [computer system] [response time]
  • [computer [system response time]]
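To get a rough feel for how quickly the readings multiply, here is a minimal sketch that enumerates only the flat (non-nested) ways of cutting the four nouns into contiguous groups. The function name `segmentations` is made up for illustration and is not part of NLTK.

```python
from itertools import product


def segmentations(words):
    """Yield every way to cut a non-empty word list into contiguous groups."""
    # Each of the len(words) - 1 gaps between words is either a cut or not.
    for cuts in product([False, True], repeat=len(words) - 1):
        groups, current = [], [words[0]]
        for word, cut in zip(words[1:], cuts):
            if cut:
                # Close the current group and start a new one.
                groups.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        groups.append(" ".join(current))
        yield groups


for seg in segmentations("computer system response time".split()):
    print(seg)
```

For four words this prints 2**3 = 8 segmentations, including ['computer system', 'response time']. Nested bracketings such as [computer [system response time]] are not counted here.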

And many, many more possible interpretations. So you've got to ask what you are using the extracted phrases for, and then see how to proceed with cutting up long phrases like 'computer system response time'.
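If the goal is exactly the expected output from the question, one option is to decide up front which phrases matter and merge them yourself. Below is a minimal sketch in plain Python, assuming you supply your own phrase list; the helper name `merge_known_phrases` and the greedy longest-match strategy are illustrative assumptions, not something NLTK does for you automatically.

```python
def merge_known_phrases(tokens, phrases):
    """Greedily merge the longest known phrase starting at each position."""
    # Store phrases as tuples of lowercased words for case-insensitive matching.
    known = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in known), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in known:
                merged.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No known phrase starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


sent = "A survey of user opinion of computer system response time"
# Plain whitespace split for this sketch; nltk.word_tokenize would also work.
print(merge_known_phrases(sent.split(), ["computer system"]))
# ['A', 'survey', 'of', 'user', 'opinion', 'of', 'computer system', 'response', 'time']
```

The phrase list does all the work here, so the result is only as good as the lexicon. NLTK's `nltk.tokenize.MWETokenizer` offers a similar lexicon-driven merge if you would rather stay inside the library.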
