How to parse sentences based on lexical content (phrases) with Python-NLTK

Question

Can Python-NLTK recognize an input string and parse it based not only on whitespace but also on content? Say, "computer system" becomes a single phrase in this situation. Can anyone provide sample code?

Input string: "A survey of user opinion of computer system response time"

Expected output: ["A", "survey", "of", "user", "opinion", "of", "computer system", "response", "time"]

Answer 1

The technology you're looking for goes by multiple names in multiple sub-fields or sub-sub-fields of linguistics and computing.

Keyphrase Extraction
- From Information Retrieval, mainly used for improving indexing/querying for search
- Read this recent survey paper: http://www.hlt.utdallas.edu/~saidul/acl14.pdf
- (I personally) strongly recommend: https://code.google.com/p/jatetoolkit/ and of course the famous https://code.google.com/p/kea-algorithm/ (from the people who brought you WEKA, http://www.cs.waikato.ac.nz/ml/weka/ )
- For Python, possibly https://github.com/aneesha/RAKE
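
To give a feel for how keyphrase extractors of this kind generate candidates, here is a minimal sketch of the stopword-splitting step that RAKE-style methods start from. It is not the API of the RAKE repository linked above; the function name `candidate_phrases` and the tiny `STOPWORDS` set are assumptions made up for illustration, and a real extractor would also score and rank the candidates.

```python
import re

# Tiny illustrative stopword list (an assumption; real systems use a much longer one).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and"}

def candidate_phrases(text):
    """Split text into candidate keyphrases at stopwords and punctuation."""
    phrases, current = [], []
    # Tokens are either runs of letters/digits or single punctuation characters.
    for token in re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text):
        if token.lower() in STOPWORDS or not token.isalnum():
            # A stopword or punctuation mark closes the current candidate.
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(" ".join(current))
    return phrases

print(candidate_phrases("A survey of user opinion of computer system response time"))
# ['survey', 'user opinion', 'computer system response time']
```

Note that this already groups "computer system response time" into one candidate, without any part-of-speech tagging.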

Chunking
- From Natural Language Processing; it's also called shallow parsing
- Read Steve Abney's work on how it came about: http://www.vinartus.net/spa/90e.pdf
- Major NLP frameworks and toolkits should have one (e.g. OpenNLP, GATE, NLTK*) (*do note that NLTK's default chunker only works for named entities)
- Stanford NLP has one too: http://nlp.stanford.edu/projects/shallow-parsing.shtml

I'll give an example of the NE chunker in NLTK:

>>> from nltk import word_tokenize, ne_chunk, pos_tag
>>> sent = "A survey of user opinion of computer system response time"
>>> chunked = ne_chunk(pos_tag(word_tokenize(sent)))
>>> for i in chunked:
...     print(i)
... 
('A', 'DT')
('survey', 'NN')
('of', 'IN')
('user', 'NN')
('opinion', 'NN')
('of', 'IN')
('computer', 'NN')
('system', 'NN')
('response', 'NN')
('time', 'NN')

With named entities:

>>> sent2 = "Barack Obama meets Michael Jackson in Nihonbashi"
>>> chunked = ne_chunk(pos_tag(word_tokenize(sent2)))
>>> for i in chunked:
...     print(i)
... 
(PERSON Barack/NNP)
(ORGANIZATION Obama/NNP)
('meets', 'NNS')
(PERSON Michael/NNP Jackson/NNP)
('in', 'IN')
(GPE Nihonbashi/NNP)

You can see it's pretty flawed, but something is better than nothing, I guess.
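
Besides the pre-trained NE chunker, NLTK also has a rule-based chunker, `nltk.RegexpParser`, that takes a hand-written grammar over POS tags. The sketch below is only an illustration: the grammar `NP: {<NN.*>+}` (one or more consecutive noun tags) is my own assumption, not a recommended one, and the tags are hard-coded from the pos_tag output of the first example above so that no tagger model is needed.

```python
from nltk import RegexpParser

# POS tags copied from the pos_tag output of the first example (hard-coded on purpose).
tagged = [
    ("A", "DT"), ("survey", "NN"), ("of", "IN"), ("user", "NN"),
    ("opinion", "NN"), ("of", "IN"), ("computer", "NN"),
    ("system", "NN"), ("response", "NN"), ("time", "NN"),
]

# Illustrative grammar (an assumption): a chunk is one or more consecutive noun tags.
parser = RegexpParser("NP: {<NN.*>+}")
tree = parser.parse(tagged)

# Collect the words of every NP subtree; the root "S" node is filtered out.
phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(phrases)
# ['survey', 'user opinion', 'computer system response time']
```

With this grammar the whole noun run comes out as a single chunk, so "computer system" is still not isolated.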

Multi-Word Expression extraction
- Hot topic in NLP, everyone wants to extract them for one reason or another
- Most notable work is by Ivan Sag: http://lingo.stanford.edu/pubs/WP-2001-03.pdf, plus a miasma of all sorts of extraction algorithms and extracted usages from ACL papers
- As much as MWEs are very mysterious and we don't know how to classify them automatically or extract them properly, there are no proper tools for this (strangely, the output that MWE researchers want can often be obtained with keyphrase extraction or chunking...)

Terminology Extraction
- This comes from translation studies, where they want translators to use the correct technical word when translating a document.
- Do note that terminology comes with a cornucopia of ISO standards that one should follow, because of the convoluted translation industry that generates billions in income...
- Monolingually, I have no idea what makes term extractors different from keyphrase extractors: same algorithms, different interface... I guess the only special thing about some term extractors is the ability to work bilingually and produce a dictionary automatically.
Here are a few tools:
- https://github.com/srijiths/jtopia
- http://fivefilters.org/term-extraction/
- https://github.com/turian/topia.termextract
- https://www.airpair.com/nlp/keyword-extraction-tutorial
- http://termcoord.wordpress.com/about/testing-of-term-extraction-tools/free-term-extractors/
- Note on tools: there's still no one tool that stands out for term extraction, though. And because of the big money involved, it's always some API calls, and most code is "semi-open"... mostly closed. Then again, SEO is also big money, so possibly it's just a culture thing in the translation industry to be super secretive.

Now back to OP's question.

Q: Can NLTK extract "computer system" as a phrase?

A: Not really

As shown above, NLTK has a pre-trained chunker, but it works on named entities, and even so, not all named entities are recognized well.
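
That said, if the phrases are already known in advance, no chunker is needed: a lookup-based merge over the whitespace tokens is enough. Below is a minimal pure-Python sketch of that idea. The function name `merge_known_phrases` and the phrase list are hypothetical, made up for illustration. (NLTK's `nltk.tokenize.MWETokenizer` does a similar dictionary-driven merge, if you would rather stay inside NLTK.)

```python
def merge_known_phrases(tokens, phrases):
    """Greedily merge the longest known phrase starting at each position."""
    known = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in known), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest window first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in known:
                merged.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No known phrase starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged

sent = "A survey of user opinion of computer system response time"
print(merge_known_phrases(sent.split(), ["computer system"]))
# ['A', 'survey', 'of', 'user', 'opinion', 'of', 'computer system', 'response', 'time']
```

This gives exactly the expected output from the question, but only because "computer system" was supplied by hand; finding such phrases automatically is the hard part.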

Possibly OP could try a more radical idea. Let's assume that a sequence of nouns together always forms a phrase:

>>> from nltk import word_tokenize, pos_tag
>>> sent = "A survey of user opinion of computer system response time"
>>> tagged = pos_tag(word_tokenize(sent))
>>> chunks = []
>>> current_chunk = []
>>> for word, pos in tagged:
...     if pos.startswith('N'):
...         current_chunk.append((word, pos))
...     else:
...         if current_chunk:
...             chunks.append(current_chunk)
...         current_chunk = []
... 
>>> # Flush the final chunk, since the sentence ends with a noun.
>>> if current_chunk:
...     chunks.append(current_chunk)
... 
>>> chunks
[[('survey', 'NN')], [('user', 'NN'), ('opinion', 'NN')], [('computer', 'NN'), ('system', 'NN'), ('response', 'NN'), ('time', 'NN')]]
>>> for i in chunks:
...     print(i)
... 
[('survey', 'NN')]
[('user', 'NN'), ('opinion', 'NN')]
[('computer', 'NN'), ('system', 'NN'), ('response', 'NN'), ('time', 'NN')]

So even with that solution, getting 'computer system' alone seems hard. But if you think about it for a bit, 'computer system response time' seems like a more valid phrase than 'computer system'.

Do note that all of these interpretations of computer system response time seem valid:

[computer system response time]
[computer [system [response [time]]]]
[computer system] [response time]
[computer [system response time]]

And many, many more possible interpretations. So you've got to ask what you are using the extracted phrases for, and then see how to proceed with cutting long phrases like 'computer system response time'.
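
To make "many more" concrete, here is a small sketch that enumerates every flat way to cut a noun run into contiguous phrases. The function name `segmentations` is hypothetical, and it assumes a non-empty word list. It only covers flat splits; nested bracketings such as [computer [system response time]] add further readings on top of these.

```python
from itertools import product

def segmentations(words):
    """Yield every way to split a non-empty word list into contiguous phrases."""
    # Each gap between two neighbouring words is either a cut or not.
    for cuts in product([False, True], repeat=len(words) - 1):
        phrases, current = [], [words[0]]
        for word, cut in zip(words[1:], cuts):
            if cut:
                phrases.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        phrases.append(" ".join(current))
        yield phrases

all_splits = list(segmentations("computer system response time".split()))
print(len(all_splits))  # 8
for split in all_splits:
    print(split)
```

Four words have three gaps, so there are 2**3 = 8 flat segmentations, and the number doubles with every extra noun. Which one is "right" depends on what the phrases will be used for.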

Answer 1
18 Accepted 2014-12-02 00:50:36