从斯坦福分析器的无上下文短语结构输出中提取信息

Question

Stanford Parser（http://nlp.stanford.edu/software/lex-parser.shtml）给出了无上下文的短语结构树，如下所示。 提取树中所有名词短语（NP）和动词短语（NP）的最佳方法是什么？ 是否有任何Python（或Java）库可以让我读取这些结构？ 谢谢。

(ROOT
  (S
    (S
      (NP
        (NP (DT The) (JJS strongest) (NN rain))
        (VP
          (ADVP (RB ever))
          (VBN recorded)
          (PP (IN in)
            (NP (NNP India)))))
      (VP
        (VP (VBD shut)
          (PRT (RP down))
          (NP
            (NP (DT the) (JJ financial) (NN hub))
            (PP (IN of)
              (NP (NNP Mumbai)))))
        (, ,)
        (VP (VBD snapped)
          (NP (NN communication) (NNS lines)))
        (, ,)
        (VP (VBD closed)
          (NP (NNS airports)))
        (CC and)
        (VP (VBD forced)
          (NP
            (NP (NNS thousands))
            (PP (IN of)
              (NP (NNS people))))
          (S
            (VP (TO to)
              (VP
                (VP (VB sleep)
                  (PP (IN in)
                    (NP (PRP$ their) (NNS offices))))
                (CC or)
                (VP (VB walk)
                  (NP (NN home))
                  (PP (IN during)
                    (NP (DT the) (NN night))))))))))
    (, ,)
    (NP (NNS officials))
    (VP (VBD said)
      (NP-TMP (NN today)))
    (. .)))

Answer 1

查看nltk.org上的Natural Language Toolkit（NLTK）。

该工具包是用Python编写的，它提供了精确读取这些树（以及许多其他东西）的代码。

或者，您可以编写自己的递归函数来执行此操作。 这将非常简单。

只是为了好玩：这是一个超级简单的实现你想要的：

def parse():
  itr = iter(filter(lambda x: x, re.split("\\s+", s.replace('(', ' ( ').replace(')', ' ) '))))

  def _parse():
    stuff = []
    for x in itr:
      if x == ')':
        return stuff
      elif x == '(':
        stuff.append(_parse())
      else:
        stuff.append(x)
    return stuff

  return _parse()[0]

def find(parsed, tag):
  if parsed[0] == tag:
    yield parsed
  for x in parsed[1:]:
    for y in find(x, tag):
      yield y

p = parse()
np = find(p, 'NP')
for x in np:
  print x

收益率：

['NP', ['NP', ['DT', 'The'], ['JJS', 'strongest'], ['NN', 'rain']], ['VP', ['ADVP', ['RB', 'ever']], ['VBN', 'recorded'], ['PP', ['IN', 'in'], ['NP', ['NNP', 'India']]]]]
['NP', ['DT', 'The'], ['JJS', 'strongest'], ['NN', 'rain']]
['NP', ['NNP', 'India']]
['NP', ['NP', ['DT', 'the'], ['JJ', 'financial'], ['NN', 'hub']], ['PP', ['IN', 'of' ['NP', ['NNP', 'Mumbai']]]]
['NP', ['DT', 'the'], ['JJ', 'financial'], ['NN', 'hub']]
['NP', ['NNP', 'Mumbai']]
['NP', ['NN', 'communication'], ['NNS', 'lines']]
['NP', ['NNS', 'airports']]
['NP', ['NP', ['NNS', 'thousands']], ['PP', ['IN', 'of'], ['NP', ['NNS', 'people']]]]
['NP', ['NNS', 'thousands']]
['NP', ['NNS', 'people']]
['NP', ['PRP$', 'their'], ['NNS', 'offices']]
['NP', ['NN', 'home']]
['NP', ['DT', 'the'], ['NN', 'night']]
['NP', ['NNS', 'officials']]

从斯坦福分析器的无上下文短语结构输出中提取信息

问题描述

1 个解决方案

解决方案1
2 已采纳 2012-03-21 03:51:51

从斯坦福分析器的无上下文短语结构输出中提取信息

问题描述

1 个解决方案

解决方案1 2 已采纳 2012-03-21 03:51:51

解决方案1
2 已采纳 2012-03-21 03:51:51