
Extracting information from context-free phrase structure output from Stanford Parser

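The Stanford Parser (http://nlp.stanford.edu/software/lex-parser.shtml) produces context-free phrase structure trees like the one below. What is the best way to extract all the noun phrases (NP) and verb phrases (VP) in the tree? Is there a Python (or Java) library that can read structures like this? Thanks.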

(ROOT
  (S
    (S
      (NP
        (NP (DT The) (JJS strongest) (NN rain))
        (VP
          (ADVP (RB ever))
          (VBN recorded)
          (PP (IN in)
            (NP (NNP India)))))
      (VP
        (VP (VBD shut)
          (PRT (RP down))
          (NP
            (NP (DT the) (JJ financial) (NN hub))
            (PP (IN of)
              (NP (NNP Mumbai)))))
        (, ,)
        (VP (VBD snapped)
          (NP (NN communication) (NNS lines)))
        (, ,)
        (VP (VBD closed)
          (NP (NNS airports)))
        (CC and)
        (VP (VBD forced)
          (NP
            (NP (NNS thousands))
            (PP (IN of)
              (NP (NNS people))))
          (S
            (VP (TO to)
              (VP
                (VP (VB sleep)
                  (PP (IN in)
                    (NP (PRP$ their) (NNS offices))))
                (CC or)
                (VP (VB walk)
                  (NP (NN home))
                  (PP (IN during)
                    (NP (DT the) (NN night))))))))))
    (, ,)
    (NP (NNS officials))
    (VP (VBD said)
      (NP-TMP (NN today)))
    (. .)))

Check out the Natural Language Toolkit (NLTK) at nltk.org.

The toolkit is written in Python and provides code for reading exactly these kinds of trees (along with lots of other things).
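As a rough sketch of what reading such a tree with NLTK might look like, the snippet below uses `nltk.tree.Tree.fromstring` and `Tree.subtrees` from NLTK 3. The shortened sample sentence and the helper name `phrases` are made up for illustration; they are not part of the original question.

```python
from nltk.tree import Tree

# Shortened bracketed parse in the same format as the parser output above
# (illustrative sample, not the full sentence from the question).
bracketed = """
(ROOT
  (S
    (NP (DT the) (JJ financial) (NN hub))
    (VP (VBD closed)
      (NP (NNS airports)))
    (. .)))
"""

tree = Tree.fromstring(bracketed)

def phrases(tree, label):
    """Return the words of every subtree whose label equals `label`."""
    return [" ".join(sub.leaves())
            for sub in tree.subtrees(lambda t: t.label() == label)]

noun_phrases = phrases(tree, "NP")
verb_phrases = phrases(tree, "VP")
print(noun_phrases)
print(verb_phrases)
```

The filter compares labels exactly, so a node tagged `NP-TMP` would not be returned as an `NP`; a prefix check could be used instead if those should count.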

Alternatively, you could write your own recursive function to do this. It would be quite straightforward.


Just for fun: here is a super simple implementation of what you want:

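# s must hold the bracketed parser output shown above, as a string.

def parse(s):
  # Pad the parentheses with spaces so they split into separate tokens.
  itr = iter(s.replace('(', ' ( ').replace(')', ' ) ').split())

  def _parse():
    # Build one node as a list: [label, child, child, ...].
    stuff = []
    for x in itr:
      if x == ')':
        return stuff
      elif x == '(':
        stuff.append(_parse())
      else:
        stuff.append(x)
    return stuff

  return _parse()[0]

def find(parsed, tag):
  # Leaves are plain strings; only nested lists carry a label.
  if not isinstance(parsed, list):
    return
  if parsed[0] == tag:
    yield parsed
  for x in parsed[1:]:
    for y in find(x, tag):
      yield y

p = parse(s)
np = find(p, 'NP')
for x in np:
  print(x)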

This prints:

['NP', ['NP', ['DT', 'The'], ['JJS', 'strongest'], ['NN', 'rain']], ['VP', ['ADVP', ['RB', 'ever']], ['VBN', 'recorded'], ['PP', ['IN', 'in'], ['NP', ['NNP', 'India']]]]]
['NP', ['DT', 'The'], ['JJS', 'strongest'], ['NN', 'rain']]
['NP', ['NNP', 'India']]
['NP', ['NP', ['DT', 'the'], ['JJ', 'financial'], ['NN', 'hub']], ['PP', ['IN', 'of'], ['NP', ['NNP', 'Mumbai']]]]
['NP', ['DT', 'the'], ['JJ', 'financial'], ['NN', 'hub']]
['NP', ['NNP', 'Mumbai']]
['NP', ['NN', 'communication'], ['NNS', 'lines']]
['NP', ['NNS', 'airports']]
['NP', ['NP', ['NNS', 'thousands']], ['PP', ['IN', 'of'], ['NP', ['NNS', 'people']]]]
['NP', ['NNS', 'thousands']]
['NP', ['NNS', 'people']]
['NP', ['PRP$', 'their'], ['NNS', 'offices']]
['NP', ['NN', 'home']]
['NP', ['DT', 'the'], ['NN', 'night']]
['NP', ['NNS', 'officials']]
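The question asks for verb phrases as well, and the nested lists above still have the tags mixed in with the words. The following is a minimal, self-contained sketch of one way to collect the plain text of every NP and VP using the same nested-list idea. The helper name `words`, the regex tokenizer, and the short sample tree are assumptions made for illustration.

```python
import re

def parse(s):
    """Turn a bracketed tree string into nested lists: [label, child, ...]."""
    tokens = iter(re.findall(r"\(|\)|[^\s()]+", s))

    def _parse():
        node = []
        for tok in tokens:
            if tok == ")":
                return node
            node.append(_parse() if tok == "(" else tok)
        return node

    return _parse()[0]

def find(node, tag):
    """Yield every subtree whose label equals `tag`, in depth-first order."""
    if isinstance(node, list):
        if node[0] == tag:
            yield node
        for child in node[1:]:
            yield from find(child, tag)

def words(node):
    """Collect the leaf words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in words(child)]

# Illustrative sample, cut down from the sentence in the question.
sample = """
(S
  (NP
    (NP (NNS thousands))
    (PP (IN of)
      (NP (NNS people))))
  (VP (VBD said)
    (NP-TMP (NN today))))
"""

tree = parse(sample)
phrases = {tag: [" ".join(words(n)) for n in find(tree, tag)]
           for tag in ("NP", "VP")}
print(phrases)
```

As in the output above, the exact label comparison means `NP-TMP` is not reported as an `NP`; swapping the equality test for `node[0].startswith(tag)` would be one way to include such nodes.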

