簡體   English   中英

從斯坦福分析器的無上下文短語結構輸出中提取信息

[英]Extracting information from context-free phrase structure output from Stanford Parser

Stanford Parser(http://nlp.stanford.edu/software/lex-parser.shtml)給出了無上下文的短語結構樹,如下所示。 提取樹中所有名詞短語(NP)和動詞短語(NP)的最佳方法是什么? 是否有任何Python(或Java)庫可以讓我讀取這些結構? 謝謝。

(ROOT
  (S
    (S
      (NP
        (NP (DT The) (JJS strongest) (NN rain))
        (VP
          (ADVP (RB ever))
          (VBN recorded)
          (PP (IN in)
            (NP (NNP India)))))
      (VP
        (VP (VBD shut)
          (PRT (RP down))
          (NP
            (NP (DT the) (JJ financial) (NN hub))
            (PP (IN of)
              (NP (NNP Mumbai)))))
        (, ,)
        (VP (VBD snapped)
          (NP (NN communication) (NNS lines)))
        (, ,)
        (VP (VBD closed)
          (NP (NNS airports)))
        (CC and)
        (VP (VBD forced)
          (NP
            (NP (NNS thousands))
            (PP (IN of)
              (NP (NNS people))))
          (S
            (VP (TO to)
              (VP
                (VP (VB sleep)
                  (PP (IN in)
                    (NP (PRP$ their) (NNS offices))))
                (CC or)
                (VP (VB walk)
                  (NP (NN home))
                  (PP (IN during)
                    (NP (DT the) (NN night))))))))))
    (, ,)
    (NP (NNS officials))
    (VP (VBD said)
      (NP-TMP (NN today)))
    (. .)))

查看nltk.org上的Natural Language Toolkit(NLTK)。

該工具包是用Python編寫的,它提供了精確讀取這些樹(以及許多其他東西)的代碼。

或者,您可以編寫自己的遞歸函數來執行此操作。 這將非常簡單。


只是為了好玩:這是一個超級簡單的實現你想要的:

def parse():
  itr = iter(filter(lambda x: x, re.split("\\s+", s.replace('(', ' ( ').replace(')', ' ) '))))

  def _parse():
    stuff = []
    for x in itr:
      if x == ')':
        return stuff
      elif x == '(':
        stuff.append(_parse())
      else:
        stuff.append(x)
    return stuff

  return _parse()[0]

def find(parsed, tag):
  if parsed[0] == tag:
    yield parsed
  for x in parsed[1:]:
    for y in find(x, tag):
      yield y

p = parse()
np = find(p, 'NP')
for x in np:
  print x

收益率:

['NP', ['NP', ['DT', 'The'], ['JJS', 'strongest'], ['NN', 'rain']], ['VP', ['ADVP', ['RB', 'ever']], ['VBN', 'recorded'], ['PP', ['IN', 'in'], ['NP', ['NNP', 'India']]]]]
['NP', ['DT', 'The'], ['JJS', 'strongest'], ['NN', 'rain']]
['NP', ['NNP', 'India']]
['NP', ['NP', ['DT', 'the'], ['JJ', 'financial'], ['NN', 'hub']], ['PP', ['IN', 'of' ['NP', ['NNP', 'Mumbai']]]]
['NP', ['DT', 'the'], ['JJ', 'financial'], ['NN', 'hub']]
['NP', ['NNP', 'Mumbai']]
['NP', ['NN', 'communication'], ['NNS', 'lines']]
['NP', ['NNS', 'airports']]
['NP', ['NP', ['NNS', 'thousands']], ['PP', ['IN', 'of'], ['NP', ['NNS', 'people']]]]
['NP', ['NNS', 'thousands']]
['NP', ['NNS', 'people']]
['NP', ['PRP$', 'their'], ['NNS', 'offices']]
['NP', ['NN', 'home']]
['NP', ['DT', 'the'], ['NN', 'night']]
['NP', ['NNS', 'officials']]

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM