Getting NLTK tree leaf values as a string

I am trying to get the leaf values in a Tree object as a string. The tree object here is the output of the Stanford parser.

Here is my code:

from nltk.parse import stanford

# "path" is a placeholder for the path to the Stanford parser jar
Parser = stanford.StanfordParser("path")

example = "Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back"
sentences = Parser.raw_parse(example)
for line in sentences:
    for sentence in line:
        tree = sentence

This is how I extract the leaves of the VPs (verb phrases):

VP = []

# collect every subtree labelled 'VP', then join its tokens with spaces
VP_tree = list(tree.subtrees(filter=lambda x: x.label()=='VP'))

for i in VP_tree:
    VP.append(' '.join(i.flatten()))

This is what i.flatten() looks like (it returns the parsed words as a list):

(VP
  constructed
  logistic
  regression
  ,
  calibrated
  the
  low
  defaults
  portfolio
  to
  benchmark
  ratings)

Since I could only get them as a list of parsed words, I joined them with ' '. As a result, there is a space between 'regression' and ','.

In [33]: VP
Out[33]: [u'constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings']

I would like to get the verb phrases as strings (not as lists of parsed words), without having to join them with ' '.

I have looked at the methods under the Tree class (http://www.nltk.org/_modules/nltk/tree.html), but no luck so far.

In short:

Use the Tree.leaves() function to access the strings of a parsed sentence tree, i.e.:

VPs_str = [" ".join(vp.leaves()) for vp in list(parsed_sent.subtrees(filter=lambda x: x.label()=='VP'))]

There is no proper way to access the true VP strings as they appeared in the input, because the Stanford parser tokenizes the text before parsing, and the NLTK API does not keep the offsets into the original string =(
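If the space before punctuation is the main annoyance, one pragmatic workaround is to join the leaves and then strip the spaces that tokenization inserted before punctuation. A minimal sketch (the detok helper and its regex are my own illustration, not part of NLTK):

import re

def detok(tokens):
    """Naively undo Penn Treebank-style tokenization: join with spaces,
    then drop the space before common punctuation marks."""
    text = ' '.join(tokens)
    return re.sub(r'\s+([,.;:!?%)\]])', r'\1', text)

# e.g. detok(vp.leaves()) -> u'constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings'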


In long:

This long answer is here so that other NLTK users can get to the Tree object using the NLTK API to the Stanford Parser, which may not be as trivial as shown in the question =)

First, set up the environment variables for NLTK to access the Stanford tools:

TL;DR:

$ cd
$ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-12-09.zip
$ unzip stanford-parser-full-2015-12-09.zip
$ export STANFORDTOOLSDIR=$HOME
$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar

Then apply this hack for the Stanford Parser compiled on 2015-12-09 (this hack will no longer be needed with the bleeding-edge version at https://github.com/nltk/nltk/pull/1280/files):

>>> from nltk.internals import find_jars_within_path
>>> from nltk.parse.stanford import StanfordParser
>>> parser=StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> stanford_dir = parser._classpath[0].rpartition('/')[0]
>>> parser._classpath = tuple(find_jars_within_path(stanford_dir))

Now on to the phrase extraction.

First, we parse the sentence:

>>> sent = "Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back"
>>> parsed_sent = list(parser.raw_parse(sent))[0]
>>> parsed_sent
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('JJ', ['Selected']), Tree('NNS', ['variables'])]), Tree('PP', [Tree('IN', ['by']), Tree('NP', [Tree('JJ', ['univariate/multivariate']), Tree('NN', ['analysis'])])]), Tree(',', [',']), Tree('VP', [Tree('VBN', ['constructed']), Tree('NP', [Tree('NP', [Tree('JJ', ['logistic']), Tree('NN', ['regression'])]), Tree(',', [',']), Tree('ADJP', [Tree('VBN', ['calibrated']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['low']), Tree('NNS', ['defaults']), Tree('NN', ['portfolio'])]), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('JJ', ['benchmark']), Tree('NNS', ['ratings'])])])])])])]), Tree(',', [','])]), Tree('VP', [Tree('VBD', ['performed']), Tree('ADVP', [Tree('RB', ['back'])])])])])

Then we traverse the tree and check for VPs:

>>> VPs = list(parsed_sent.subtrees(filter=lambda x: x.label()=='VP'))

Afterwards, we simply use the subtree leaves to get the VPs:

>>> for vp in VPs:
...     print " ".join(vp.leaves())
... 
constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings
performed back

Hence, to get the VP strings:

>>> VPs_str = [" ".join(vp.leaves()) for vp in list(parsed_sent.subtrees(filter=lambda x: x.label()=='VP'))]
>>> VPs_str
[u'constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings', u'performed back']

Alternatively, I personally prefer to use a chunker instead of a full parser for phrase extraction.

Using the nltk_cli tool (https://github.com/alvations/nltk_cli):

alvas@ubi:~/git/nltk_cli$ echo "Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back" > input-doneyo.txt
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk VP input-doneyo.txt
calibrated|to benchmark|performed
alvas@ubi:~/git/nltk_cli$ python senna.py --vp input-doneyo.txt 
calibrated|to benchmark|performed
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+NP input-doneyo.txt 
calibrated  the low defaults portfolio|to benchmark ratings

The VP chunks in the output are separated by |, i.e.

Output:

calibrated|to benchmark|performed

represents:

  • calibrated
  • to benchmark
  • performed

The VP+NP chunk outputs are also separated by |, and the VP and NP within each chunk are separated by \t, i.e.

Output:

calibrated  the low defaults portfolio|to benchmark ratings

represents (VP + NP):

  • calibrated + the low defaults portfolio
  • to benchmark + ratings
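To consume these two delimited formats in Python, splitting on | and \t is enough; a minimal sketch (the variable names are mine):

vp_line = "calibrated|to benchmark|performed"
vps = vp_line.split("|")
# ['calibrated', 'to benchmark', 'performed']

vp_np_line = "calibrated\tthe low defaults portfolio|to benchmark\tratings"
vp_np_pairs = [chunk.split("\t") for chunk in vp_np_line.split("|")]
# [['calibrated', 'the low defaults portfolio'], ['to benchmark', 'ratings']]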

To retrieve the strings based on their positions in the input, you should consider using https://github.com/smilli/py-corenlp instead of the NLTK API to the Stanford tools.

First, you have to download, install and set up Stanford CoreNLP, see http://stanfordnlp.github.io/CoreNLP/corenlp-server.html#getting-started

Then install the Python wrapper for CoreNLP, https://github.com/smilli/py-corenlp

Then, AFTER starting the server (many people miss this step!), you can do the following in Python:

>>> from pycorenlp import StanfordCoreNLP
>>> stanford = StanfordCoreNLP('http://localhost:9000')
>>> text = ("Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back")
>>> output = stanford.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,depparse,parse', 'outputFormat': 'json'})
>>> print(output['sentences'][0]['parse'])
(ROOT
  (SINV
    (VP (VBN Selected)
      (NP (NNS variables))
      (PP (IN by)
        (NP
          (NP (JJ univariate/multivariate) (NN analysis))
          (, ,)
          (VP (VBN constructed)
            (NP (JJ logistic) (NN regression)))
          (, ,))))
    (VP (VBD calibrated))
    (NP
      (NP
        (NP (DT the) (JJ low) (NNS defaults) (NN portfolio))
        (PP (TO to)
          (NP (JJ benchmark) (NNS ratings))))
      (, ,)
      (VP (VBN performed)
        (ADVP (RB back))))))
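As an aside, the 'parse' string in the JSON output can be loaded back into an NLTK Tree, so the same subtree filtering shown earlier applies; a minimal sketch:

from nltk import Tree

parsed = Tree.fromstring(output['sentences'][0]['parse'])
vps = [' '.join(vp.leaves()) for vp in parsed.subtrees(lambda t: t.label() == 'VP')]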

To retrieve the VP strings as they appear in the input, you would have to traverse the JSON output using characterOffsetBegin and characterOffsetEnd:

>>> output['sentences'][0]
{u'tokens': [{u'index': 1, u'word': u'Selected', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 8, u'characterOffsetBegin': 0, u'originalText': u'Selected', u'before': u''}, {u'index': 2, u'word': u'variables', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 18, u'characterOffsetBegin': 9, u'originalText': u'variables', u'before': u' '}, {u'index': 3, u'word': u'by', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 21, u'characterOffsetBegin': 19, u'originalText': u'by', u'before': u' '}, {u'index': 4, u'word': u'univariate/multivariate', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 45, u'characterOffsetBegin': 22, u'originalText': u'univariate/multivariate', u'before': u' '}, {u'index': 5, u'word': u'analysis', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 54, u'characterOffsetBegin': 46, u'originalText': u'analysis', u'before': u' '}, {u'index': 6, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 55, u'characterOffsetBegin': 54, u'originalText': u',', u'before': u''}, {u'index': 7, u'word': u'constructed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 67, u'characterOffsetBegin': 56, u'originalText': u'constructed', u'before': u' '}, {u'index': 8, u'word': u'logistic', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 76, u'characterOffsetBegin': 68, u'originalText': u'logistic', u'before': u' '}, {u'index': 9, u'word': u'regression', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 87, u'characterOffsetBegin': 77, u'originalText': u'regression', u'before': u' '}, {u'index': 10, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 88, u'characterOffsetBegin': 87, u'originalText': u',', u'before': u''}, {u'index': 11, u'word': u'calibrated', u'after': u' ', u'pos': u'VBD', u'characterOffsetEnd': 99, u'characterOffsetBegin': 89, u'originalText': u'calibrated', u'before': u' '}, {u'index': 12, u'word': u'the', u'after': u' ', u'pos': u'DT', u'characterOffsetEnd': 103, u'characterOffsetBegin': 100, u'originalText': u'the', u'before': u' '}, {u'index': 13, u'word': u'low', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 107, u'characterOffsetBegin': 104, u'originalText': u'low', u'before': u' '}, {u'index': 14, u'word': u'defaults', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 116, u'characterOffsetBegin': 108, u'originalText': u'defaults', u'before': u' '}, {u'index': 15, u'word': u'portfolio', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 126, u'characterOffsetBegin': 117, u'originalText': u'portfolio', u'before': u' '}, {u'index': 16, u'word': u'to', u'after': u' ', u'pos': u'TO', u'characterOffsetEnd': 129, u'characterOffsetBegin': 127, u'originalText': u'to', u'before': u' '}, {u'index': 17, u'word': u'benchmark', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 139, u'characterOffsetBegin': 130, u'originalText': u'benchmark', u'before': u' '}, {u'index': 18, u'word': u'ratings', u'after': u'', u'pos': u'NNS', u'characterOffsetEnd': 147, u'characterOffsetBegin': 140, u'originalText': u'ratings', u'before': u' '}, {u'index': 19, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 148, u'characterOffsetBegin': 147, u'originalText': u',', u'before': u''}, {u'index': 20, u'word': u'performed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 158, u'characterOffsetBegin': 149, u'originalText': u'performed', u'before': u' '}, {u'index': 21, u'word': u'back', u'after': u'', u'pos': u'RB', u'characterOffsetEnd': 163, u'characterOffsetBegin': 159, 
u'originalText': u'back', u'before': u' '}], u'index': 0, u'basic-dependencies': [{u'dep': u'ROOT', u'dependent': 1, u'governorGloss': u'ROOT', u'governor': 0, u'dependentGloss': u'Selected'}, {u'dep': u'dobj', u'dependent': 2, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'variables'}, {u'dep': u'case', u'dependent': 3, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'by'}, {u'dep': u'amod', u'dependent': 4, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'univariate/multivariate'}, {u'dep': u'nmod', u'dependent': 5, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'analysis'}, {u'dep': u'punct', u'dependent': 6, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 7, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'constructed'}, {u'dep': u'amod', u'dependent': 8, u'governorGloss': u'regression', u'governor': 9, u'dependentGloss': u'logistic'}, {u'dep': u'dobj', u'dependent': 9, u'governorGloss': u'constructed', u'governor': 7, u'dependentGloss': u'regression'}, {u'dep': u'punct', u'dependent': 10, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'dep', u'dependent': 11, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'calibrated'}, {u'dep': u'det', u'dependent': 12, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'the'}, {u'dep': u'amod', u'dependent': 13, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'low'}, {u'dep': u'compound', u'dependent': 14, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'defaults'}, {u'dep': u'nsubj', u'dependent': 15, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'portfolio'}, {u'dep': u'case', u'dependent': 16, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'to'}, {u'dep': u'amod', u'dependent': 17, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'benchmark'}, {u'dep': u'nmod', u'dependent': 18, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'ratings'}, {u'dep': u'punct', u'dependent': 19, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 20, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'performed'}, {u'dep': u'advmod', u'dependent': 21, u'governorGloss': u'performed', u'governor': 20, u'dependentGloss': u'back'}], u'parse': u'(ROOT\n  (SINV\n    (VP (VBN Selected)\n      (NP (NNS variables))\n      (PP (IN by)\n        (NP\n          (NP (JJ univariate/multivariate) (NN analysis))\n          (, ,)\n          (VP (VBN constructed)\n            (NP (JJ logistic) (NN regression)))\n          (, ,))))\n    (VP (VBD calibrated))\n    (NP\n      (NP\n        (NP (DT the) (JJ low) (NNS defaults) (NN portfolio))\n        (PP (TO to)\n          (NP (JJ benchmark) (NNS ratings))))\n      (, ,)\n      (VP (VBN performed)\n        (ADVP (RB back))))))', u'collapsed-dependencies': [{u'dep': u'ROOT', u'dependent': 1, u'governorGloss': u'ROOT', u'governor': 0, u'dependentGloss': u'Selected'}, {u'dep': u'dobj', u'dependent': 2, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'variables'}, {u'dep': u'case', u'dependent': 3, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'by'}, {u'dep': u'amod', u'dependent': 4, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'univariate/multivariate'}, {u'dep': 
u'nmod:by', u'dependent': 5, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'analysis'}, {u'dep': u'punct', u'dependent': 6, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 7, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'constructed'}, {u'dep': u'amod', u'dependent': 8, u'governorGloss': u'regression', u'governor': 9, u'dependentGloss': u'logistic'}, {u'dep': u'dobj', u'dependent': 9, u'governorGloss': u'constructed', u'governor': 7, u'dependentGloss': u'regression'}, {u'dep': u'punct', u'dependent': 10, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'dep', u'dependent': 11, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'calibrated'}, {u'dep': u'det', u'dependent': 12, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'the'}, {u'dep': u'amod', u'dependent': 13, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'low'}, {u'dep': u'compound', u'dependent': 14, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'defaults'}, {u'dep': u'nsubj', u'dependent': 15, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'portfolio'}, {u'dep': u'case', u'dependent': 16, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'to'}, {u'dep': u'amod', u'dependent': 17, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'benchmark'}, {u'dep': u'nmod:to', u'dependent': 18, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'ratings'}, {u'dep': u'punct', u'dependent': 19, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 20, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'performed'}, {u'dep': u'advmod', u'dependent': 21, u'governorGloss': u'performed', u'governor': 20, u'dependentGloss': u'back'}], u'collapsed-ccprocessed-dependencies': [{u'dep': u'ROOT', u'dependent': 1, u'governorGloss': u'ROOT', u'governor': 0, u'dependentGloss': u'Selected'}, {u'dep': u'dobj', u'dependent': 2, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'variables'}, {u'dep': u'case', u'dependent': 3, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'by'}, {u'dep': u'amod', u'dependent': 4, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'univariate/multivariate'}, {u'dep': u'nmod:by', u'dependent': 5, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'analysis'}, {u'dep': u'punct', u'dependent': 6, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 7, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'constructed'}, {u'dep': u'amod', u'dependent': 8, u'governorGloss': u'regression', u'governor': 9, u'dependentGloss': u'logistic'}, {u'dep': u'dobj', u'dependent': 9, u'governorGloss': u'constructed', u'governor': 7, u'dependentGloss': u'regression'}, {u'dep': u'punct', u'dependent': 10, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'dep', u'dependent': 11, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'calibrated'}, {u'dep': u'det', u'dependent': 12, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'the'}, {u'dep': u'amod', u'dependent': 13, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'low'}, {u'dep': u'compound', u'dependent': 14, u'governorGloss': 
u'portfolio', u'governor': 15, u'dependentGloss': u'defaults'}, {u'dep': u'nsubj', u'dependent': 15, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'portfolio'}, {u'dep': u'case', u'dependent': 16, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'to'}, {u'dep': u'amod', u'dependent': 17, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'benchmark'}, {u'dep': u'nmod:to', u'dependent': 18, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'ratings'}, {u'dep': u'punct', u'dependent': 19, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 20, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'performed'}, {u'dep': u'advmod', u'dependent': 21, u'governorGloss': u'performed', u'governor': 20, u'dependentGloss': u'back'}]}

But that is not a simple output to mine for character offsets, because the parse tree is not directly linked to the offsets; only the dependency triples contain word IDs, which in turn are linked to the offsets.
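Given a set of token indices (e.g. collected from the dependency triples above), the tokens list does let you recover the exact substring of the input; a minimal sketch (the helper name and the ids argument are my own illustration):

def span_for_ids(sentence_json, ids):
    """Map CoreNLP's 1-based token indices to a character span in the input text."""
    toks = [t for t in sentence_json['tokens'] if t['index'] in ids]
    begin = min(t['characterOffsetBegin'] for t in toks)
    end = max(t['characterOffsetEnd'] for t in toks)
    return begin, end

begin, end = span_for_ids(output['sentences'][0], {7, 8, 9})
print(text[begin:end])  # 'constructed logistic regression'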


You can access the tokens in output['sentences'][0]['tokens'], together with their 'after' and 'before' keys (but, sadly, still without a direct link to the parse tree):

>>> tokens = output['sentences'][0]['tokens']
>>> tokens
[{u'index': 1, u'word': u'Selected', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 8, u'characterOffsetBegin': 0, u'originalText': u'Selected', u'before': u''}, {u'index': 2, u'word': u'variables', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 18, u'characterOffsetBegin': 9, u'originalText': u'variables', u'before': u' '}, {u'index': 3, u'word': u'by', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 21, u'characterOffsetBegin': 19, u'originalText': u'by', u'before': u' '}, {u'index': 4, u'word': u'univariate/multivariate', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 45, u'characterOffsetBegin': 22, u'originalText': u'univariate/multivariate', u'before': u' '}, {u'index': 5, u'word': u'analysis', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 54, u'characterOffsetBegin': 46, u'originalText': u'analysis', u'before': u' '}, {u'index': 6, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 55, u'characterOffsetBegin': 54, u'originalText': u',', u'before': u''}, {u'index': 7, u'word': u'constructed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 67, u'characterOffsetBegin': 56, u'originalText': u'constructed', u'before': u' '}, {u'index': 8, u'word': u'logistic', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 76, u'characterOffsetBegin': 68, u'originalText': u'logistic', u'before': u' '}, {u'index': 9, u'word': u'regression', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 87, u'characterOffsetBegin': 77, u'originalText': u'regression', u'before': u' '}, {u'index': 10, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 88, u'characterOffsetBegin': 87, u'originalText': u',', u'before': u''}, {u'index': 11, u'word': u'calibrated', u'after': u' ', u'pos': u'VBD', u'characterOffsetEnd': 99, u'characterOffsetBegin': 89, u'originalText': u'calibrated', u'before': u' '}, {u'index': 12, u'word': u'the', u'after': u' ', u'pos': u'DT', u'characterOffsetEnd': 103, u'characterOffsetBegin': 100, u'originalText': u'the', u'before': u' '}, {u'index': 13, u'word': u'low', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 107, u'characterOffsetBegin': 104, u'originalText': u'low', u'before': u' '}, {u'index': 14, u'word': u'defaults', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 116, u'characterOffsetBegin': 108, u'originalText': u'defaults', u'before': u' '}, {u'index': 15, u'word': u'portfolio', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 126, u'characterOffsetBegin': 117, u'originalText': u'portfolio', u'before': u' '}, {u'index': 16, u'word': u'to', u'after': u' ', u'pos': u'TO', u'characterOffsetEnd': 129, u'characterOffsetBegin': 127, u'originalText': u'to', u'before': u' '}, {u'index': 17, u'word': u'benchmark', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 139, u'characterOffsetBegin': 130, u'originalText': u'benchmark', u'before': u' '}, {u'index': 18, u'word': u'ratings', u'after': u'', u'pos': u'NNS', u'characterOffsetEnd': 147, u'characterOffsetBegin': 140, u'originalText': u'ratings', u'before': u' '}, {u'index': 19, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 148, u'characterOffsetBegin': 147, u'originalText': u',', u'before': u''}, {u'index': 20, u'word': u'performed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 158, u'characterOffsetBegin': 149, u'originalText': u'performed', u'before': u' '}, {u'index': 21, u'word': u'back', u'after': u'', u'pos': u'RB', u'characterOffsetEnd': 163, u'characterOffsetBegin': 159, u'originalText': u'back', 
u'before': u' '}]
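Since every token carries its originalText plus the whitespace that followed it ('after'), the input string can be rebuilt losslessly from this list; a minimal sketch:

tokens = output['sentences'][0]['tokens']
reconstructed = ''.join(t['originalText'] + t['after'] for t in tokens)
# reconstructed should equal the original input sentence held in `text`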

Unrelated to NLTK and the StanfordParser, another way to get normal-reading text back is to "detokenize" the output with scripts from Moses SMT (https://github.com/moses-smt/mosesdecoder), e.g.:

alvas@ubi:~$ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/detokenizer.perl
--2016-02-13 21:27:12--  https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/detokenizer.perl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 23.235.43.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|23.235.43.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12473 (12K) [text/plain]
Saving to: ‘detokenizer.perl’

100%[===============================================================================================================================>] 12,473      --.-K/s   in 0s      

2016-02-13 21:27:12 (150 MB/s) - ‘detokenizer.perl’ saved [12473/12473]

alvas@ubi:~$ echo "constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings" | perl detokenizer.perl 2> /tmp/null
constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings

Note that the output MIGHT be different from the input, but for English it is, in most cases, converted back to the normal text we read/write.

Having a detokenizer within NLTK is in the works; we just need some time to code it, test it and push it to the repository, so please be patient (see https://github.com/nltk/nltk/issues/1214).
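That work has since landed: newer NLTK releases ship a Treebank detokenizer. If your NLTK version includes it, a minimal usage sketch:

from nltk.tokenize.treebank import TreebankWordDetokenizer

vp_tokens = ['constructed', 'logistic', 'regression', ',', 'calibrated',
             'the', 'low', 'defaults', 'portfolio', 'to', 'benchmark', 'ratings']
print(TreebankWordDetokenizer().detokenize(vp_tokens))
# 'constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings'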
