Getting NLTK tree leaf values as a string
I am trying to get the leaf values of a Tree object as a string. The Tree object here is the output of the Stanford parser.
Here is my code:
from nltk.parse import stanford
Parser = stanford.StanfordParser("path")
example = "Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back"
sentences = Parser.raw_parse(example)
for line in sentences:
    for sentence in line:
        tree = sentence
This is how I extract the leaves of the VPs (verb phrases):
VP = []
VP_tree = list(tree.subtrees(filter=lambda x: x.label()=='VP'))
for i in VP_tree:
    VP.append(' '.join(i.flatten()))
This is what i.flatten() looks like (it returns the parsed words as a list):
(VP
  constructed
  logistic
  regression
  ,
  calibrated
  the
  low
  defaults
  portfolio
  to
  benchmark
  ratings)
Since I can only get them as a list of parsed words, I join them with ' '. As a result, there is a space between 'regression' and ',':
In [33]: VP
Out [33]: [u'constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings']
I would like to get the verb phrase as a string (not as a list of parsed words), without having to join it with ' '.
I have looked at the methods under the Tree class ( http://www.nltk.org/_modules/nltk/tree.html ), but no luck so far.
In short:
Use the Tree.leaves() function to access the strings of the subtrees in a parsed sentence, i.e.:
VPs_str = [" ".join(vp.leaves()) for vp in list(parsed_sent.subtrees(filter=lambda x: x.label()=='VP'))]
There is no right way to access the true VP strings as they appear in the input, because the Stanford parser tokenizes the text before the parsing process, and the NLTK API does not retain the string offsets =(
In long:
This long answer is provided so that other NLTK users can get to the Tree objects using the NLTK API to the Stanford Parser, which may not be as trivial as shown in the question =)
First, set up the environment variables for NLTK to access the Stanford tools.
TL;DR:
$ cd
$ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-12-09.zip
$ unzip stanford-parser-full-2015-12-09.zip
$ export STANFORDTOOLSDIR=$HOME
$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar
Then apply the hack for the Stanford Parser compiled on 2015-12-09 (this hack will become obsolete with the bleeding-edge version at https://github.com/nltk/nltk/pull/1280/files):
>>> from nltk.internals import find_jars_within_path
>>> from nltk.parse.stanford import StanfordParser
>>> parser=StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> stanford_dir = parser._classpath[0].rpartition('/')[0]
>>> parser._classpath = tuple(find_jars_within_path(stanford_dir))
Now on to the phrase extraction.
First, we parse the sentence:
>>> sent = "Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back"
>>> parsed_sent = list(parser.raw_parse(sent))[0]
>>> parsed_sent
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('JJ', ['Selected']), Tree('NNS', ['variables'])]), Tree('PP', [Tree('IN', ['by']), Tree('NP', [Tree('JJ', ['univariate/multivariate']), Tree('NN', ['analysis'])])]), Tree(',', [',']), Tree('VP', [Tree('VBN', ['constructed']), Tree('NP', [Tree('NP', [Tree('JJ', ['logistic']), Tree('NN', ['regression'])]), Tree(',', [',']), Tree('ADJP', [Tree('VBN', ['calibrated']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['low']), Tree('NNS', ['defaults']), Tree('NN', ['portfolio'])]), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('JJ', ['benchmark']), Tree('NNS', ['ratings'])])])])])])]), Tree(',', [','])]), Tree('VP', [Tree('VBD', ['performed']), Tree('ADVP', [Tree('RB', ['back'])])])])])
Then we traverse the tree and check for VPs:
>>> VPs = list(parsed_sent.subtrees(filter=lambda x: x.label()=='VP'))
Then, we simply use the subtree leaves to get the VPs:
>>> for vp in VPs:
... print " ".join(vp.leaves())
...
constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings
performed back
Hence, to get the VP strings:
>>> VPs_str = [" ".join(vp.leaves()) for vp in list(parsed_sent.subtrees(filter=lambda x: x.label()=='VP'))]
>>> VPs_str
[u'constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings', u'performed back']
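If the only thing bothering you about the joined strings is the stray space before punctuation, a small heuristic clean-up can get close to readable text. This is a sketch, not part of the Stanford/NLTK API; the helper name join_leaves is made up here, and the regex only handles common trailing punctuation:

```python
import re

def join_leaves(leaves):
    """Join a list of parsed tokens into a string, then remove the
    extra space that ends up before punctuation marks (, . ! ? ; :)."""
    text = " ".join(leaves)
    # Collapse "word ," into "word," -- a heuristic, not a real detokenizer.
    return re.sub(r"\s+([,.!?;:])", r"\1", text)

leaves = ["constructed", "logistic", "regression", ",", "calibrated",
          "the", "low", "defaults", "portfolio", "to", "benchmark", "ratings"]
print(join_leaves(leaves))
# -> constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings
```

Note that this is a cosmetic fix: it produces plausible text, not the guaranteed original input, since the original offsets are already lost at this point.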
Alternatively, I personally prefer to use a chunker instead of a full parser to extract phrases.
Using the nltk_cli tool ( https://github.com/alvations/nltk_cli ):
alvas@ubi:~/git/nltk_cli$ echo "Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back" > input-doneyo.txt
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk VP input-doneyo.txt
calibrated|to benchmark|performed
alvas@ubi:~/git/nltk_cli$ python senna.py --vp input-doneyo.txt
calibrated|to benchmark|performed
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+NP input-doneyo.txt
calibrated the low defaults portfolio|to benchmark ratings
The VP chunk outputs are separated by |, i.e. the output:
calibrated|to benchmark|performed
represents three chunks: calibrated, to benchmark, and performed.
The VP+NP chunk outputs are also separated by |, and the VP and NP within each chunk are separated by \t, i.e. the output:
calibrated the low defaults portfolio|to benchmark ratings
represents these (VP + NP) pairs: calibrated + the low defaults portfolio, and to benchmark + ratings.
To retrieve the string with respect to the input positions, you should consider using https://github.com/smilli/py-corenlp instead of the NLTK API to the Stanford tools.
First, you have to download, install and set up the Stanford CoreNLP server, see http://stanfordnlp.github.io/CoreNLP/corenlp-server.html#getting-started
Then install the Python wrapper to CoreNLP, https://github.com/smilli/py-corenlp
Then, after starting the server (many people miss this step!), in Python you can do the following:
>>> from pycorenlp import StanfordCoreNLP
>>> stanford = StanfordCoreNLP('http://localhost:9000')
>>> text = ("Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back")
>>> output = stanford.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,depparse,parse', 'outputFormat': 'json'})
>>> print(output['sentences'][0]['parse'])
(ROOT
  (SINV
    (VP (VBN Selected)
      (NP (NNS variables))
      (PP (IN by)
        (NP
          (NP (JJ univariate/multivariate) (NN analysis))
          (, ,)
          (VP (VBN constructed)
            (NP (JJ logistic) (NN regression)))
          (, ,))))
    (VP (VBD calibrated))
    (NP
      (NP
        (NP (DT the) (JJ low) (NNS defaults) (NN portfolio))
        (PP (TO to)
          (NP (JJ benchmark) (NNS ratings))))
      (, ,)
      (VP (VBN performed)
        (ADVP (RB back))))))
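Since the server returns the parse as a bracketed string, you can still get Tree objects out of it by feeding it back into NLTK's Tree.fromstring and reusing the subtree extraction shown earlier. A minimal sketch with a toy parse string (the sentence below is made up for illustration, not taken from the output above):

```python
from nltk.tree import Tree

# A shortened parse string of the kind returned in output['sentences'][0]['parse'].
parse_str = "(ROOT (S (NP (PRP I)) (VP (VBD ate) (NP (DT an) (NN apple)))))"

tree = Tree.fromstring(parse_str)
vps = [" ".join(vp.leaves())
       for vp in tree.subtrees(filter=lambda x: x.label() == "VP")]
print(vps)  # -> ['ate an apple']
```

This only recreates the token-joined strings, though; the round trip through a string does not bring the character offsets back.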
To retrieve the VP strings with respect to the input string, you have to traverse the JSON output using the characterOffsetBegin and characterOffsetEnd values:
>>> output['sentences'][0]
{u'tokens': [{u'index': 1, u'word': u'Selected', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 8, u'characterOffsetBegin': 0, u'originalText': u'Selected', u'before': u''}, {u'index': 2, u'word': u'variables', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 18, u'characterOffsetBegin': 9, u'originalText': u'variables', u'before': u' '}, {u'index': 3, u'word': u'by', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 21, u'characterOffsetBegin': 19, u'originalText': u'by', u'before': u' '}, {u'index': 4, u'word': u'univariate/multivariate', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 45, u'characterOffsetBegin': 22, u'originalText': u'univariate/multivariate', u'before': u' '}, {u'index': 5, u'word': u'analysis', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 54, u'characterOffsetBegin': 46, u'originalText': u'analysis', u'before': u' '}, {u'index': 6, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 55, u'characterOffsetBegin': 54, u'originalText': u',', u'before': u''}, {u'index': 7, u'word': u'constructed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 67, u'characterOffsetBegin': 56, u'originalText': u'constructed', u'before': u' '}, {u'index': 8, u'word': u'logistic', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 76, u'characterOffsetBegin': 68, u'originalText': u'logistic', u'before': u' '}, {u'index': 9, u'word': u'regression', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 87, u'characterOffsetBegin': 77, u'originalText': u'regression', u'before': u' '}, {u'index': 10, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 88, u'characterOffsetBegin': 87, u'originalText': u',', u'before': u''}, {u'index': 11, u'word': u'calibrated', u'after': u' ', u'pos': u'VBD', u'characterOffsetEnd': 99, u'characterOffsetBegin': 89, u'originalText': u'calibrated', u'before': u' '}, {u'index': 12, u'word': u'the', u'after': u' ', u'pos': u'DT', u'characterOffsetEnd': 103, 
u'characterOffsetBegin': 100, u'originalText': u'the', u'before': u' '}, {u'index': 13, u'word': u'low', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 107, u'characterOffsetBegin': 104, u'originalText': u'low', u'before': u' '}, {u'index': 14, u'word': u'defaults', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 116, u'characterOffsetBegin': 108, u'originalText': u'defaults', u'before': u' '}, {u'index': 15, u'word': u'portfolio', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 126, u'characterOffsetBegin': 117, u'originalText': u'portfolio', u'before': u' '}, {u'index': 16, u'word': u'to', u'after': u' ', u'pos': u'TO', u'characterOffsetEnd': 129, u'characterOffsetBegin': 127, u'originalText': u'to', u'before': u' '}, {u'index': 17, u'word': u'benchmark', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 139, u'characterOffsetBegin': 130, u'originalText': u'benchmark', u'before': u' '}, {u'index': 18, u'word': u'ratings', u'after': u'', u'pos': u'NNS', u'characterOffsetEnd': 147, u'characterOffsetBegin': 140, u'originalText': u'ratings', u'before': u' '}, {u'index': 19, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 148, u'characterOffsetBegin': 147, u'originalText': u',', u'before': u''}, {u'index': 20, u'word': u'performed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 158, u'characterOffsetBegin': 149, u'originalText': u'performed', u'before': u' '}, {u'index': 21, u'word': u'back', u'after': u'', u'pos': u'RB', u'characterOffsetEnd': 163, u'characterOffsetBegin': 159, u'originalText': u'back', u'before': u' '}], u'index': 0, u'basic-dependencies': [{u'dep': u'ROOT', u'dependent': 1, u'governorGloss': u'ROOT', u'governor': 0, u'dependentGloss': u'Selected'}, {u'dep': u'dobj', u'dependent': 2, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'variables'}, {u'dep': u'case', u'dependent': 3, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'by'}, {u'dep': u'amod', 
u'dependent': 4, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'univariate/multivariate'}, {u'dep': u'nmod', u'dependent': 5, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'analysis'}, {u'dep': u'punct', u'dependent': 6, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 7, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'constructed'}, {u'dep': u'amod', u'dependent': 8, u'governorGloss': u'regression', u'governor': 9, u'dependentGloss': u'logistic'}, {u'dep': u'dobj', u'dependent': 9, u'governorGloss': u'constructed', u'governor': 7, u'dependentGloss': u'regression'}, {u'dep': u'punct', u'dependent': 10, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'dep', u'dependent': 11, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'calibrated'}, {u'dep': u'det', u'dependent': 12, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'the'}, {u'dep': u'amod', u'dependent': 13, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'low'}, {u'dep': u'compound', u'dependent': 14, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'defaults'}, {u'dep': u'nsubj', u'dependent': 15, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'portfolio'}, {u'dep': u'case', u'dependent': 16, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'to'}, {u'dep': u'amod', u'dependent': 17, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'benchmark'}, {u'dep': u'nmod', u'dependent': 18, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'ratings'}, {u'dep': u'punct', u'dependent': 19, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 20, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'performed'}, {u'dep': u'advmod', u'dependent': 21, 
u'governorGloss': u'performed', u'governor': 20, u'dependentGloss': u'back'}], u'parse': u'(ROOT\n (SINV\n (VP (VBN Selected)\n (NP (NNS variables))\n (PP (IN by)\n (NP\n (NP (JJ univariate/multivariate) (NN analysis))\n (, ,)\n (VP (VBN constructed)\n (NP (JJ logistic) (NN regression)))\n (, ,))))\n (VP (VBD calibrated))\n (NP\n (NP\n (NP (DT the) (JJ low) (NNS defaults) (NN portfolio))\n (PP (TO to)\n (NP (JJ benchmark) (NNS ratings))))\n (, ,)\n (VP (VBN performed)\n (ADVP (RB back))))))', u'collapsed-dependencies': [{u'dep': u'ROOT', u'dependent': 1, u'governorGloss': u'ROOT', u'governor': 0, u'dependentGloss': u'Selected'}, {u'dep': u'dobj', u'dependent': 2, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'variables'}, {u'dep': u'case', u'dependent': 3, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'by'}, {u'dep': u'amod', u'dependent': 4, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'univariate/multivariate'}, {u'dep': u'nmod:by', u'dependent': 5, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'analysis'}, {u'dep': u'punct', u'dependent': 6, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 7, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'constructed'}, {u'dep': u'amod', u'dependent': 8, u'governorGloss': u'regression', u'governor': 9, u'dependentGloss': u'logistic'}, {u'dep': u'dobj', u'dependent': 9, u'governorGloss': u'constructed', u'governor': 7, u'dependentGloss': u'regression'}, {u'dep': u'punct', u'dependent': 10, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'dep', u'dependent': 11, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'calibrated'}, {u'dep': u'det', u'dependent': 12, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'the'}, {u'dep': u'amod', u'dependent': 13, u'governorGloss': u'portfolio', u'governor': 15, 
u'dependentGloss': u'low'}, {u'dep': u'compound', u'dependent': 14, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'defaults'}, {u'dep': u'nsubj', u'dependent': 15, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'portfolio'}, {u'dep': u'case', u'dependent': 16, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'to'}, {u'dep': u'amod', u'dependent': 17, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'benchmark'}, {u'dep': u'nmod:to', u'dependent': 18, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'ratings'}, {u'dep': u'punct', u'dependent': 19, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 20, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'performed'}, {u'dep': u'advmod', u'dependent': 21, u'governorGloss': u'performed', u'governor': 20, u'dependentGloss': u'back'}], u'collapsed-ccprocessed-dependencies': [{u'dep': u'ROOT', u'dependent': 1, u'governorGloss': u'ROOT', u'governor': 0, u'dependentGloss': u'Selected'}, {u'dep': u'dobj', u'dependent': 2, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'variables'}, {u'dep': u'case', u'dependent': 3, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'by'}, {u'dep': u'amod', u'dependent': 4, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'univariate/multivariate'}, {u'dep': u'nmod:by', u'dependent': 5, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'analysis'}, {u'dep': u'punct', u'dependent': 6, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 7, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'constructed'}, {u'dep': u'amod', u'dependent': 8, u'governorGloss': u'regression', u'governor': 9, u'dependentGloss': u'logistic'}, {u'dep': u'dobj', u'dependent': 9, u'governorGloss': u'constructed', 
u'governor': 7, u'dependentGloss': u'regression'}, {u'dep': u'punct', u'dependent': 10, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'dep', u'dependent': 11, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'calibrated'}, {u'dep': u'det', u'dependent': 12, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'the'}, {u'dep': u'amod', u'dependent': 13, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'low'}, {u'dep': u'compound', u'dependent': 14, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'defaults'}, {u'dep': u'nsubj', u'dependent': 15, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'portfolio'}, {u'dep': u'case', u'dependent': 16, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'to'}, {u'dep': u'amod', u'dependent': 17, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'benchmark'}, {u'dep': u'nmod:to', u'dependent': 18, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'ratings'}, {u'dep': u'punct', u'dependent': 19, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 20, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'performed'}, {u'dep': u'advmod', u'dependent': 21, u'governorGloss': u'performed', u'governor': 20, u'dependentGloss': u'back'}]}
But this is not a simple output to parse for the character offsets, since the parse tree is not directly linked to the offsets. Only the dependency triples contain the word IDs that link back to the offsets.
To access the tokens in output['sentences'][0]['tokens'], along with the 'after' and 'before' keys (but, sadly, with no direct link to the parse tree):
>>> tokens = output['sentences'][0]['tokens']
>>> tokens
[{u'index': 1, u'word': u'Selected', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 8, u'characterOffsetBegin': 0, u'originalText': u'Selected', u'before': u''}, {u'index': 2, u'word': u'variables', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 18, u'characterOffsetBegin': 9, u'originalText': u'variables', u'before': u' '}, {u'index': 3, u'word': u'by', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 21, u'characterOffsetBegin': 19, u'originalText': u'by', u'before': u' '}, {u'index': 4, u'word': u'univariate/multivariate', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 45, u'characterOffsetBegin': 22, u'originalText': u'univariate/multivariate', u'before': u' '}, {u'index': 5, u'word': u'analysis', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 54, u'characterOffsetBegin': 46, u'originalText': u'analysis', u'before': u' '}, {u'index': 6, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 55, u'characterOffsetBegin': 54, u'originalText': u',', u'before': u''}, {u'index': 7, u'word': u'constructed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 67, u'characterOffsetBegin': 56, u'originalText': u'constructed', u'before': u' '}, {u'index': 8, u'word': u'logistic', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 76, u'characterOffsetBegin': 68, u'originalText': u'logistic', u'before': u' '}, {u'index': 9, u'word': u'regression', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 87, u'characterOffsetBegin': 77, u'originalText': u'regression', u'before': u' '}, {u'index': 10, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 88, u'characterOffsetBegin': 87, u'originalText': u',', u'before': u''}, {u'index': 11, u'word': u'calibrated', u'after': u' ', u'pos': u'VBD', u'characterOffsetEnd': 99, u'characterOffsetBegin': 89, u'originalText': u'calibrated', u'before': u' '}, {u'index': 12, u'word': u'the', u'after': u' ', u'pos': u'DT', u'characterOffsetEnd': 103, u'characterOffsetBegin': 
100, u'originalText': u'the', u'before': u' '}, {u'index': 13, u'word': u'low', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 107, u'characterOffsetBegin': 104, u'originalText': u'low', u'before': u' '}, {u'index': 14, u'word': u'defaults', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 116, u'characterOffsetBegin': 108, u'originalText': u'defaults', u'before': u' '}, {u'index': 15, u'word': u'portfolio', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 126, u'characterOffsetBegin': 117, u'originalText': u'portfolio', u'before': u' '}, {u'index': 16, u'word': u'to', u'after': u' ', u'pos': u'TO', u'characterOffsetEnd': 129, u'characterOffsetBegin': 127, u'originalText': u'to', u'before': u' '}, {u'index': 17, u'word': u'benchmark', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 139, u'characterOffsetBegin': 130, u'originalText': u'benchmark', u'before': u' '}, {u'index': 18, u'word': u'ratings', u'after': u'', u'pos': u'NNS', u'characterOffsetEnd': 147, u'characterOffsetBegin': 140, u'originalText': u'ratings', u'before': u' '}, {u'index': 19, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 148, u'characterOffsetBegin': 147, u'originalText': u',', u'before': u''}, {u'index': 20, u'word': u'performed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 158, u'characterOffsetBegin': 149, u'originalText': u'performed', u'before': u' '}, {u'index': 21, u'word': u'back', u'after': u'', u'pos': u'RB', u'characterOffsetEnd': 163, u'characterOffsetBegin': 159, u'originalText': u'back', u'before': u' '}]
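Given those token offsets, recovering the exact input substring for any token span is just a slice of the original text; the catch, as noted above, is mapping a parse-tree constituent to its token indices in the first place. A minimal sketch (the data below is a hand-shortened imitation of the token output above, and span_text is a made-up helper):

```python
# Original input text and a few tokens mimicking the CoreNLP JSON output.
text = "Selected variables by univariate/multivariate analysis"

tokens = [
    {"word": "Selected",  "characterOffsetBegin": 0,  "characterOffsetEnd": 8},
    {"word": "variables", "characterOffsetBegin": 9,  "characterOffsetEnd": 18},
    {"word": "by",        "characterOffsetBegin": 19, "characterOffsetEnd": 21},
]

def span_text(text, tokens, start_idx, end_idx):
    """Return the original substring covering tokens[start_idx:end_idx+1],
    preserving the exact whitespace from the input."""
    begin = tokens[start_idx]["characterOffsetBegin"]
    end = tokens[end_idx]["characterOffsetEnd"]
    return text[begin:end]

print(span_text(text, tokens, 0, 1))  # -> Selected variables
```

Because the slice comes straight from the input, punctuation spacing is preserved exactly, which is what joining leaves with ' ' cannot guarantee.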
Unrelated to NLTK or StanfordParser, one way to get human-readable text back is to "detokenize" the output with the detokenizer script from Moses SMT ( https://github.com/moses-smt/mosesdecoder ), e.g.:
alvas@ubi:~$ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/detokenizer.perl
--2016-02-13 21:27:12-- https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/detokenizer.perl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 23.235.43.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|23.235.43.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12473 (12K) [text/plain]
Saving to: ‘detokenizer.perl’
100%[===============================================================================================================================>] 12,473 --.-K/s in 0s
2016-02-13 21:27:12 (150 MB/s) - ‘detokenizer.perl’ saved [12473/12473]
alvas@ubi:~$ echo "constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings" | perl detokenizer.perl -l en 2> /tmp/null
constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings
Note that the output MIGHT be different from the input, but for English, in most cases, it is converted back to the normal text we read/write.
Having a detokenizer in NLTK is in the works, but it will take some time for us to code it, test it and push it to the repository; we seek your patience (see https://github.com/nltk/nltk/issues/1214 ).