简体   繁体   中英

Getting NLTK tree leaf values as a string

I'm trying to get leaf values in the Tree object as a string. The tree object here is the output of the Stanford Parser.

Here is my code:

from nltk.parse import stanford
Parser = stanford.StanfordParser("path")


example = "Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back"
sentences = Parser.raw_parse(example)
for line in sentences:
    for sentence in line:
        tree = sentence

And this is how I extract the VP (Verb Phrases) leaves.

VP=[]

VP_tree = list(tree.subtrees(filter=lambda x: x.label()=='VP'))

for i in VP_tree:
    VP.append(' '.join(i.flatten()))

Here is what i.flatten() looks like: (it returns parsed word list)

(VP
  constructed
  logistic
  regression
  ,
  calibrated
  the
  low
  defaults
  portfolio
  to
  benchmark
  ratings)

Becuase I could only get them as a list of parsed words, I joined them with ' '. Therefore there is a space between 'regression' and ','.

In [33]: VP
Out [33]: [u'constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings']

I'd like to get the Verb Phrase as a string (not as a list of parsed words) without having to join them by ' '.

I have looked at methods under Tree class ( http://www.nltk.org/_modules/nltk/tree.html ) however got no luck so far.

In Short:

Use the Tree.leaves() function to access the strings of the subtrees within the parsed sentence, ie:

VPs_str = [" ".join(vp.leaves()) for vp in list(parsed_sent.subtrees(filter=lambda x: x.label()=='VP'))]

There's no proper way to access the true VP strings as they were in the input because Stanford parser tokenized the text before the parsing process and the offset of the strings were not kept by the NLTK API =(


In Long:

This long answer is such that other NLTK users can get to the point of accessing the Tree object using NLTK API to the Stanford Parser, it might not be as trivial as shown in the question =)

First setup the environmental variables for NLTK to access the Stanford tools, see:

TL;DR :

$ cd
$ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-12-09.zip
$ unzip stanford-parser-full-2015-12-09.zip
$ export STANFORDTOOLSDIR=$HOME
$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar

Apply the hack for Stanford Parser compiled on 2015-12-09 (this hack will become obsolete in the bleeding edge version with https://github.com/nltk/nltk/pull/1280/files ):

>>> from nltk.internals import find_jars_within_path
>>> from nltk.parse.stanford import StanfordParser
>>> parser=StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> stanford_dir = parser._classpath[0].rpartition('/')[0]
>>> parser._classpath = tuple(find_jars_within_path(stanford_dir))

Now to the phrase extraction.

First, we parse the sentence:

>>> sent = "Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back"
>>> parsed_sent = list(parser.raw_parse(sent))[0]
>>> parsed_sent
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('JJ', ['Selected']), Tree('NNS', ['variables'])]), Tree('PP', [Tree('IN', ['by']), Tree('NP', [Tree('JJ', ['univariate/multivariate']), Tree('NN', ['analysis'])])]), Tree(',', [',']), Tree('VP', [Tree('VBN', ['constructed']), Tree('NP', [Tree('NP', [Tree('JJ', ['logistic']), Tree('NN', ['regression'])]), Tree(',', [',']), Tree('ADJP', [Tree('VBN', ['calibrated']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['low']), Tree('NNS', ['defaults']), Tree('NN', ['portfolio'])]), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('JJ', ['benchmark']), Tree('NNS', ['ratings'])])])])])])]), Tree(',', [','])]), Tree('VP', [Tree('VBD', ['performed']), Tree('ADVP', [Tree('RB', ['back'])])])])])

Then we traverse the tree and check for VP as you have done with:

>>> VP_tree = list(tree.subtrees(filter=lambda x: x.label()=='VP'))

Aftewards, we simply use the subtree leaves to get the VPs

>>> for vp in VPs:
...     print " ".join(vp.leaves())
... 
constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings
performed back

So to get the VP strings:

>>> VPs_str = [" ".join(vp.leaves()) for vp in list(parsed_sent.subtrees(filter=lambda x: x.label()=='VP'))]
>>> VPs_str
[u'constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings', u'performed back']

Alternatively, I, personally, like to use a chunker instead of a full blown parser for extracting phrases.

Using the nltk_cli tool ( https://github.com/alvations/nltk_cli ):

alvas@ubi:~/git/nltk_cli$ echo "Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back" > input-doneyo.txt
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk VP input-doneyo.txt calibrated|to benchmark|performed
alvas@ubi:~/git/nltk_cli$ python senna.py --vp input-doneyo.txt 
calibrated|to benchmark|performed
alvas@ubi:~/git/nltk_cli$ python senna.py --chunk2 VP+NP input-doneyo.txt 
calibrated  the low defaults portfolio|to benchmark ratings

The outputs of the VP tags are separated by | , ie

Output:

calibrated|to benchmark|performed

Represents:

  • calibrated
  • to benchmark
  • performed

And the VP+NP chunks output are also separated by | and the VP and NP are separated by \\t , ie

Output:

calibrated  the low defaults portfolio|to benchmark ratings

Represents (VP + NP) :

  • calibrated + the low defaults portfolio
  • to benchmark + ratings

To retrieve the strings as per the input positions, you should consider using https://github.com/smilli/py-corenlp instead of the NLTK API to Stanford tools.

First you have to download, install and setup the Stanford CoreNLP, see http://stanfordnlp.github.io/CoreNLP/corenlp-server.html#getting-started

Then install the python wrapper to CoreNLP, https://github.com/smilli/py-corenlp

Then, after starting the server (many people miss this step!), in python, you can do this:

>>> from pycorenlp import StanfordCoreNLP
>>> stanford = StanfordCoreNLP('http://localhost:9000')
>>> text = ("Selected variables by univariate/multivariate analysis, constructed logistic regression, calibrated the low defaults portfolio to benchmark ratings, performed back")
>>> output = stanford.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,depparse,parse', 'outputFormat': 'json'})
>>> print(output['sentences'][0]['parse'])
(ROOT
  (SINV
    (VP (VBN Selected)
      (NP (NNS variables))
      (PP (IN by)
        (NP
          (NP (JJ univariate/multivariate) (NN analysis))
          (, ,)
          (VP (VBN constructed)
            (NP (JJ logistic) (NN regression)))
          (, ,))))
    (VP (VBD calibrated))
    (NP
      (NP
        (NP (DT the) (JJ low) (NNS defaults) (NN portfolio))
        (PP (TO to)
          (NP (JJ benchmark) (NNS ratings))))
      (, ,)
      (VP (VBN performed)
        (ADVP (RB back))))))

To retrieve the VP strings as per the input string, you would have to traverse the JSON output using the characterOffsetBegin and characterOffsetEnd :

>>> output['sentences'][0]
{u'tokens': [{u'index': 1, u'word': u'Selected', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 8, u'characterOffsetBegin': 0, u'originalText': u'Selected', u'before': u''}, {u'index': 2, u'word': u'variables', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 18, u'characterOffsetBegin': 9, u'originalText': u'variables', u'before': u' '}, {u'index': 3, u'word': u'by', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 21, u'characterOffsetBegin': 19, u'originalText': u'by', u'before': u' '}, {u'index': 4, u'word': u'univariate/multivariate', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 45, u'characterOffsetBegin': 22, u'originalText': u'univariate/multivariate', u'before': u' '}, {u'index': 5, u'word': u'analysis', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 54, u'characterOffsetBegin': 46, u'originalText': u'analysis', u'before': u' '}, {u'index': 6, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 55, u'characterOffsetBegin': 54, u'originalText': u',', u'before': u''}, {u'index': 7, u'word': u'constructed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 67, u'characterOffsetBegin': 56, u'originalText': u'constructed', u'before': u' '}, {u'index': 8, u'word': u'logistic', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 76, u'characterOffsetBegin': 68, u'originalText': u'logistic', u'before': u' '}, {u'index': 9, u'word': u'regression', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 87, u'characterOffsetBegin': 77, u'originalText': u'regression', u'before': u' '}, {u'index': 10, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 88, u'characterOffsetBegin': 87, u'originalText': u',', u'before': u''}, {u'index': 11, u'word': u'calibrated', u'after': u' ', u'pos': u'VBD', u'characterOffsetEnd': 99, u'characterOffsetBegin': 89, u'originalText': u'calibrated', u'before': u' '}, {u'index': 12, u'word': u'the', u'after': u' ', u'pos': u'DT', u'characterOffsetEnd': 103, u'characterOffsetBegin': 100, u'originalText': u'the', u'before': u' '}, {u'index': 13, u'word': u'low', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 107, u'characterOffsetBegin': 104, u'originalText': u'low', u'before': u' '}, {u'index': 14, u'word': u'defaults', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 116, u'characterOffsetBegin': 108, u'originalText': u'defaults', u'before': u' '}, {u'index': 15, u'word': u'portfolio', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 126, u'characterOffsetBegin': 117, u'originalText': u'portfolio', u'before': u' '}, {u'index': 16, u'word': u'to', u'after': u' ', u'pos': u'TO', u'characterOffsetEnd': 129, u'characterOffsetBegin': 127, u'originalText': u'to', u'before': u' '}, {u'index': 17, u'word': u'benchmark', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 139, u'characterOffsetBegin': 130, u'originalText': u'benchmark', u'before': u' '}, {u'index': 18, u'word': u'ratings', u'after': u'', u'pos': u'NNS', u'characterOffsetEnd': 147, u'characterOffsetBegin': 140, u'originalText': u'ratings', u'before': u' '}, {u'index': 19, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 148, u'characterOffsetBegin': 147, u'originalText': u',', u'before': u''}, {u'index': 20, u'word': u'performed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 158, u'characterOffsetBegin': 149, u'originalText': u'performed', u'before': u' '}, {u'index': 21, u'word': u'back', u'after': u'', u'pos': u'RB', u'characterOffsetEnd': 163, u'characterOffsetBegin': 159, u'originalText': u'back', u'before': u' '}], u'index': 0, u'basic-dependencies': [{u'dep': u'ROOT', u'dependent': 1, u'governorGloss': u'ROOT', u'governor': 0, u'dependentGloss': u'Selected'}, {u'dep': u'dobj', u'dependent': 2, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'variables'}, {u'dep': u'case', u'dependent': 3, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'by'}, {u'dep': u'amod', u'dependent': 4, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'univariate/multivariate'}, {u'dep': u'nmod', u'dependent': 5, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'analysis'}, {u'dep': u'punct', u'dependent': 6, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 7, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'constructed'}, {u'dep': u'amod', u'dependent': 8, u'governorGloss': u'regression', u'governor': 9, u'dependentGloss': u'logistic'}, {u'dep': u'dobj', u'dependent': 9, u'governorGloss': u'constructed', u'governor': 7, u'dependentGloss': u'regression'}, {u'dep': u'punct', u'dependent': 10, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'dep', u'dependent': 11, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'calibrated'}, {u'dep': u'det', u'dependent': 12, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'the'}, {u'dep': u'amod', u'dependent': 13, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'low'}, {u'dep': u'compound', u'dependent': 14, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'defaults'}, {u'dep': u'nsubj', u'dependent': 15, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'portfolio'}, {u'dep': u'case', u'dependent': 16, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'to'}, {u'dep': u'amod', u'dependent': 17, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'benchmark'}, {u'dep': u'nmod', u'dependent': 18, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'ratings'}, {u'dep': u'punct', u'dependent': 19, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 20, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'performed'}, {u'dep': u'advmod', u'dependent': 21, u'governorGloss': u'performed', u'governor': 20, u'dependentGloss': u'back'}], u'parse': u'(ROOT\n  (SINV\n    (VP (VBN Selected)\n      (NP (NNS variables))\n      (PP (IN by)\n        (NP\n          (NP (JJ univariate/multivariate) (NN analysis))\n          (, ,)\n          (VP (VBN constructed)\n            (NP (JJ logistic) (NN regression)))\n          (, ,))))\n    (VP (VBD calibrated))\n    (NP\n      (NP\n        (NP (DT the) (JJ low) (NNS defaults) (NN portfolio))\n        (PP (TO to)\n          (NP (JJ benchmark) (NNS ratings))))\n      (, ,)\n      (VP (VBN performed)\n        (ADVP (RB back))))))', u'collapsed-dependencies': [{u'dep': u'ROOT', u'dependent': 1, u'governorGloss': u'ROOT', u'governor': 0, u'dependentGloss': u'Selected'}, {u'dep': u'dobj', u'dependent': 2, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'variables'}, {u'dep': u'case', u'dependent': 3, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'by'}, {u'dep': u'amod', u'dependent': 4, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'univariate/multivariate'}, {u'dep': u'nmod:by', u'dependent': 5, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'analysis'}, {u'dep': u'punct', u'dependent': 6, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 7, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'constructed'}, {u'dep': u'amod', u'dependent': 8, u'governorGloss': u'regression', u'governor': 9, u'dependentGloss': u'logistic'}, {u'dep': u'dobj', u'dependent': 9, u'governorGloss': u'constructed', u'governor': 7, u'dependentGloss': u'regression'}, {u'dep': u'punct', u'dependent': 10, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'dep', u'dependent': 11, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'calibrated'}, {u'dep': u'det', u'dependent': 12, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'the'}, {u'dep': u'amod', u'dependent': 13, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'low'}, {u'dep': u'compound', u'dependent': 14, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'defaults'}, {u'dep': u'nsubj', u'dependent': 15, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'portfolio'}, {u'dep': u'case', u'dependent': 16, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'to'}, {u'dep': u'amod', u'dependent': 17, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'benchmark'}, {u'dep': u'nmod:to', u'dependent': 18, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'ratings'}, {u'dep': u'punct', u'dependent': 19, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 20, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'performed'}, {u'dep': u'advmod', u'dependent': 21, u'governorGloss': u'performed', u'governor': 20, u'dependentGloss': u'back'}], u'collapsed-ccprocessed-dependencies': [{u'dep': u'ROOT', u'dependent': 1, u'governorGloss': u'ROOT', u'governor': 0, u'dependentGloss': u'Selected'}, {u'dep': u'dobj', u'dependent': 2, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'variables'}, {u'dep': u'case', u'dependent': 3, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'by'}, {u'dep': u'amod', u'dependent': 4, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'univariate/multivariate'}, {u'dep': u'nmod:by', u'dependent': 5, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'analysis'}, {u'dep': u'punct', u'dependent': 6, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 7, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u'constructed'}, {u'dep': u'amod', u'dependent': 8, u'governorGloss': u'regression', u'governor': 9, u'dependentGloss': u'logistic'}, {u'dep': u'dobj', u'dependent': 9, u'governorGloss': u'constructed', u'governor': 7, u'dependentGloss': u'regression'}, {u'dep': u'punct', u'dependent': 10, u'governorGloss': u'analysis', u'governor': 5, u'dependentGloss': u','}, {u'dep': u'dep', u'dependent': 11, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'calibrated'}, {u'dep': u'det', u'dependent': 12, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'the'}, {u'dep': u'amod', u'dependent': 13, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'low'}, {u'dep': u'compound', u'dependent': 14, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'defaults'}, {u'dep': u'nsubj', u'dependent': 15, u'governorGloss': u'Selected', u'governor': 1, u'dependentGloss': u'portfolio'}, {u'dep': u'case', u'dependent': 16, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'to'}, {u'dep': u'amod', u'dependent': 17, u'governorGloss': u'ratings', u'governor': 18, u'dependentGloss': u'benchmark'}, {u'dep': u'nmod:to', u'dependent': 18, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'ratings'}, {u'dep': u'punct', u'dependent': 19, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u','}, {u'dep': u'acl', u'dependent': 20, u'governorGloss': u'portfolio', u'governor': 15, u'dependentGloss': u'performed'}, {u'dep': u'advmod', u'dependent': 21, u'governorGloss': u'performed', u'governor': 20, u'dependentGloss': u'back'}]}

But it doesn't seem to be an easy output to parse to get the character offset since there's no direct link of the parse tree to the offset. Only the dependency triples contains the link to the word ID that links to the offset.


To access the tokens and 'after' and 'before' keys in output['sentences'][0]['tokens'] (but sadly no directly link to the parse tree):

>>> tokens = output['sentences'][0]['tokens']
>>> tokens
[{u'index': 1, u'word': u'Selected', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 8, u'characterOffsetBegin': 0, u'originalText': u'Selected', u'before': u''}, {u'index': 2, u'word': u'variables', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 18, u'characterOffsetBegin': 9, u'originalText': u'variables', u'before': u' '}, {u'index': 3, u'word': u'by', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 21, u'characterOffsetBegin': 19, u'originalText': u'by', u'before': u' '}, {u'index': 4, u'word': u'univariate/multivariate', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 45, u'characterOffsetBegin': 22, u'originalText': u'univariate/multivariate', u'before': u' '}, {u'index': 5, u'word': u'analysis', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 54, u'characterOffsetBegin': 46, u'originalText': u'analysis', u'before': u' '}, {u'index': 6, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 55, u'characterOffsetBegin': 54, u'originalText': u',', u'before': u''}, {u'index': 7, u'word': u'constructed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 67, u'characterOffsetBegin': 56, u'originalText': u'constructed', u'before': u' '}, {u'index': 8, u'word': u'logistic', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 76, u'characterOffsetBegin': 68, u'originalText': u'logistic', u'before': u' '}, {u'index': 9, u'word': u'regression', u'after': u'', u'pos': u'NN', u'characterOffsetEnd': 87, u'characterOffsetBegin': 77, u'originalText': u'regression', u'before': u' '}, {u'index': 10, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 88, u'characterOffsetBegin': 87, u'originalText': u',', u'before': u''}, {u'index': 11, u'word': u'calibrated', u'after': u' ', u'pos': u'VBD', u'characterOffsetEnd': 99, u'characterOffsetBegin': 89, u'originalText': u'calibrated', u'before': u' '}, {u'index': 12, u'word': u'the', u'after': u' ', u'pos': u'DT', u'characterOffsetEnd': 103, u'characterOffsetBegin': 100, u'originalText': u'the', u'before': u' '}, {u'index': 13, u'word': u'low', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 107, u'characterOffsetBegin': 104, u'originalText': u'low', u'before': u' '}, {u'index': 14, u'word': u'defaults', u'after': u' ', u'pos': u'NNS', u'characterOffsetEnd': 116, u'characterOffsetBegin': 108, u'originalText': u'defaults', u'before': u' '}, {u'index': 15, u'word': u'portfolio', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 126, u'characterOffsetBegin': 117, u'originalText': u'portfolio', u'before': u' '}, {u'index': 16, u'word': u'to', u'after': u' ', u'pos': u'TO', u'characterOffsetEnd': 129, u'characterOffsetBegin': 127, u'originalText': u'to', u'before': u' '}, {u'index': 17, u'word': u'benchmark', u'after': u' ', u'pos': u'JJ', u'characterOffsetEnd': 139, u'characterOffsetBegin': 130, u'originalText': u'benchmark', u'before': u' '}, {u'index': 18, u'word': u'ratings', u'after': u'', u'pos': u'NNS', u'characterOffsetEnd': 147, u'characterOffsetBegin': 140, u'originalText': u'ratings', u'before': u' '}, {u'index': 19, u'word': u',', u'after': u' ', u'pos': u',', u'characterOffsetEnd': 148, u'characterOffsetBegin': 147, u'originalText': u',', u'before': u''}, {u'index': 20, u'word': u'performed', u'after': u' ', u'pos': u'VBN', u'characterOffsetEnd': 158, u'characterOffsetBegin': 149, u'originalText': u'performed', u'before': u' '}, {u'index': 21, u'word': u'back', u'after': u'', u'pos': u'RB', u'characterOffsetEnd': 163, u'characterOffsetBegin': 159, u'originalText': u'back', u'before': u' '}]

Unrelated to NLTK or StanfordParser , one way to get normal reading text is to "detokenize" the output using scripts from Moses SMT ( https://github.com/moses-smt/mosesdecoder ), eg:

alvas@ubi:~$ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/detokenizer.perl
--2016-02-13 21:27:12--  https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/detokenizer.perl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 23.235.43.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|23.235.43.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12473 (12K) [text/plain]
Saving to: ‘detokenizer.perl’

100%[===============================================================================================================================>] 12,473      --.-K/s   in 0s      

2016-02-13 21:27:12 (150 MB/s) - ‘detokenizer.perl’ saved [12473/12473]

alvas@ubi:~$ echo "constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings" 2> /tmp/null
constructed logistic regression , calibrated the low defaults portfolio to benchmark ratings

Note that the output MIGHT NOT be the same as the input but for English most of the time, it will be converted into the normal text that we read/write.

It's in the pipeline to have a detokenizer in NLTK but it will take a while for us to code it, test it and push it up to the repository, we ask for your patience (see https://github.com/nltk/nltk/issues/1214 )

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM