简体   繁体   中英

Stanford Parser for Python: Output Format

I am currently using the Python interface for the Stanford Parser.

    from nltk.parse.stanford import StanfordParser
    import os

    os.environ['STANFORD_PARSER'] ='/Users/au571533/Downloads/stanford-parser-full-2016-10-31'
    os.environ['STANFORD_MODELS'] = '/Users/au571533/Downloads/stanford-parser-full-2016-10-31'
    parser=StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")

    new=list(parser.raw_parse("The young man who boarded his usual train that Sunday afternoon was twenty-four years old and fat. "))
    print new

The output I get looks something like this:

    [Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['young']), Tree('NN', ['man'])]), Tree('SBAR', [Tree('WHNP', [Tree('WP', ['who'])]), Tree('S', [Tree('VP', [Tree('VBD', ['boarded']), Tree('NP', [Tree('PRP$', ['his']), Tree('JJ', ['usual']), Tree('NN', ['train'])]), Tree('NP', [Tree('DT', ['that']), Tree('NNP', ['Sunday'])])])])])]), Tree('NP', [Tree('NN', ['afternoon'])]), Tree('VP', [Tree('VBD', ['was']), Tree('NP', [Tree('NP', [Tree('JJ', ['twenty-four']), Tree('NNS', ['years'])]), Tree('ADJP', [Tree('JJ', ['old']), Tree('CC', ['and']), Tree('JJ', ['fat'])])])]), Tree('.', ['.'])])])]

However, I only need the part of speech labels, therefore I'd like to have an output in a format that looks like word/tag.

In java it is possible to specify -outputFormat 'wordsAndTags' and it gives exactly what I want. Any hint on how to implement this in Python?

Help would be GREATLY appreciated. Thanks!

PS: Tried to use the Stanford POSTagger but it is by far less accurate on some of the words I'm interested in.

If you look at the NLTK classes for the Stanford parser , you can see that the the raw_parse_sents() method doesn't send the -outputFormat wordsAndTags option that you want, and instead sends -outputFormat Penn . If you derive your own class from StanfordParser , you could override this method and specify the wordsAndTags format.

from nltk.parse.stanford import StanfordParser

class MyParser(StanfordParser):

        def raw_parse_sents(self, sentences, verbose=False):
        """
        Use StanfordParser to parse multiple sentences. Takes multiple sentences as a
        list of strings.
        Each sentence will be automatically tokenized and tagged by the Stanford Parser.
        The output format is `wordsAndTags`.

        :param sentences: Input sentences to parse
        :type sentences: list(str)
        :rtype: iter(iter(Tree))
        """
        cmd = [
            self._MAIN_CLASS,
            '-model', self.model_path,
            '-sentences', 'newline',
            '-outputFormat', 'wordsAndTags',
        ]
        return self._parse_trees_output(self._execute(cmd, '\n'.join(sentences), verbose))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM