
How to split sentences using the nltk.parse.stanford library

I'm trying to use the Stanford Parser from nltk.parse.stanford for a number of NLP tasks. There are certain operations on sentences that I can perform when I explicitly pass a sentence or a list of sentences as input, but how do I actually split a large amount of text into sentences? (Obviously, a regex on periods etc. won't work well.)

I checked the documentation here and found nothing: http://www.nltk.org/api/nltk.parse.html?highlight=stanford#module-nltk.parse.stanford

I found something similar that does the job for java here: How can I split a text into sentences using the Stanford parser?

I think I need something exactly like this for the python version of the library.

First, set up the Stanford tools and NLTK correctly, e.g. in Linux:

alvas@ubi:~$ cd
alvas@ubi:~$ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-12-09.zip
alvas@ubi:~$ unzip stanford-parser-full-2015-12-09.zip
alvas@ubi:~$ ls stanford-parser-full-2015-12-09
bin                        ejml-0.23.jar          lexparser-gui.sh              LICENSE.txt       README_dependencies.txt  StanfordDependenciesManual.pdf
build.xml                  ejml-0.23-src.zip      lexparser_lang.def            Makefile          README.txt               stanford-parser-3.6.0-javadoc.jar
conf                       lexparser.bat          lexparser-lang.sh             ParserDemo2.java  ShiftReduceDemo.java     stanford-parser-3.6.0-models.jar
data                       lexparser-gui.bat      lexparser-lang-train-test.sh  ParserDemo.java   slf4j-api.jar            stanford-parser-3.6.0-sources.jar
DependencyParserDemo.java  lexparser-gui.command  lexparser.sh                  pom.xml           slf4j-simple.jar         stanford-parser.jar
alvas@ubi:~$ export STANFORDTOOLSDIR=$HOME
alvas@ubi:~$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar

(See https://gist.github.com/alvations/e1df0ba227e542955a8a for more details and https://gist.github.com/alvations/0ed8641d7d2e1941b9f9 for Windows instructions.)

Then, use the Kiss and Strunk (2006) Punkt algorithm to sentence-tokenize the text into a list of strings, where each item in the list is a sentence:

>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = 'This is the first sentence. This is the second. And this is the third'
>>> sent_tokenize(sentences)
['This is the first sentence.', 'This is the second.', 'And this is the third']
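
The parse call below uses a parser object whose construction isn't shown; a minimal sketch of creating one, assuming the jars exported to CLASSPATH above (StanfordParser lives in nltk.parse.stanford, and edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz is its default English model path):

>>> from nltk.parse.stanford import StanfordParser
>>> # Picks up stanford-parser.jar and the models jar from CLASSPATH
>>> parser = StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")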

Then feed the tokenized sentences into the Stanford parser:

>>> list(list(parsed_sent) for parsed_sent in parser.raw_parse_sents(sent_tokenize(sentences)))
[[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['first']), Tree('NN', ['sentence'])])]), Tree('.', ['.'])])])], [Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['second'])])]), Tree('.', ['.'])])])], [Tree('ROOT', [Tree('S', [Tree('CC', ['And']), Tree('NP', [Tree('DT', ['this'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['third'])])])])])]]
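
Each item in the result is an nltk.tree.Tree, so the standard Tree API applies; a short sketch (leaves() and label() are part of NLTK's Tree class):

>>> parses = [list(p) for p in parser.raw_parse_sents(sent_tokenize(sentences))]
>>> parses[0][0].label()
'ROOT'
>>> parses[0][0].leaves()
['This', 'is', 'the', 'first', 'sentence', '.']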

This is from the NLTK website (http://www.nltk.org/api/nltk.tokenize.html?highlight=split%20sentence):

Punkt Sentence Tokenizer

This tokenizer divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

Sample code:

import nltk.data

# Load the pre-trained Punkt model for English
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

# Split on sentence boundaries, handling abbreviations like "mr."
print('\n-----\n'.join(sent_detector.tokenize('hello there. how are you doing today, mr. bojangles?')))
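
Because the algorithm is unsupervised, you can also train Punkt on your own plaintext instead of loading the shipped English model; a minimal sketch using NLTK's PunktTrainer and PunktSentenceTokenizer classes (my_corpus.txt is a hypothetical file standing in for a large plaintext corpus):

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Hypothetical corpus file; Punkt needs a large body of raw text to learn from
with open('my_corpus.txt') as f:
    training_text = f.read()

trainer = PunktTrainer()
trainer.train(training_text, finalize=True)  # learns abbreviations, collocations, sentence starters

# Build a tokenizer from the learned parameters
sent_detector = PunktSentenceTokenizer(trainer.get_params())
print(sent_detector.tokenize('hello there. how are you doing today, mr. bojangles?'))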
