I'm trying to use the Stanford Parser from nltk.parse.stanford to do a bunch of NLP tasks. There are certain operations on sentences that I am able to do when I explicitly pass a sentence or a list of sentences as input. But how do I actually split a large amount of text into sentences? (Obviously, a regex on periods etc. won't work well.)
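To illustrate why a naive period-based split falls short (a minimal sketch; the sample sentence is my own):

```python
import re

text = "Dr. Smith went to Washington. He arrived at 3 p.m. yesterday."

# Naive approach: treat every period followed by whitespace as a boundary.
naive = re.split(r"(?<=\.)\s+", text)
print(naive)
# The abbreviations "Dr." and "p.m." produce spurious breaks:
# 4 fragments instead of the 2 real sentences.
```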
I checked the documentation here and found nothing: http://www.nltk.org/api/nltk.parse.html?highlight=stanford#module-nltk.parse.stanford
I found something similar that does the job for java here: How can I split a text into sentences using the Stanford parser?
I think I need something exactly like this for the python version of the library.
First, set up the Stanford tools and NLTK correctly, e.g. on Linux:
alvas@ubi:~$ cd
alvas@ubi:~$ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-12-09.zip
alvas@ubi:~$ unzip stanford-parser-full-2015-12-09.zip
alvas@ubi:~$ ls stanford-parser-full-2015-12-09
bin ejml-0.23.jar lexparser-gui.sh LICENSE.txt README_dependencies.txt StanfordDependenciesManual.pdf
build.xml ejml-0.23-src.zip lexparser_lang.def Makefile README.txt stanford-parser-3.6.0-javadoc.jar
conf lexparser.bat lexparser-lang.sh ParserDemo2.java ShiftReduceDemo.java stanford-parser-3.6.0-models.jar
data lexparser-gui.bat lexparser-lang-train-test.sh ParserDemo.java slf4j-api.jar stanford-parser-3.6.0-sources.jar
DependencyParserDemo.java lexparser-gui.command lexparser.sh pom.xml slf4j-simple.jar stanford-parser.jar
alvas@ubi:~$ export STANFORDTOOLSDIR=$HOME
alvas@ubi:~$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar
(See https://gist.github.com/alvations/e1df0ba227e542955a8a for more details and see https://gist.github.com/alvations/0ed8641d7d2e1941b9f9 for windows instructions)
Then use the Punkt tokenizer (Kiss and Strunk, 2006) to split the text into a list of strings, where each item in the list is a sentence.
>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = 'This is the first sentence. This is the second. And this is the third'
>>> sent_tokenize(sentences)
['This is the first sentence.', 'This is the second.', 'And this is the third']
Then feed the document stream into the stanford parser:
>>> list(list(parsed_sent) for parsed_sent in parser.raw_parse_sents(sent_tokenze(sentences)))
[[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['first']), Tree('NN', ['sentence'])])]), Tree('.', ['.'])])])], [Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['second'])])]), Tree('.', ['.'])])])], [Tree('ROOT', [Tree('S', [Tree('CC', ['And']), Tree('NP', [Tree('DT', ['this'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['third'])])])])])]]
This is from the nltk website ( http://www.nltk.org/api/nltk.tokenize.html?highlight=split%20sentence ):
Punkt Sentence Tokenizer
This tokenizer divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
Sample code:
import nltk.data
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
print('\n-----\n'.join(sent_detector.tokenize('hello there. how are you doing today, mr. bojangles?')))
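If the pre-trained English pickle is unavailable, a PunktSentenceTokenizer can also be trained directly on plain text of your own (a minimal sketch; the training and test strings here are made up for illustration):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Train Punkt on a small sample of plain text; real use needs a
# large corpus so the abbreviation statistics are reliable.
train_text = (
    "This is some training text. It has several sentences. "
    "Each one ends with a period. The tokenizer learns from it."
)
tokenizer = PunktSentenceTokenizer(train_text)

print(tokenizer.tokenize("Hello there. How are you doing today?"))
```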