I am trying to run the Stanford parser on Ubuntu from Python code. The text file I am trying to parse is 500 MB, and the machine has 32 GB of RAM. I am increasing the JVM memory size, but I don't know whether the increase is actually taking effect, because I get this error every time. Please help me out.
*** WARNING!! OUT OF MEMORY! THERE WAS NOT ENOUGH ***
*** MEMORY TO RUN ALL PARSERS. EITHER GIVE THE ***
*** JVM MORE MEMORY, SET THE MAXIMUM SENTENCE ***
*** LENGTH WITH -maxLength, OR PERHAPS YOU ARE ***
*** HAPPY TO HAVE THE PARSER FALL BACK TO USING ***
*** A SIMPLER PARSER FOR VERY LONG SENTENCES. ***
Sentence has no parse using PCFG grammar (or no PCFG fallback). Skipping...
Exception in thread "main" edu.stanford.nlp.parser.common.NoSuchParseException
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.getBestParse(LexicalizedParserQuery.java:398)
at edu.stanford.nlp.parser.lexparser.LexicalizedParserQuery.getBestParse(LexicalizedParserQuery.java:370)
at edu.stanford.nlp.parser.lexparser.ParseFiles.processResults(ParseFiles.java:271)
at edu.stanford.nlp.parser.lexparser.ParseFiles.parseFiles(ParseFiles.java:215)
at edu.stanford.nlp.parser.lexparser.ParseFiles.parseFiles(ParseFiles.java:74)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.main(LexicalizedParser.java:1513)
You should divide the text file into small pieces and give them to the parser one at a time. The parser creates an in-memory representation of each whole "document" it is given, and that representation is orders of magnitude bigger than the document on disk, so it is a very bad idea to try to give it a 500 MB document in one gulp.
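As a rough illustration of the splitting step, here is a minimal Python sketch. The function name `split_text_file`, the chunk file naming, and the 5 MB default are all assumptions made for this example, not anything the parser requires. It cuts only at line boundaries, so it assumes your sentences do not span lines; if they do, you would need a smarter split point.

```python
from pathlib import Path


def split_text_file(src, out_dir, max_bytes=5_000_000):
    """Split src into numbered chunk files of roughly max_bytes each.

    Cuts only at line boundaries, so a line is never split in half.
    A single line longer than max_bytes gets a chunk of its own.
    Returns the list of chunk paths in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    out = None
    written = 0
    try:
        # Binary mode keeps byte counts exact and leaves the encoding untouched.
        with open(src, "rb") as infile:
            for line in infile:
                # Open a new chunk if none exists yet, or if adding this
                # line would push a non-empty chunk over the limit.
                if out is None or (written > 0 and written + len(line) > max_bytes):
                    if out is not None:
                        out.close()
                    path = out_dir / f"chunk_{len(chunk_paths):05d}.txt"
                    out = open(path, "wb")
                    chunk_paths.append(path)
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return chunk_paths
```

Each returned path can then be handed to the parser in its own run, so that only one small "document" is in memory at a time.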
You should also avoid super-long "sentences", which can easily occur if casual or web-scraped text lacks sentence delimiters, or if you are feeding it big tables or gibberish. The safest way to avoid this issue is to set a parameter limiting the maximum sentence length, such as -maxLength 100.
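Since the parser is being launched from Python, a sketch like the following may help make sure both the heap size and `-maxLength` actually reach the Java process. The helper name `build_parser_command`, the `"*"` classpath, the `4g` heap default, and the `englishPCFG.ser.gz` model path are assumptions for illustration; adjust them to your installation. The main class name is the one shown in the stack trace. Note that the heap option has to appear before the class name, otherwise it is passed to the parser as an ordinary argument instead of to the JVM.

```python
def build_parser_command(
    input_path,
    heap="4g",           # assumed default; pick a value that fits your machine
    max_length=100,      # sentences longer than this are skipped by the parser
    classpath="*",       # assumes the command is run from the parser directory
    model="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",  # assumed model
):
    """Build the argument list for one parser run on one input file."""
    return [
        "java",
        f"-Xmx{heap}",   # JVM option: must come before the main class
        "-cp", classpath,
        "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
        "-maxLength", str(max_length),
        "-outputFormat", "penn",
        model,
        str(input_path),
    ]
```

One way to use it is `subprocess.run(build_parser_command("chunk_00000.txt"), check=True)`, run with the parser's directory as the working directory. Passing the arguments as a list avoids shell quoting problems, and printing the list before running it is a quick way to check that the memory setting is really on the command line.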
You might want to try out the neural network dependency parser, which scales better to large tasks: http://nlp.stanford.edu/software/nndep.shtml