I'm trying to do sentiment analysis on a large corpus of tweets in a local MongoDB instance with Ruby on Rails 4, Ruby 2.1.2 and Mongoid ORM.
I've used the freely available https://loudelement-free-natural-language-processing-service.p.mashape.com API on Mashape.com, but it starts timing out after a few hundred tweets have been pushed through in rapid succession. Clearly it isn't meant for processing tens of thousands of tweets, which is understandable.
So next I thought I'd use the Stanford CoreNLP library promoted here: http://nlp.stanford.edu/sentiment/code.html
Besides calling the library from Java 1.8 code, the default usage seems to be XML input and output files. For my use case this is annoying, given that I have tens of thousands of short tweets rather than long text files. I would rather call CoreNLP like a method inside a tweets.each type of loop.
I guess one way would be to construct an XML file with all of the tweets and then get one out of the Java process and parse that and put it back to the DB, but that feels alien to me and would be a lot of work.
So, I was happy to find on the site linked above a way to run CoreNLP from the command line and accept the text as stdin so that I didn't have to start fiddling with the filesystem but rather feed the text as a parameter. However, starting up the JVM separately for each tweet adds a huge overhead compared to using the loudelement free sentiment analysis API.
Now, the code I wrote is ugly and slow but it works. Still, I'm wondering if there's a better way to run the CoreNLP java program from within Ruby without having to start fiddling with the filesystem (creating temp files and giving them as params) or writing Java code?
Here's the code I'm using:
def self.mass_analyze_w_corenlp # batch run the method in multiple Ruby processes
  todo = Tweet.all.exists(corenlp_sentiment: false).limit(5000).sort(follow_ratio: -1) # start with the "least spammy" tweets based on follow ratio
  counter = 0
  todo.each do |tweet|
    counter += 1
    fork { tweet.analyze_sentiment_w_corenlp } # run the analysis in a separate Ruby process
    if counter >= 5 # when five concurrent processes are running, wait until they finish to preserve memory
      Process.waitall
      counter = 0
    end
  end
  Process.waitall # wait for the children of the last, possibly incomplete, group
end
def analyze_sentiment_w_corenlp # run the sentiment analysis for each tweet object
  text_to_be_analyzed = self.text.gsub(/['"]/, ' ') # fetch the text field of the DB item and replace quotes that confuse the command line
  start = "echo '"
  finish = "' | java -cp 'vendor/corenlp/*' -mx250m edu.stanford.nlp.sentiment.SentimentPipeline -stdin"
  command_string = start + text_to_be_analyzed + finish # assemble the command for the command line usage below
  output = `#{command_string}` # run CoreNLP through the shell; backticks return the command's stdout, whereas system('...') would only return true/false
  to_db = output.gsub(/\s+/, "").downcase # since CoreNLP uses indentation, remove unnecessary whitespace
  # output is in the format of "neutral", "positive", "negative" and so on
  puts "Sentiment analysis successful, sentiment is: #{to_db} for tweet #{text_to_be_analyzed}."
  self.corenlp_sentiment = to_db # insert result as a field to the object
  self.save! # sentiment analysis done!
end
You can at least avoid the ugly and dangerous command line stuff by using IO.popen
to open and communicate with the external process, for example:
input_string = "
foo
bar
baz
"
output_string =
  IO.popen("grep 'foo'", 'r+') do |pipe|
    pipe.write(input_string)
    pipe.close_write
    pipe.read
  end
puts "grep said #{output_string.strip} but not bar"
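If the goal is to get rid of the quote-stripping entirely, IO.popen also accepts the command as an array, in which case Ruby starts the program directly and no shell parses anything. A minimal sketch, again using grep as a stand-in for the Java program; the variable names and sample text are made up for illustration:

```ruby
# Sample text with quotes that would break a hand-assembled shell command.
tweet_text = "it's a \"great\" day; foo"

# With an array, Ruby execs the program directly, so no shell ever sees the
# arguments or the input. For the real case the array would presumably hold
# 'java', '-cp', 'vendor/corenlp/*', ... (Java expands that classpath
# wildcard itself, so it does not need a shell either).
grep_output =
  IO.popen(['grep', 'foo'], 'r+') do |pipe|
    pipe.write(tweet_text + "\n") # the text goes in through stdin, unescaped
    pipe.close_write              # signal end of input so grep can finish
    pipe.read
  end

puts "grep said #{grep_output.strip}"
```

Since the text travels over stdin and never touches a command line, the quotes can stay in the tweet.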
EDIT: to avoid the overhead of reloading the Java program on each item, you can open the pipe around the todo.each
loop and communicate with the process like this:
inputs = ['a', 'b', 'c', 'd']
IO.popen('cat', 'r+') do |pipe|
  inputs.each do |s|
    pipe.write(s + "\n")
    out = pipe.readline
    puts "cat said '#{out.strip}'"
  end
end
That works only if the Java program supports such line-buffered "batch" input. If it doesn't, modifying it to do so should not be very difficult.
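To get closer to the "call it like a method" usage asked for in the question, the long-lived pipe can be wrapped in a small object. This is only a sketch: LineProcess and ask are hypothetical names, cat again stands in for the Java program, and it assumes the real program answers each input line with exactly one flushed output line, which you would have to verify for SentimentPipeline.

```ruby
# Hypothetical wrapper: keeps one child process alive and trades one line of
# input for one line of output per call.
class LineProcess
  def initialize(*command)
    # Array form: the program is started directly, without a shell.
    @pipe = IO.popen(command, 'r+')
  end

  def ask(text)
    # Collapse newlines and other whitespace runs so that one tweet is
    # exactly one line; otherwise requests and replies would get out of step.
    @pipe.puts(text.gsub(/\s+/, ' ').strip)
    @pipe.flush
    @pipe.readline.chomp
  end

  def close
    @pipe.close
  end
end

worker = LineProcess.new('cat') # stand-in for the java command
replies = ["first tweet", "second\ntweet"].map { |t| worker.ask(t) }
worker.close

puts replies.inspect
```

The process is started once, before the loop over the tweets, and closed once afterwards, so the startup cost is paid a single time.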
As suggested in the comments by @Qualtagh, I decided to use JRuby.
I first tried doing everything in Java (read directly from MongoDB, analyze with Java / CoreNLP and write back to MongoDB), but the MongoDB Java Driver was more complex to use than the Mongoid ORM I use with Ruby, which is why I felt JRuby was more appropriate.
Building a REST service in Java would have required me to first learn how to do that, which might or might not have been easy. I didn't want to spend time figuring that out.
So the code I needed to run the analysis was:
def analyze_tweet_with_corenlp_jruby
  require 'java'
  require 'vendor/CoreNLPTest2.jar' # I made this Java JAR with IntelliJ IDEA that includes both CoreNLP and my initialization class
  analyzer = com.me.Analyzer.new # this is the Java class I made for running the CoreNLP analysis, it initializes the CoreNLP with the correct annotations etc.
  result = analyzer.analyzeTweet(self.text) # self.text is where the text-to-be-analyzed resides
  self.corenlp_sentiment = result # adds the result into this field in the MongoDB model
  self.save!
  return "#{result}: #{self.text}" # for debugging purposes
end
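One thing to watch in the method above: it builds a new analyzer on every call, so if the constructor is where CoreNLP gets initialized, that cost is paid once per tweet. A sketch of building it once and reusing it follows. StubAnalyzer is a hypothetical stand-in for com.me.Analyzer so the example runs without a JVM, and the SentimentAnalysis module is likewise only an illustration.

```ruby
# Hypothetical stand-in for the Java analyzer class; it counts how many
# times it is constructed so the effect of caching is visible.
class StubAnalyzer
  @instances = 0
  class << self
    attr_accessor :instances
  end

  def initialize
    self.class.instances += 1 # imagine the expensive CoreNLP setup here
  end

  # Same method name as in the JRuby code above; the logic is a placeholder.
  def analyzeTweet(text)
    text.include?('love') ? 'positive' : 'neutral'
  end
end

module SentimentAnalysis
  # Build the analyzer on first use and hand back the same object afterwards.
  # Note that ||= is not thread-safe; with threads, build it eagerly instead.
  def self.analyzer
    @analyzer ||= StubAnalyzer.new
  end
end

results = ['I love this', 'a plain tweet'].map do |text|
  SentimentAnalysis.analyzer.analyzeTweet(text)
end

puts results.inspect
puts "analyzers built: #{StubAnalyzer.instances}"
```

In the JRuby version, the body of SentimentAnalysis.analyzer would presumably create com.me.Analyzer instead, and the instance method would call SentimentAnalysis.analyzer.analyzeTweet(self.text).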