
How to use Stanford CoreNLP java library with Ruby for sentiment analysis?

I'm trying to do sentiment analysis on a large corpus of tweets in a local MongoDB instance with Ruby on Rails 4, Ruby 2.1.2 and Mongoid ORM.

I've used the freely available https://loudelement-free-natural-language-processing-service.p.mashape.com API on Mashape.com, however it starts timing out after pushing through a few hundred tweets in rapid-fire sequence -- clearly it isn't meant for going through tens of thousands of tweets, and that's understandable.

So next I thought I'd use the Stanford CoreNLP library promoted here: http://nlp.stanford.edu/sentiment/code.html

Aside from calling the library from Java 1.8 code, the default usage seems to be XML input and output files. For my use case this is annoying, given that I have tens of thousands of short tweets rather than long text files. I would want to call CoreNLP like a method inside a tweets.each type of loop.

I guess one way would be to construct one XML file with all of the tweets, get another XML file back out of the Java process, parse that and write the results back to the DB, but that feels alien to me and would be a lot of work.

So, I was happy to find on the site linked above a way to run CoreNLP from the command line with the text on stdin, so that I didn't have to start fiddling with the filesystem but could feed the text in directly. However, starting up the JVM separately for each tweet adds a huge overhead compared to using the loudelement free sentiment analysis API.

Now, the code I wrote is ugly and slow but it works. Still, I'm wondering if there's a better way to run the CoreNLP java program from within Ruby without having to start fiddling with the filesystem (creating temp files and giving them as params) or writing Java code?

Here's the code I'm using:

def self.mass_analyze_w_corenlp # batch-run the method in multiple Ruby processes
  # start with the "least spammy" tweets based on follow ratio
  todo = Tweet.all.exists(corenlp_sentiment: false).limit(5000).sort(follow_ratio: -1)
  counter = 0

  todo.each do |tweet|
    counter += 1

    fork { tweet.analyze_sentiment_w_corenlp } # run the analysis in a separate Ruby process

    if counter >= 5 # once five concurrent processes are running, wait until they finish to conserve memory
      Process.waitall
      counter = 0
    end
  end
end

def analyze_sentiment_w_corenlp # run the sentiment analysis for one tweet object
  text_to_be_analyzed = self.text.gsub("'", " ").gsub('"', ' ') # fetch the text field of the DB item and strip quotes that would confuse the shell command

  start = "echo '"
  finish = "' | java -cp 'vendor/corenlp/*' -mx250m edu.stanford.nlp.sentiment.SentimentPipeline -stdin"
  command_string = start + text_to_be_analyzed + finish # assemble the shell command used below

  output = `#{command_string}` # run CoreNLP on the command line; backticks capture stdout (unlike system, which only returns true/false)
  to_db = output.gsub(/\s+/, "").downcase # since CoreNLP indents its output, remove the unnecessary whitespace
  # output is one of "neutral", "positive", "negative" and so on

  puts "Sentiment analysis successful, sentiment is: #{to_db} for tweet #{text_to_be_analyzed}."

  self.corenlp_sentiment = to_db # store the result as a field on the object
  self.save! # sentiment analysis done!
end
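As an aside, rather than stripping quotes out of the tweet text, Ruby's standard Shellwords module can escape arbitrary text safely for the shell. A minimal sketch, with echo standing in for the full CoreNLP pipeline:

```ruby
require 'shellwords'

text = %q{it's a "great" day}                # quotes no longer need to be stripped
command = "echo #{Shellwords.escape(text)}"  # escape the text instead of mangling it
output = `#{command}`.strip                  # the text survives the shell round-trip intact
```

In the question's method, `Shellwords.escape(text_to_be_analyzed)` could replace the two `gsub` calls, so the analyzed text matches what is stored.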

You can at least avoid the ugly and dangerous command line stuff by using IO.popen to open and communicate with the external process, for example:

input_string = "
foo
bar
baz
"

output_string =
    IO.popen("grep 'foo'", 'r+') do |pipe|
        pipe.write(input_string)
        pipe.close_write
        pipe.read
    end

puts "grep said #{output_string.strip} but not bar"

EDIT: to avoid the overhead of reloading the Java program for each item, you can open the pipe around the todo.each loop and communicate with the process like this:

inputs = ['a', 'b', 'c', 'd']

IO.popen('cat', 'r+') do |pipe|

    inputs.each do |s|
        pipe.write(s + "\n")
        out = pipe.readline

        puts "cat said '#{out.strip}'"
    end
end

That is, if the Java program supports such line-buffered "batch" input. If it doesn't, it should not be very difficult to modify it to do so.
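A sketch of that pattern applied to the tweet loop, with a small Ruby subprocess standing in for the sentiment pipeline (in real use the command would be the `java ... SentimentPipeline -stdin` invocation from the question, assuming it flushes exactly one result line per input line):

```ruby
# Stand-in filter process: upcases each input line. Swap in the CoreNLP
# command here, provided it emits one flushed result line per input line.
filter = ["ruby", "-ne", '$stdout.sync = true; puts $_.strip.upcase']

tweets  = ["great day", "awful weather"]
results = []

IO.popen(filter, "r+") do |pipe|
  tweets.each do |text|
    pipe.puts(text.delete("\n"))    # one tweet per line; drop embedded newlines
    results << pipe.readline.strip  # read the matching result before the next write
  end
end

results # => ["GREAT DAY", "AWFUL WEATHER"]
```

Reading each result before writing the next line keeps the pipe from filling up and deadlocking on large batches.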

As suggested in the comments by @Qualtagh, I decided to use JRuby.

I first attempted to use Java throughout (read directly from MongoDB, analyze with Java / CoreNLP, and write back to MongoDB), but the MongoDB Java Driver was more cumbersome to use than the Mongoid ORM I use with Ruby, which is why JRuby felt more appropriate.

Exposing the Java code as a REST service would have required me first to learn how to build a REST service in Java, which might have been easy, or might not. I didn't want to spend time finding out.

So the code I ended up with was:

def analyze_tweet_with_corenlp_jruby
  require 'java'
  require 'vendor/CoreNLPTest2.jar' # a JAR I built with IntelliJ IDEA that bundles CoreNLP and my initialization class

  analyzer = com.me.Analyzer.new # the Java class I wrote for running the CoreNLP analysis; it initializes CoreNLP with the correct annotations etc.
  result = analyzer.analyzeTweet(self.text) # self.text is where the text-to-be-analyzed resides

  self.corenlp_sentiment = result # store the result in this field of the MongoDB model
  self.save!
  return "#{result}: #{self.text}" # for debugging purposes
end
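The performance win here is that the expensive pipeline initialization happens once per process rather than once per tweet, so for a mass run it pays to construct the analyzer outside the loop. A sketch of that shape with a hypothetical stand-in class (under JRuby, `com.me.Analyzer.new` from above would take its place):

```ruby
# Hypothetical stand-in for the Java com.me.Analyzer class: expensive to
# construct (the real one loads the CoreNLP models), cheap to call.
class StubAnalyzer
  def analyze_tweet(text)
    text.include?("love") ? "positive" : "neutral"
  end
end

analyzer   = StubAnalyzer.new # construct once: this is the costly step
tweets     = ["i love mondays", "just a tweet"]
sentiments = tweets.map { |t| analyzer.analyze_tweet(t) }
# => ["positive", "neutral"]
```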
