How can I use the Stanford CoreNLP Java library with Ruby for sentiment analysis?

Question

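I'm trying to run sentiment analysis on a large set of tweets stored in a local MongoDB instance, using Ruby on Rails 4, Ruby 2.1.2 and the Mongoid ORM.

I used the freely available https://loudelement-free-natural-language-processing-service.p.mashape.com API on Mashape.com, but it starts timing out after I push a few hundred tweets through it in rapid succession. Clearly it isn't meant for going through tens of thousands of tweets, which is understandable.

Next I thought I'd use the Stanford CoreNLP library promoted here: http://nlp.stanford.edu/sentiment/code.html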

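Apart from using the library from Java 1.8 code, the default usage seems to be XML input and output files. For my use case that is annoying, since I have tens of thousands of short tweets rather than long text files. I would like to use CoreNLP like a method and do a tweets.each type of loop.

I guess one way would be to build an XML file with all of the tweets, get one back out of the Java process, parse it and put the results back into the database, but that feels alien to me and would be a lot of work.

So I was happy to find, on the site linked above, a way to run CoreNLP from the command line and have it accept the text on stdin, so that I don't have to start fiddling with the filesystem and can feed the text in as a parameter instead. However, starting up the JVM separately for each tweet adds a huge overhead compared to using the loudelement free sentiment analysis API.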

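Now, the code I wrote is ugly and slow, but it works. Still, I'm wondering whether there is a better way to run the CoreNLP Java program from within Ruby without having to fiddle with the filesystem (creating temp files and passing them as params) or to write Java code?

Here's the code I'm using: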

def self.mass_analyze_w_corenlp # batch run the method in multiple Ruby processes
  todo = Tweet.all.exists(corenlp_sentiment: false).limit(5000).sort(follow_ratio: -1) # start with the "least spammy" tweets based on follow ratio
  counter = 0

  todo.each do |tweet|
    counter = counter+1

    fork {tweet.analyze_sentiment_w_corenlp} # run the analysis in a separate Ruby process

    if counter >= 5 # once five child processes have been started, wait for them to finish to limit memory use
      Process.waitall
      counter = 0
    end

  end

  Process.waitall # reap any children left over from the final, partial batch
end

def analyze_sentiment_w_corenlp # run the sentiment analysis for each tweet object
  text_to_be_analyzed = self.text.gsub("'"){" "}.gsub('"'){' '} # fetch the text field of the DB item and replace the quotes that would confuse the command line with spaces

  start = "echo '"
  finish = "' | java -cp 'vendor/corenlp/*' -mx250m edu.stanford.nlp.sentiment.SentimentPipeline -stdin"
  command_string = start+text_to_be_analyzed+finish # assemble the command for the command line usage below

  output = `#{command_string}` # run CoreNLP through the shell; backticks capture stdout, whereas system('...') would only return true/false
  to_db = output.gsub(/\s+/, "").downcase # CoreNLP indents its output, so strip all whitespace
  # output is one of "neutral", "positive", "negative" and so on

  puts "Sentiment analysis successful, sentiment is: #{to_db} for tweet #{text_to_be_analyzed}."

  self.corenlp_sentiment = to_db # insert result as a field to the object
  self.save! # sentiment analysis done!
end
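
The quote stripping and the echo pipeline above can be avoided without temp files. The following is a minimal sketch, assuming Ruby's standard Open3 library: the text goes to the child's stdin and the command is given as an argument array, so no shell parses the tweet. The helper name capture_with_stdin is hypothetical, and cat stands in for the java command so the sketch runs anywhere. It still starts one process per call, so it does not remove the JVM startup cost.

```ruby
require 'open3'

# Hypothetical helper: runs an external command and feeds `text` to its
# stdin. Returns the command's stdout as a String.
def capture_with_stdin(command, text)
  # The text travels over stdin rather than the command line, so quotes in
  # it need no escaping or stripping.
  stdout, status = Open3.capture2(*command, stdin_data: text)
  raise "#{command.first} exited with status #{status.exitstatus}" unless status.success?
  stdout
end

# `cat` stands in for the real command here; for CoreNLP the array would
# hold the same arguments as the command line above, for example:
#   ['java', '-cp', 'vendor/corenlp/*', '-mx250m',
#    'edu.stanford.nlp.sentiment.SentimentPipeline', '-stdin']
puts capture_with_stdin(['cat'], %q(It's a "quoted" tweet))
```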

Answer 1

By using IO.popen to open and communicate with the external process, you can at least avoid the ugly and dangerous command line, for example:

input_string = "
foo
bar
baz
"

output_string =
    IO.popen("grep 'foo'", 'r+') do |pipe|
        pipe.write(input_string)
        pipe.close_write
        pipe.read
    end

puts "grep said #{output_string.strip} but not bar"

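EDIT: To avoid the overhead of reloading the Java program for every item, you can open the pipe around the todo.each loop and communicate with the process like this: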

inputs = ['a', 'b', 'c', 'd']

IO.popen('cat', 'r+') do |pipe|

    inputs.each do |s|
        pipe.write(s + "\n")
        out = pipe.readline

        puts "cat said '#{out.strip}'"
    end
end

That is, if the Java program supports such line-buffered "batch" input. If it doesn't, however, modifying it to do so should not be very difficult.
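
The same idea can be wrapped in a small object so the child process is started once and reused for every tweet. This is only a sketch under the assumption stated above: the child must answer each input line with exactly one output line and flush it promptly. Whether SentimentPipeline -stdin behaves that way would need to be checked. The class name LineWorker is hypothetical, and cat again stands in for the Java command.

```ruby
# Hypothetical wrapper around one long-lived child process that answers
# every input line with exactly one output line.
class LineWorker
  def initialize(command)
    # An argument array makes IO.popen start the program without a shell.
    @pipe = IO.popen(command, 'r+')
    @pipe.sync = true # push each line to the child immediately
  end

  # Sends one line and blocks until the child replies with one line.
  def ask(text)
    # Embedded newlines would be read as several inputs, so flatten them.
    @pipe.puts(text.gsub(/\s+/, ' ').strip)
    @pipe.readline.strip
  end

  # Closes the pipe, which signals EOF to the child and waits for it to exit.
  def close
    @pipe.close
  end
end

worker = LineWorker.new(['cat'])
replies = ['a', 'b', "c\nd"].map { |s| worker.ask(s) }
worker.close
puts replies.inspect
```

With a worker like this, the loop over tweets would call ask once per tweet and close once at the end, so the JVM would start a single time.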

Answer 2

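As suggested in the comments by @Qualtagh, I decided to use JRuby.

I first attempted to use Java with MongoDB as the interface (read directly from MongoDB, analyze with Java / CoreNLP and write back to MongoDB), but the MongoDB Java driver was more complicated to use than the Mongoid ORM I use with Ruby, which is why I felt JRuby was more appropriate.

Doing a REST service in Java would have required me to first learn how to do a REST service in Java, which might have been easy, or not. I didn't want to spend time figuring that out.

So the code I needed to run my function was: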

  def analyze_tweet_with_corenlp_jruby
    require 'java'
    require 'vendor/CoreNLPTest2.jar' # I made this Java JAR with IntelliJ IDEA that includes both CoreNLP and my initialization class

    analyzer = com.me.Analyzer.new # this is the Java class I made for running the CoreNLP analysis, it initializes the CoreNLP with the correct annotations etc.
    result = analyzer.analyzeTweet(self.text) # self.text is where the text-to-be-analyzed resides

    self.corenlp_sentiment = result # adds the result into this field in the MongoDB model
    self.save!
    return "#{result}: #{self.text}" # for debugging purposes
  end
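
The method above builds a new com.me.Analyzer for every tweet. If, as its comment says, the constructor initializes CoreNLP, that setup is repeated on each call. One possible refinement is to build the analyzer once and reuse it. The sketch below assumes the analyzer can safely be reused across calls (thread safety is not considered). The names SentimentService and StubAnalyzer are hypothetical, and the stub replaces the Java class so the example runs without JRuby.

```ruby
# Hypothetical holder that builds the analyzer once and reuses it.
class SentimentService
  def initialize(&factory)
    @factory = factory # block that creates the (expensive) analyzer
    @analyzer = nil
  end

  def analyze(text)
    @analyzer ||= @factory.call # built on first use only
    @analyzer.analyzeTweet(text)
  end
end

# Stub standing in for the Java class so the sketch runs without JRuby.
# Under JRuby the block would instead be { com.me.Analyzer.new }.
class StubAnalyzer
  @created = 0
  class << self
    attr_accessor :created
  end

  def initialize
    self.class.created += 1
  end

  # Toy rule for illustration only; not how CoreNLP scores text.
  def analyzeTweet(text)
    text.include?('good') ? 'positive' : 'neutral'
  end
end

service = SentimentService.new { StubAnalyzer.new }
results = ['good day', 'a day'].map { |t| service.analyze(t) }
puts "#{results.inspect}, analyzers built: #{StubAnalyzer.created}"
```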
