
How to use Stanford CoreNLP java library with Ruby for sentiment analysis?

I'm trying to do sentiment analysis on a large corpus of tweets in a local MongoDB instance with Ruby on Rails 4, Ruby 2.1.2, and the Mongoid ORM.

I've used the freely available https://loudelement-free-natural-language-processing-service.p.mashape.com API on Mashape.com, however it starts timing out after pushing through a few hundred tweets in rapid-fire sequence -- clearly it isn't meant for churning through tens of thousands of tweets, and that's understandable.

So next I thought I'd use the Stanford CoreNLP library promoted here: http://nlp.stanford.edu/sentiment/code.html

The default usage, in addition to using the library from Java 1.8 code, seems to be XML input and output files. For my use case this is annoying, given I have tens of thousands of short tweets as opposed to long text files. I would want to use CoreNLP like a method and do a tweets.each type of loop.

I guess one way would be to construct an XML file containing all of the tweets, get one back out of the Java process, parse that, and put the results back into the DB, but that feels alien to me and would be a lot of work.

So, I was happy to find, on the site linked above, a way to run CoreNLP from the command line and have it accept text on stdin, so that I didn't have to start fiddling with the filesystem but could feed the text in directly. However, starting up the JVM separately for each tweet adds a huge overhead compared to using the loudelement free sentiment analysis API.

Now, the code I wrote is ugly and slow, but it works. Still, I'm wondering if there's a better way to run the CoreNLP java program from within Ruby without having to start fiddling with the filesystem (creating temp files and passing them as params) or writing Java code?

Here's the code I'm using:

def self.mass_analyze_w_corenlp # batch run the method in multiple Ruby processes
  todo = Tweet.all.exists(corenlp_sentiment: false).limit(5000).sort(follow_ratio: -1) # start with the "least spammy" tweets based on follow ratio
  counter = 0

  todo.each do |tweet|
    counter += 1

    fork {tweet.analyze_sentiment_w_corenlp} # run the analysis in a separate Ruby process

    if counter >= 5 # when five concurrent processes are running, wait until they finish to preserve memory
      Process.waitall
      counter = 0
    end

  end
end

def analyze_sentiment_w_corenlp # run the sentiment analysis for each tweet object
  text_to_be_analyzed = self.text.gsub("'"){" "}.gsub('"'){' '} # fetch the text field of the DB item and strip quotes that confuse the command line

  start = "echo '"
  finish = "' | java -cp 'vendor/corenlp/*' -mx250m edu.stanford.nlp.sentiment.SentimentPipeline -stdin"
  command_string = start+text_to_be_analyzed+finish # assemble the command for the command line usage below

  output =`#{command_string}` # run the CoreNLP on the command line, equivalent to system('...')
  to_db = output.gsub(/\s+/, "").downcase # since CoreNLP uses indentation, remove unnecessary whitespace
  # output is one of "neutral", "positive", "negative", and so on

  puts "Sentiment analysis successful, sentiment is: #{to_db} for tweet #{text_to_be_analyzed}."

  self.corenlp_sentiment = to_db # insert result as a field to the object
  self.save! # sentiment analysis done!
end
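Interpolating tweet text straight into a shell command is fragile: a backtick or `$()` sequence in a tweet could break the command or even execute as shell code. A safer variant is to pass the command as an argv array and feed the text via stdin, sketched here with Ruby's standard Open3 (the portable `tr` command stands in for the Java pipeline, whose exact argv is only assumed from the command string above):

```ruby
require 'open3'

# Feed `text` to an external command on stdin and return its stdout.
# The command is an argv array, so no shell is involved and quotes or
# backticks inside tweets cannot be interpreted as shell syntax.
def run_filter(cmd, text)
  out, status = Open3.capture2(*cmd, stdin_data: text)
  raise "command failed: #{cmd.join(' ')}" unless status.success?
  out
end

# `tr` is a stand-in here; the real call would presumably look like:
#   run_filter(%w[java -cp vendor/corenlp/* -mx250m
#                 edu.stanford.nlp.sentiment.SentimentPipeline -stdin], tweet_text)
puts run_filter(%w[tr a-z A-Z], %q{it's a "quoted" `tweet`})  # → IT'S A "QUOTED" `TWEET`
```

Note that the `vendor/corenlp/*` classpath wildcard still works without a shell, since Java expands `-cp` wildcards itself.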

You can at least avoid the ugly and dangerous command line stuff by using IO.popen to open and communicate with the external process, for example:

input_string = "
foo
bar
baz
"

output_string =
    IO.popen("grep 'foo'", 'r+') do |pipe|
        pipe.write(input_string)
        pipe.close_write
        pipe.read
    end

puts "grep said #{output_string.strip} but not bar"

EDIT: to avoid the overhead of reloading the Java program for each item, you can open the pipe around the todo.each loop and communicate with the process like this:

inputs = ['a', 'b', 'c', 'd']

IO.popen('cat', 'r+') do |pipe|

    inputs.each do |s|
        pipe.write(s + "\n")
        out = pipe.readline

        puts "cat said '#{out.strip}'"
    end
end
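Applied to the tweets, the whole batch then shares one long-lived child process. A rough sketch, with `cat` again standing in for the SentimentPipeline command (this assumes the Java side emits exactly one line of output per line of input):

```ruby
tweets = ["great day!", "awful service", "just ok"]
sentiments = {}

IO.popen('cat', 'r+') do |pipe|
  pipe.sync = true  # flush each written line immediately so readline doesn't deadlock
  tweets.each do |text|
    pipe.puts(text.delete("\n"))      # one tweet per line; drop embedded newlines
    sentiments[text] = pipe.readline.strip
  end
end

sentiments.each { |text, s| puts "#{s}: #{text}" }
```

With `cat` each tweet simply comes back unchanged; with the real pipeline the value would be the sentiment label for that line.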

That is, if the Java program supports such line-buffered "batch" input. If it doesn't, it should not be very difficult to modify it to do so.

As suggested in the comments by @Qualtagh, I decided to use JRuby.

I first attempted to use Java with MongoDB as the interface (read directly from MongoDB, analyze with Java / CoreNLP, and write back to MongoDB), but the MongoDB Java Driver was more complex to use than the Mongoid ORM I use with Ruby, which is why I felt JRuby was more appropriate.

Doing a REST service in Java would have required me to first learn how to build a REST service in Java, which might have been easy, or then again maybe not. I didn't want to spend time figuring that out.

So the code I needed in order to run my analysis was:

  def analyze_tweet_with_corenlp_jruby
    require 'java'
    require 'vendor/CoreNLPTest2.jar' # I made this Java JAR with IntelliJ IDEA that includes both CoreNLP and my initialization class

    analyzer = com.me.Analyzer.new # this is the Java class I made for running the CoreNLP analysis, it initializes the CoreNLP with the correct annotations etc.
    result = analyzer.analyzeTweet(self.text) # self.text is where the text-to-be-analyzed resides

    self.corenlp_sentiment = result # adds the result into this field in the MongoDB model
    self.save!
    return "#{result}: #{self.text}" # for debugging purposes
  end
