How can I resume indexing documents in Elasticsearch (Rails)?

Question

I ran the command rake environment elasticsearch:import:model CLASS='AutoPartsMapper' FORCE=true to index my documents. My database has 1000000 records =) ... and indexing them takes (I think) a day ... While the indexing was running, my computer shut down ... (I had indexed 2000000 documents). Is it possible to continue indexing the documents from where it stopped?

Answer 1

As far as I know, elasticsearch-rails has no such feature, but you can write a simple task to do it.

namespace :es do
  task :populate, [:start_id] => :environment do |_, args|
    start_id = args[:start_id].to_i

    AutoPartsMapper.where('id > ?', start_id).order(:id).find_each do |record|
      puts "Processing record ##{record.id}"
      record.__elasticsearch__.index_document
    end
  end
end

Run bundle exec rake es:populate[<start_id>], passing the id of the last record that was already indexed; the task continues with the records after it.

Note that this is a simple solution and it will be much slower than batch indexing.
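To give a rough sense of why the per-record task is slower, here is a minimal sketch that only counts round trips; it does not talk to Elasticsearch. The helper names `requests_per_document` and `requests_in_bulk` are hypothetical, and the figure of 2500 records is made up for illustration.

```ruby
# Hypothetical helpers: they count requests instead of sending them.

# index_document is called once per record, so one request per record.
def requests_per_document(record_ids)
  record_ids.size
end

# A bulk import sends one request per batch of records.
def requests_in_bulk(record_ids, batch_size)
  record_ids.each_slice(batch_size).count
end

ids = (1..2500).to_a
puts requests_per_document(ids)  # 2500
puts requests_in_bulk(ids, 1000) # 3
```

The actual speed-up depends on document size and cluster load, so treat the ratio as an upper-level intuition only.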

Update

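Here is a batch indexing task. It is much faster and automatically detects the record to continue from. It does assume that the previously imported records were processed in increasing id order with no gaps. I have not tested it, but most of the code comes from a production system.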

namespace :es do
  task :populate_auto => :environment do |_, args|
    start_id = get_max_indexed_id
    # find_in_batches must come last in the chain; it always iterates in primary-key order
    AutoPartsMapper.where('id > ?', start_id).find_in_batches(batch_size: 1000) do |records|
      elasticsearch_bulk_index(records)
    end
  end

  def get_max_indexed_id
    AutoPartsMapper.search(aggs: {max_id: {max: {field: :id }}}, size: 0).response[:aggregations][:max_id][:value].to_i
  end

  def elasticsearch_bulk_index(records)
    return if records.empty?
    klass = records.first.class
    klass.__elasticsearch__.client.bulk({
      index: klass.__elasticsearch__.index_name,
      type: klass.__elasticsearch__.document_type,
      body: elasticsearch_records_to_index(records)
    })
  end

  def elasticsearch_records_to_index(records)
    records.map do |record|
      payload = { _id: record.id, data: record.as_indexed_json }
      { index: payload }
    end
  end
end
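The resume logic and its "no gaps" assumption can be tried out without a database or a cluster. The following is only an in-memory sketch: `Record`, `fake_bulk`, `max_indexed_id` and `resume_import` are hypothetical stand-ins for the model, `client.bulk`, the max aggregation and the rake task, and a plain Hash plays the role of the index.

```ruby
# Hypothetical stand-in for the ActiveRecord model.
Record = Struct.new(:id, :name) do
  def as_indexed_json
    { 'id' => id, 'name' => name }
  end
end

# Builds the same body shape as the task above passes to client.bulk.
def bulk_body(records)
  records.map { |r| { index: { _id: r.id, data: r.as_indexed_json } } }
end

# Stand-in for client.bulk: stores each document under its _id.
def fake_bulk(index, body)
  body.each { |op| index[op[:index][:_id]] = op[:index][:data] }
end

# Stand-in for the max aggregation: highest indexed id, 0 for an empty index.
def max_indexed_id(index)
  index.keys.max.to_i
end

# Indexes, in batches, every record whose id is above the highest indexed id.
def resume_import(table, index, batch_size: 2)
  start_id = max_indexed_id(index)
  table.select { |r| r.id > start_id }
       .sort_by(&:id)
       .each_slice(batch_size) { |batch| fake_bulk(index, bulk_body(batch)) }
end

table = (1..6).map { |i| Record.new(i, "part-#{i}") }

# Interrupted run: ids 1..3 reached the index, then the import resumes.
index = {}
fake_bulk(index, bulk_body(table.first(3)))
resume_import(table, index)
puts index.keys.sort.inspect # [1, 2, 3, 4, 5, 6]

# Gap: ids 1, 2 and 4 were indexed, so resuming after 4 never picks up 3.
gappy = {}
fake_bulk(gappy, bulk_body(table.values_at(0, 1, 3)))
resume_import(table, gappy)
puts gappy.keys.sort.inspect # [1, 2, 4, 5, 6]
```

The second run shows why the assumption matters: if the earlier import skipped a record, resuming from the maximum id leaves it unindexed.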

Answer 2

If you are on Rails 4.2+, you can use ActiveJob to schedule the indexing and let it run in the background. First, generate the job with:

bin/rails generate job elastic_search_index

This gives you a job class with a perform method:

class ElasticSearchIndexJob < ApplicationJob
  def perform
    # Implement the indexing here.
    # force: true drops and recreates the index, so this re-imports everything from scratch.
    AutoPartsMapper.__elasticsearch__.create_index! force: true
    AutoPartsMapper.__elasticsearch__.import
  end
end

Set Sidekiq as your Active Job queue adapter and start the job from the console with:

ElasticSearchIndexJob.perform_later

This enqueues the job, which runs as soon as a worker is free, and it releases your console. You can leave it running and check on the process later from bash:

ps aux | grep side

This gives you something like: sidekiq 4.1.2 app [1 of 12 busy]
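For the step of setting Sidekiq as the Active Job adapter, a minimal configuration sketch might look like the following. It assumes the sidekiq gem is already in your Gemfile; the application module name `MyApp` is hypothetical, so use the one from your own config/application.rb.

```ruby
# config/application.rb (excerpt)
module MyApp # hypothetical application name
  class Application < Rails::Application
    # Hand jobs enqueued through Active Job to Sidekiq.
    config.active_job.queue_adapter = :sidekiq
  end
end
```

A Sidekiq process (for example bundle exec sidekiq) has to be running for enqueued jobs to be picked up.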

Have a look at this post, which explains how the two fit together:

http://ruby-journal.com/how-to-integrate-sidekiq-with-activejob/

Hope this helps.
