
How to continue indexing documents in elasticsearch (rails)?

So I ran this command, rake environment elasticsearch:import:model CLASS='AutoPartsMapper' FORCE=true, to index my documents. In my database I have 1000000 records =)... I think it will take about a day to index all of them... While the indexing was running, my computer shut down... (I had indexed 2000000 documents). Is it possible to continue indexing the documents from where it stopped?

There is no such functionality in elasticsearch-rails afaik, but you could write a simple task to do that.

namespace :es do
  task :populate, [:start_id] => :environment do |_, args|
    start_id = args[:start_id].to_i

    # find_each walks the table in primary-key (id) order, one record at a time
    AutoPartsMapper.where('id > ?', start_id).find_each do |record|
      puts "Processing record ##{record.id}"
      record.__elasticsearch__.index_document
    end
  end
end

Start it with bundle exec rake es:populate[<start_id>], passing the id of the record from which to start the next batch.

Note that this is a simplistic solution which will be much slower than batch indexing.
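Alternatively, if you already know roughly where the previous run stopped, you could reuse the gem's built-in bulk importer and restrict it to the remaining records. A minimal sketch, not part of the original answer, assuming your version of elasticsearch-model supports the query: option of the importing module:

# Bulk-import only the records that have not been indexed yet.
# start_id is a placeholder for the last id you know was indexed.
start_id = 2_000_000
AutoPartsMapper.import(query: -> { where('id > ?', start_id) })

Unlike FORCE=true, this does not recreate the index, so the documents that were already imported stay in place.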

UPDATE

Here is a batch indexing task. It is much faster and automatically detects the record from which to continue. It assumes that previously imported records were processed in increasing id order and without gaps. I haven't tested it, but most of the code is from a production system.

namespace :es do
  task :populate_auto => :environment do
    start_id = get_max_indexed_id
    # find_in_batches iterates in primary-key (id) order; bulk-index each batch
    AutoPartsMapper.where('id > ?', start_id).find_in_batches(batch_size: 1000) do |records|
      elasticsearch_bulk_index(records)
    end
  end

  # Ask Elasticsearch for the highest id that has already been indexed
  def get_max_indexed_id
    AutoPartsMapper.search(aggs: { max_id: { max: { field: :id } } }, size: 0).response[:aggregations][:max_id][:value].to_i
  end

  def elasticsearch_bulk_index(records)
    return if records.empty?
    klass = records.first.class
    klass.__elasticsearch__.client.bulk({
      index: klass.__elasticsearch__.index_name,
      type: klass.__elasticsearch__.document_type,
      body: elasticsearch_records_to_index(records)
    })
  end

  def elasticsearch_records_to_index(records)
    records.map do |record|
      payload = { _id: record.id, data: record.as_indexed_json }
      { index: payload }
    end
  end
end
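Assuming the task above lives in a .rake file under lib/tasks, you would start it the same way as the first one, just without any arguments:

bundle exec rake es:populate_auto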

If you use Rails 4.2+ you can use ActiveJob to schedule the indexing and leave it running. First, generate the job with:

bin/rails generate job elastic_search_index

This will give you a job class with a perform method:

class ElasticSearchIndexJob < ApplicationJob
  def perform
    # Recreate the index from scratch and import all records
    AutoPartsMapper.__elasticsearch__.create_index! force: true
    AutoPartsMapper.__elasticsearch__.import
  end
end
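For perform_later to actually run in the background, ActiveJob needs a queue adapter. A minimal sketch of the standard Rails setting, assuming Sidekiq is already in your Gemfile (YourApp is a placeholder for your application's module name):

# config/application.rb -- route ActiveJob jobs to Sidekiq
module YourApp
  class Application < Rails::Application
    config.active_job.queue_adapter = :sidekiq
  end
end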

With Sidekiq set as your ActiveJob adapter, initiate the job from the console with:

ElasticSearchIndexJob.perform_later

This will enqueue the job and run it on the next free worker, while freeing your console. You can leave it running and check the process in bash later:

ps aux | grep side

This will give you something like: sidekiq 4.1.2 app[1 of 12 busy]

Have a look at this post, which explains how Sidekiq and ActiveJob fit together:

http://ruby-journal.com/how-to-integrate-sidekiq-with-activejob/

Hope it helps


 