
How to continue indexing documents in elasticsearch(rails)?

So I ran this command, rake environment elasticsearch:import:model CLASS='AutoPartsMapper' FORCE=true, to index my documents. In my database I have 1000000 records =) ... I think it will take about a day to index them. While the indexing was running, my computer shut down (it had indexed 2000000 documents). Is it possible to continue indexing the documents from where it stopped?

There is no such functionality in elasticsearch-rails as far as I know, but you could write a simple task to do it.

namespace :es do
  task :populate, [:start_id] => :environment do |_, args|
    start_id = args[:start_id].to_i

    AutoPartsMapper.where('id > ?', start_id).order(:id).find_each do |record|
      puts "Processing record ##{record.id}"
      record.__elasticsearch__.index_document
    end
  end
end

Start it with bundle exec rake es:populate[<start_id>] passing the id of the record from which to start the next batch.

Note that this is a simplistic solution which will be much slower than batch indexing.
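As an aside, the import method that elasticsearch-model mixes into your model also accepts a query option (a lambda merged into the default scope), so the same resume-from-id idea can reuse the built-in batched importer instead of a hand-rolled task. A sketch, assuming the model includes Elasticsearch::Model and last_id holds the last successfully indexed id; this is a Rails console fragment, not runnable standalone:

```ruby
# Resume a bulk import from a known id using the built-in importer.
# import fetches records in batches (1000 per batch by default) and
# bulk-indexes each batch, so it is much faster than one-by-one indexing.
last_id = 2_000_000
AutoPartsMapper.import(query: -> { where('id > ?', last_id) })
```

Unlike FORCE=true, this does not recreate the index, so already-indexed documents are left in place.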

UPDATE

Here is a batch indexing task. It is much faster and automatically detects the record from which to continue. It assumes that previously imported records were processed in increasing id order and without gaps. I haven't tested this exact task, but most of the code comes from a production system.

namespace :es do
  task :populate_auto => :environment do
    start_id = get_max_indexed_id
    # find_in_batches iterates in primary key order, so no explicit order is needed
    AutoPartsMapper.where('id > ?', start_id).find_in_batches(batch_size: 1000) do |records|
      elasticsearch_bulk_index(records)
    end
  end

  def get_max_indexed_id
    AutoPartsMapper.search(aggs: {max_id: {max: {field: :id }}}, size: 0).response[:aggregations][:max_id][:value].to_i
  end

  def elasticsearch_bulk_index(records)
    return if records.empty?
    klass = records.first.class
    klass.__elasticsearch__.client.bulk({
      index: klass.__elasticsearch__.index_name,
      type: klass.__elasticsearch__.document_type,
      body: elasticsearch_records_to_index(records)
    })
  end

  def elasticsearch_records_to_index(records)
    records.map do |record|
      payload = { _id: record.id, data: record.as_indexed_json }
      { index: payload }
    end
  end
end
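For reference, the bulk body built by elasticsearch_records_to_index is just an array of index actions, each pairing an action header with the document payload. A minimal sketch with a stand-in record (a Struct instead of a real ActiveRecord model, so it runs without Rails or Elasticsearch) shows the shape the client's bulk call expects:

```ruby
require 'json'

# Stand-in for an ActiveRecord model; a real record would get
# as_indexed_json from elasticsearch-model.
Part = Struct.new(:id, :name) do
  def as_indexed_json
    { 'id' => id, 'name' => name }
  end
end

# Same mapping as the rake task's helper: one index action per record.
def records_to_bulk_body(records)
  records.map do |record|
    { index: { _id: record.id, data: record.as_indexed_json } }
  end
end

body = records_to_bulk_body([Part.new(1, 'brake pad'), Part.new(2, 'oil filter')])
puts JSON.pretty_generate(body.first)
```

Because the _id is taken from the database id, re-running a batch simply overwrites the same documents, which is what makes resuming safe.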

If you use Rails 4.2+ you can use Active Job to schedule the indexing and leave it running in the background. First, generate a job:

bin/rails generate job elastic_search_index

This will give you a class with a perform method (note that force: true recreates the index and re-imports everything from scratch rather than resuming):

class ElasticSearchIndexJob < ApplicationJob
  def perform
    # implement the indexing here
    AutoPartsMapper.__elasticsearch__.create_index! force: true
    AutoPartsMapper.__elasticsearch__.import
  end
end

Set Sidekiq as your Active Job queue adapter, then initiate the job from the console with:

ElasticSearchIndexJob.perform_later
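Setting the adapter is one line of configuration. A sketch, assuming the sidekiq gem is already in your Gemfile; the module name YourApp is a placeholder for your application's:

```ruby
# config/application.rb (fragment)
module YourApp
  class Application < Rails::Application
    # Route Active Job jobs through Sidekiq instead of the default adapter
    config.active_job.queue_adapter = :sidekiq
  end
end
```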

This enqueues the job and executes it on the next free worker, freeing up your console. You can leave it running and check on the process from bash later:

ps aux | grep side

This will give you something like: sidekiq 4.1.2 app[1 of 12 busy]

Have a look at this post, which explains the Sidekiq and Active Job integration:

http://ruby-journal.com/how-to-integrate-sidekiq-with-activejob/

Hope it helps
