Working with large dataset imports in Ruby/Rails

I'm currently working on a Ruby/Rails project that imports invoices into the database, and I'm trying to maximise the efficiency of the process, which is far too slow right now.

For an import batch with 100,000 rows it takes around 2.5-3 hours to process and save each record in the database.

//// Ruby code ////

class DeleteImportStrategy
  def pre_process(merchant_prefix, channel_import)
    # channel needed to identify invoices, so an import from another channel cannot collide if it had the same merchant_prefix
    Jzbackend::Invoice.where(merchant_prefix: merchant_prefix, channel: channel_import.channel).delete_all
    # get rid of all previous import batches, which become empty after delete_import_strategy
    Jzbackend::Import.where.not(id: channel_import.id).where(channel: channel_import.channel).destroy_all
  end

  def process_row(row, channel_import)
    debt_claim = Jzbackend::Invoice.new
    debt_claim.import = channel_import
    debt_claim.status = 'pending'
    debt_claim.channel = channel_import.channel
    debt_claim.merchant_prefix = row[0]
    debt_claim.debt_claim_number = row[1]
    debt_claim.amount = Monetize.parse(row[2])
    debt_claim.print_date = row[3]
    debt_claim.first_name = row.try(:[], 4)
    debt_claim.last_name = row.try(:[], 5)
    debt_claim.address = row.try(:[], 6)
    debt_claim.postal_code = row.try(:[], 7)
    debt_claim.city = row.try(:[], 8)
    debt_claim.save
  end
end

//// ////

So for each import batch that comes in as CSV, I get rid of the previous batches and start importing the new one by reading each row and inserting it into the new Import as Invoice records. However, 2.5-3 hours for 100,000 entries seems excessive. How can I optimise this process? I'm sure it's definitely not efficient this way.

Edited: It has been a long time since I posted this, but just to note, I ended up using the activerecord-import library, which has worked pretty well since then. However, note that its :on_duplicate_key_update functionality is only available in PostgreSQL 9.5+.
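For reference, a minimal sketch of what the batched import can look like with activerecord-import. The attribute names mirror the process_row code above; csv_rows, validate: false and batch_size: 1000 are assumptions to adapt to your own schema and validation needs:

require 'activerecord-import'

# Build the records in memory instead of saving them one by one.
invoices = csv_rows.map do |row|
  Jzbackend::Invoice.new(
    import:            channel_import,
    status:            'pending',
    channel:           channel_import.channel,
    merchant_prefix:   row[0],
    debt_claim_number: row[1],
    amount:            Monetize.parse(row[2]),
    print_date:        row[3],
    first_name:        row[4],
    last_name:         row[5],
    address:           row[6],
    postal_code:       row[7],
    city:              row[8]
  )
end

# One multi-row INSERT per 1000 records instead of one INSERT per record.
Jzbackend::Invoice.import(invoices, validate: false, batch_size: 1000)

On PostgreSQL 9.5+ the same import call also accepts on_duplicate_key_update, if upserting ever becomes preferable to the delete-first strategy.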

First rule of mass imports: batch, batch, batch.

You're saving each row separately. This incurs HUGE overhead. Say the insert itself takes 1ms, but the roundtrip to the database is 5ms. Total time used: 6ms. For 1000 records that's 6000ms, or 6 seconds.

Now imagine that you use a mass insert, where you send data for multiple rows in the same statement. It looks like this:

INSERT INTO users (name, age)
VALUES ('Joe', 20), ('Moe', 22), ('Bob', 33), ...

Let's say you send data for 1000 rows in this one request. The request itself takes 1000ms (in reality it will likely be considerably quicker, too: less overhead for parsing the query, preparing the execution plan, etc.). Total time taken is 1000ms + 5ms. That's at least a 6x reduction! (In real projects of mine, I was observing 100x-200x reductions.)
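In Ruby terms, the same idea looks roughly like the sketch below. It assumes Rails 6+, where insert_all emits exactly that kind of multi-row INSERT; the users table and csv_rows variable are placeholders matching the SQL example above, and on older Rails a gem such as activerecord-import gives the same effect:

# Build plain attribute hashes instead of saving one record at a time.
rows = csv_rows.map do |row|
  { name: row[0], age: row[1].to_i }
end

# Send the rows in slices of 1000: one round trip per slice, not per row.
# Note that insert_all skips validations and callbacks, so clean the data first.
rows.each_slice(1000) do |slice|
  User.insert_all(slice)
end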
