
Rails / Heroku - How to create a background job for a process that requires file upload

I run my Rails app on Heroku. I have an admin dashboard that allows for creating new objects in bulk through a custom CSV uploader. Ultimately I'll be uploading CSVs with 10k-35k rows. The parser works perfectly in my dev environment, and 20k+ entries are successfully created by uploading the CSV. On Heroku, however, I run into H12 errors (request timeout). This makes sense, since the files are so large and so many objects are being created. To get around this I tried some simple fixes: scaling up the dyno power on Heroku and cutting the CSV file down to 2,500 rows. Neither did the trick.

I tried to use my delayed_job implementation, combined with adding a worker dyno to my Procfile, to `.delay` the file upload and processing so that the web request wouldn't time out waiting for the file to process. This fails, though, because the background process relies on a CSV upload that is held in memory at the time of the web request, so the background job doesn't have the file when it executes.

It seems like what I might need to do is:

  1. Execute the upload of the CSV to S3 as a background process
  2. Schedule the processing of the CSV file as a background job
  3. Make sure the CSV parser knows how to find the file on S3
  4. Parse and finish
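A minimal sketch of steps 2-4, assuming the file has already landed in S3 and the job only receives the file's S3 key. The `ProductCsvImport` class, the injected `fetcher`, and the column names are all hypothetical; with delayed_job, the idea is to enqueue this with the small S3 key rather than the file body, so the job survives serialization into the jobs table:

```ruby
require "csv"

# Hypothetical import job: given an S3 key, fetch the CSV body and build
# one attribute hash per row. The fetcher is injected (any callable that
# maps an S3 key to a raw CSV string), so the same class works with a
# real S3 client in production and a stub in tests.
class ProductCsvImport
  def initialize(s3_key, fetcher:)
    @s3_key = s3_key
    @fetcher = fetcher
  end

  # Returns the parsed rows as hashes; in a real job you would create
  # records here instead, e.g. Product.create!(row.to_h) per row.
  def perform
    raw = @fetcher.call(@s3_key)
    CSV.parse(raw, headers: true).map(&:to_h)
  end
end
```

With delayed_job the enqueue could look like `ProductCsvImport.new(key, fetcher: my_fetcher).delay.perform` (names assumed); only the key is marshalled, not the upload held in request memory.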

This solution isn't 100% ideal, as the admin user who uploads the file will essentially get an "ok, you sent the instructions" confirmation without good visibility into whether the process is executing properly. But I can handle that and fix it later if it gets the job done.

tl;dr question

Assuming the above solution is the right/recommended approach, how can I structure it properly? I am mostly unclear on how to schedule/create a delayed_job entry that knows where to find a CSV file uploaded to S3 via Carrierwave. Any and all help much appreciated.

Please request any code that's helpful.

I've primarily used Sidekiq to queue asynchronous processes on Heroku.

This link is also a great resource to help you get started with implementing Sidekiq on Heroku.
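Whichever queue you use, Heroku only runs background jobs if the Procfile declares a worker process alongside the web process. A minimal example (the puma command and the `-c 5` concurrency flag are placeholders for your own setup):

```
web: bundle exec puma -C config/puma.rb
worker: bundle exec sidekiq -c 5
```

With delayed_job instead of Sidekiq, the worker line would be `worker: bundle exec rake jobs:work`.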

You can put the files that need to be processed in a specific S3 bucket, which eliminates the need to pass file names to the background job.

The background job can then fetch files from that S3 bucket and start processing.

To provide real-time updates to the user, you can do the following:

  1. Use memcached to maintain the status; the background job should keep updating the status information as it runs. If you are not familiar with caching, you can use a db table instead.

  2. Include JavaScript/jQuery in the user response. This script should make AJAX requests for the status information and show live updates to the user. But if it is a big file, the user may not want to wait for the job to finish, in which case it is better to provide a query interface for checking job status.

  3. The background job should delete/move the file from the bucket on completion.
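One way to sketch the status tracking from step 1: a small tracker with an injectable store. In production the store could be `Rails.cache` (memcached-backed) or a db row; here a plain Hash stands in, and all names are hypothetical:

```ruby
# Tracks progress of a CSV import so the UI can poll for it.
# The store only needs []= and [] (Rails.cache.write/read or a db
# record would sit behind the same interface).
class ImportStatus
  def initialize(import_id, store:)
    @key = "csv_import/#{import_id}"
    @store = store
  end

  def update(processed:, total:, state: "running")
    @store[@key] = { state: state, processed: processed, total: total }
  end

  def finish!(total)
    update(processed: total, total: total, state: "done")
  end

  def read
    @store[@key]
  end
end
```

The background job would call `update` every few hundred rows and `finish!` at the end (after deleting the S3 object, per step 3); a controller action can render `read` as JSON for the polling script from step 2.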

In our app, we let users import data for multiple models, and we developed a generic design for it. We maintain the status information in the db, since we perform some analytics on it. If you are interested, here is a blog article that describes our design: http://koradainc.com/blog/. The design does not cover the background process or S3, but combined with the steps above it should give you a full solution.
