
Respond with a large number of objects through a Rails API

I currently have an API for one of my projects, and a service that is responsible for generating export files as CSVs, archiving them, and storing them somewhere in the cloud.

Since my API is written in Rails and my service in plain Ruby, I use the Her gem in the service to interact with the API. But I find my current implementation underperforms, since I do a Model.all in my service, which in turn triggers a request whose response may contain far too many objects.
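For context, a minimal sketch of this kind of setup with Her, where the Record model and the API URL are hypothetical stand-ins; the single Record.all call is what pulls the entire table over one HTTP response:

    require "her"

    Her::API.setup url: "https://api.example.com" do |c|
      c.use Faraday::Request::UrlEncoded       # encode request params
      c.use Her::Middleware::DefaultParseJSON  # parse JSON responses
      c.use Faraday::Adapter::NetHttp          # plain Net::HTTP adapter
    end

    class Record
      include Her::Model
    end

    # Issues a single GET /records and materializes every row at once,
    # which is what makes the export slow and memory-hungry.
    records = Record.all.to_a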

I am curious how to improve this whole task. Here's what I've thought of:

  • implement pagination at the API level and call Model.where(page: xxx) from my service (see the sketch after this list);
  • generate the actual CSV at the API level and send it back to the service (this may be done synchronously or asynchronously).
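A sketch of the first approach on the service side, assuming the API accepts hypothetical page and per_page params (on the Rails side, a gem like kaminari can serve them with Record.page(n).per(500)) and that Record has id/name/email columns:

    require "csv"
    require "her"  # Record is the Her model from the snippet above

    CSV.open("export.csv", "w") do |csv|
      csv << %w[id name email]  # assumed columns
      page = 1
      loop do
        # Her turns these params into GET /records?page=1&per_page=500
        batch = Record.where(page: page, per_page: 500).to_a
        break if batch.empty?

        batch.each { |r| csv << [r.id, r.name, r.email] }
        page += 1
      end
    end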

If I were to use the first approach, how many objects should I retrieve per page? How big should a response be?

If I were to use the second approach, it would add quite an overhead to the request (and I guess API requests shouldn't take that long), and I also wonder whether it's really the API's job to do this.
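For reference, one way to keep the second approach's overhead manageable is to stream the CSV from the API instead of buffering it in memory. A minimal sketch, assuming a hypothetical ExportsController and the same Record columns as above:

    require "csv"

    class ExportsController < ApplicationController
      def index
        headers["Content-Type"]        = "text/csv"
        headers["Content-Disposition"] = 'attachment; filename="export.csv"'
        headers["Last-Modified"]       = Time.now.httpdate  # keeps Rack::ETag from buffering

        # Stream rows as they are read; find_each batches the DB reads,
        # so memory stays flat regardless of table size.
        self.response_body = Enumerator.new do |lines|
          lines << CSV.generate_line(%w[id name email])  # assumed columns
          Record.find_each(batch_size: 1000) do |r|
            lines << CSV.generate_line([r.id, r.name, r.email])
          end
        end
      end
    end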

What approach should I follow? Or is there something better that I'm missing?

You need to pass a lot of information through a Ruby process; that's never simple, and I don't think you're missing anything here.

If you decide to generate CSVs at the API level, then what do you gain by keeping the service? You could ditch the service altogether, because replacing it with an nginx proxy would do the same thing better (if you're just streaming the response from the API host).

If you decide to paginate, there will be a performance hit for sure, but nobody can tell you exactly how much to paginate. Bigger pages will be faster and consume more memory (reducing throughput because you can run fewer workers); smaller pages will be slower and consume less memory, but demand more workers because of IO wait times.

The exact numbers will depend on the IO response times of your API app, the cloud, and your infrastructure. I'm afraid no one can give you a simple answer you can follow without running a stress test, and once you set up a stress test you will get numbers of your own anyway, better than anybody's estimate.

A suggestion: write a bit more about your problem, the constraints you are working under, etc., and maybe someone can help you with a more radical solution. For some reason I get the feeling that what you're really looking for is a background processor like Sidekiq or Delayed Job; or maybe connecting your service to the DB directly through a DB view, if you are anxious to decouple your apps; or an nginx proxy for API responses; or nothing at all. But I really can't tell without more information.
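To make the background-processor idea concrete, a minimal Sidekiq sketch, where Export and CloudStore are hypothetical stand-ins for an export-tracking model and your cloud uploader:

    require "sidekiq"
    require "csv"

    class ExportWorker
      include Sidekiq::Worker

      def perform(export_id)
        export = Export.find(export_id)  # hypothetical tracking record
        path   = "/tmp/export-#{export_id}.csv"

        CSV.open(path, "w") do |csv|
          csv << %w[id name email]  # assumed columns
          Record.find_each(batch_size: 1000) { |r| csv << [r.id, r.name, r.email] }
        end

        CloudStore.upload(path)         # hypothetical uploader
        export.update!(status: "done")
      end
    end

    # The API endpoint just enqueues and returns immediately:
    #   ExportWorker.perform_async(export.id)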

I think it really depends on how you want to define 'performance' and what the goal of your API is. If you want to make sure no request to your API takes longer than 20 msec to respond, then adding pagination would be a reasonable approach, especially if the CSV generation is just an edge case and the API is really built for other services. The number of items per page would then be limited by the speed at which you can deliver them. Your service would not become any more performant (even less so), since it needs to call the API multiple times.

Creating an async call (maybe with a webhook as callback) would be worth adding to your API if you think that services dumping the whole record set is a valid use case.
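A sketch of what that async endpoint could look like, assuming the caller supplies a hypothetical callback_url that the worker notifies once the file is in the cloud:

    require "net/http"
    require "json"

    # POST /exports — returns 202 right away; the work happens in ExportWorker
    def create
      jid = ExportWorker.perform_async(params.require(:callback_url))
      render json: { job_id: jid }, status: :accepted
    end

    # At the end of ExportWorker#perform, ping the caller back:
    def notify(callback_url, file_url)
      Net::HTTP.post(URI(callback_url),
                     { status: "done", url: file_url }.to_json,
                     "Content-Type" => "application/json")
    end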

Having said that, I think strictly speaking it is the job of the API to be quick and responsive. So maybe try to figure out how caching can improve response times, so that paging through all the records stays reasonable. On the other hand, it is the job of the service to be mindful of the number of calls it makes to the API, so maybe store old records locally and only poll for updates instead of dumping the whole set of records each time.
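A sketch of that polling idea on the service side, assuming the API exposes a hypothetical updated_since filter and that apply_updates is your local merge step:

    require "time"

    STAMP = ".last_sync"

    last_sync = File.exist?(STAMP) ? Time.parse(File.read(STAMP)) : Time.at(0)

    # Only fetch records changed since the previous run instead of Record.all
    changed = Record.where(updated_since: last_sync.utc.iso8601).to_a
    apply_updates(changed)  # hypothetical local merge

    File.write(STAMP, Time.now.utc.iso8601)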
