
MarkLogic 8 - XQuery write large result set to a file efficiently

UPDATE: See MarkLogic 8 - Stream large result set to a file - JavaScript - Node.js Client API for someone's answer on how to do this in JavaScript. This question is specifically asking about XQuery.

I have a web application that consumes REST services hosted in Node.js.

Node simply proxies the request to XQuery, which then queries MarkLogic. These queries already have paging set up and, in the normal case, work fine to return a page of data to the UI.

I need an export feature such that when I put a URL parameter of export=all on a request, it no longer looks up a single page.

At that point it should get the whole result set, even if it's a million records, and save it to a file.

The actual request needs to return immediately saying, "We will notify you when your download is ready."

One suggestion was to use xdmp:spawn to call the XQuery in the background, which would save the results to a file. My actual HTTP request could then return immediately.

For the spawn piece, I think the idea is that I run my query with different options in order to get all results instead of one page. Then I would loop through the data and build a string variable to call xdmp:save with.
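A minimal sketch of that spawn-and-save idea, assuming a hypothetical module path, output location, and search expression (none of these names come from the original question):

```
(: request handler: kick off the export in the background and return at once :)
xquery version "1.0-ml";
xdmp:spawn("/export.xqy", (xs:QName("path"), "/tmp/export.xml")),
"We will notify you when your download is ready."
```

```
(: /export.xqy — hypothetical spawned module that runs the unpaged query
   and writes everything with one xdmp:save call; note this still builds
   the full result in memory, which is exactly the concern with very
   large exports :)
xquery version "1.0-ml";
declare variable $path as xs:string external;

let $results := cts:search(fn:collection(), cts:word-query("example"))
return xdmp:save($path, document { element results { $results } })
```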

Some questions: is this a good idea? Is there a better way? If I loop through the result set and it happens to be very large (gigabytes), it could cause memory issues.

Is there no way to stream the results directly to a file in XQuery?

Note: Another idea I had was to intercept the request at the proxy (Node) layer, do an xdmp:estimate to get the record count, and then loop through, querying each page and flushing it to disk. In this case I would need some way to return my request immediately yet keep processing in the background in Node, for which there seem to be some ideas here: http://www.pubnub.com/blog/node-background-jobs-async-processing-for-async-language/

One possible strategy would be to use a self-spawning task that, on each iteration, gets the next page of the results for a query.

Instead of saving the results directly to a file, however, you might want to consider using xdmp:http-post() to send each page to a server:

http://docs.marklogic.com/xdmp:http-post?q=xdmp:http-post&v=8.0&api=true

In particular, the server could be a Node.js server that appends each page, as it arrives, to a file or any other data sink.

That way, Node.js could handle the long-running asynchronous IO with minimal load on the database server.

When a self-spawned task hits the end of the query, it can again use an HTTP request to notify Node.js to close the file and report that the export is finished.
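Put together, one iteration of that self-spawning loop might look something like the sketch below. The page size, Node.js endpoint URL, and search expression are all placeholder assumptions; only xdmp:http-post and xdmp:spawn are the actual APIs being suggested.

```
(: /export-page.xqy — hypothetical self-spawning task: post one page to a
   Node.js data sink, then respawn itself for the next page :)
xquery version "1.0-ml";
declare variable $start as xs:integer external;
declare variable $page-size as xs:integer := 1000;
declare variable $endpoint as xs:string := "http://localhost:3000/append";

let $page := cts:search(fn:collection(), cts:word-query("example"))
             [$start to $start + $page-size - 1]
return
  if (fn:empty($page)) then
    (: end of the query: tell Node.js to close the file and report done :)
    xdmp:http-post($endpoint || "?done=true")
  else (
    (: send this page to the Node.js server, which appends it to the file :)
    xdmp:http-post($endpoint,
      <options xmlns="xdmp:http">
        <headers><content-type>application/xml</content-type></headers>
      </options>,
      <page>{$page}</page>),
    (: spawn the next iteration so each task stays small :)
    xdmp:spawn("/export-page.xqy",
      (xs:QName("start"), $start + $page-size))
  )
```

Because each spawned task handles only one page, memory use on the MarkLogic side stays bounded regardless of the total result size.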

Hoping that helps,
