
Fetching Large File For Processing Using Node.js

I have a Node.js application that needs to fetch this 6GB zip file from Census.gov and then process its content. However, when fetching the file using the Node.js https API, the download stops at a different file size each time. Sometimes it fails at 2GB or 1.8GB, and so on. I am never able to fully download the file using the application, but it downloads fully in the browser. Is there any way to download the full file? I cannot start processing the zip until it is fully downloaded, so my processing code waits for the download to complete before executing.

const fs = require("fs");
const https = require("https");

// fileName and url are assumed to be defined earlier
const file = fs.createWriteStream(fileName);
https.get(url).on("response", function (res) {
  let downloaded = 0;
  res
    .on("data", function (chunk) {
      file.write(chunk);
      downloaded += chunk.length;
      process.stdout.write(`Downloaded ${(downloaded / 1000000).toFixed(2)} MB of ${fileName}\r`);
    })
    .on("end", function () {
      file.end();
      console.log(`${fileName} downloaded successfully.`);
    });
});

You have no flow control on the file.write(chunk). You need to pay attention to the return value from file.write(chunk), and when it returns false, you have to wait for the drain event before writing more. Otherwise, you can overflow the buffer on the writestream.

When you lack flow control and write large amounts of data faster than the disk can keep up, you will probably blow up your memory usage, because the stream has to accumulate more data in its buffer than is desirable.

Since your data is coming from a readable stream, when you get false back from file.write(chunk), you will also have to pause the incoming read stream so it doesn't keep spewing data events at you while you're waiting for the drain event on the writestream. When you get the drain event, you can then resume the readstream.

FYI, if you don't need the progress info, you can let pipeline() do all the work (including the flow control) for you. You don't have to write that code yourself. You may even still be able to gather the progress info by just watching the writestream activity when using pipeline(). A sketch of that approach follows.
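For reference, here is a minimal sketch of that pipeline() approach (assuming Node 15+ for the promise-based stream/promises module; the download() wrapper and its progress timer are illustrative, not part of the original answer):

const fs = require("fs");
const https = require("https");
const { pipeline } = require("stream/promises");

function download(url, fileName) {
    return new Promise((resolve, reject) => {
        https.get(url, (res) => {
            const file = fs.createWriteStream(fileName);
            // optional progress: poll the writestream's bytesWritten counter
            const timer = setInterval(() => {
                process.stdout.write(`Downloaded ${(file.bytesWritten / 1000000).toFixed(2)} MB of ${fileName}\r`);
            }, 1000);
            // pipeline() handles all the backpressure/flow control itself
            pipeline(res, file)
                .then(resolve, reject)
                .finally(() => clearInterval(timer));
        }).on("error", reject);
    });
}

A side benefit of pipeline() is that it destroys both streams and rejects the promise if either side errors, which the hand-rolled version below does not do.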

Here's one way to implement the flow control yourself, though I'd recommend you use the pipeline() function from the stream module and let it do all this for you:

const fs = require("fs");
const https = require("https");

const file = fs.createWriteStream(fileName);
https.get(url).on("response", function (res) {
    let downloaded = 0;
    res.on("data", function (chunk) {
        // write() returns false when the writestream's buffer is full
        let readyForMore = file.write(chunk);
        if (!readyForMore) {
            // pause the readstream until the drain event arrives
            res.pause();
            file.once("drain", () => {
                res.resume();
            });
        }
        downloaded += chunk.length;
        process.stdout.write(`Downloaded ${(downloaded / 1000000).toFixed(2)} MB of ${fileName}\r`);
    }).on("end", function () {
        file.end();
        console.log(`${fileName} downloaded successfully.`);
    });
});
