
How to stream read an S3 JSON file to PostgreSQL using async/await in a Node.js 12 Lambda function?


I didn't realize how perilous such a simple task could be. We're trying to stream-read a JSON file stored in S3--I think we have that part working. Our .on('data') callback is getting called, but Node picks and chooses what bits it wants to run--seemingly at random.

We set up a stream reader.

stream.on('data', async x => { 
  await saveToDb(x);  // This doesn't await.  It processes saveToDb up until it awaits.
});

Sometimes the db call makes it to the db--but most of the time it doesn't. I've come to the conclusion that EventEmitter has problems with async/await event handlers. It appears as though it will play along with your async method so long as your code is synchronous. But, at the point you await, it randomly decides whether to actually follow through with doing it or not.
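
Here is a minimal sketch of why that happens (illustrative only; the emitter and the fake saveToDb below are stand-ins, not the question's code): emit() calls every listener synchronously and throws away whatever the listener returns, so the promise from an async handler is never awaited, and the next 'data' event fires while the previous handler is still parked at its await.

const { EventEmitter } = require('events');

const emitter = new EventEmitter();

emitter.on('data', async item => {
    console.log('start', item);
    await new Promise(resolve => setTimeout(resolve, 100)); // stand-in for saveToDb(item)
    console.log('done', item); // runs long after every 'data' event has fired
});

for (let i = 0; i < 3; i++) emitter.emit('data', i); // each emit() returns immediately
console.log('all events emitted'); // prints before any 'done' line

Nothing here is actually random: the async handlers do eventually resolve, but nothing upstream is waiting for them, so the work races against whatever ends the process.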

It streams the various chunks and we can console.log them out and see the data. But as soon as we try to fire off an await/async call, we stop seeing reliable messages.

I'm running this in AWS Lambda, and I've been told there are special considerations because apparently Lambda halts processing in some cases?

I tried wrapping the await call in an IIFE, but that didn't work, either.

What am I missing? Is there no way of telling JavaScript--"Okay, I need you to run this async task synchronously. I mean it--don't go and fire off any more event notifications, either. Just sit here and wait."?

TL;DR:

  • Use Async Iterators to pull from the end of your stream pipeline (see the sketch right after this list)!
  • Don't use async functions in any of your stream code!
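
A minimal, self-contained sketch of that pattern (the sample data and saveToDb below are made up for illustration; Readable.from needs Node 12.3 or later): modern Readable and Transform streams are async iterable, so a for await...of loop pulls one chunk at a time and the stream cannot outrun your awaits.

const { Readable } = require('stream');

const source = Readable.from([{ id: 1 }, { id: 2 }, { id: 3 }]); // object-mode stream

async function saveToDb(item) {
    await new Promise(resolve => setTimeout(resolve, 100)); // pretend db call
    console.log('saved', item);
}

(async () => {
    for await (const item of source) {
        await saveToDb(item); // nothing more is read until this resolves
    }
    console.log('all items saved, in order');
})();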

Details:

The secret to life's mystery regarding async/await and streams appears to be wrapped up in Async Iterators!

In short, I piped some streams together, and at the very end I created an async iterator to pull stuff out of the end so that I could asynchronously call the db. The only thing ChunkStream does for me is queue up to 1,000 items to call the db with, instead of calling it for each item. I'm new to queues, so there may already be a better way of doing that.

// ...
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const JSONbigint = require('json-bigint');
JSON.parse = JSONbigint.parse; // Let there be proper bigint handling!
JSON.stringify = JSONbigint.stringify;
const stream = require('stream');
const JSONStream = require('JSONStream');

exports.handler = async (event, context) => {
    // ...
    let bucket, key;
    try {
        bucket = event.Records[0].s3.bucket.name;
        key = event.Records[0].s3.object.key;
        console.log(`Fetching S3 file: Bucket: ${bucket}, Key: ${key}`);
        const parser = JSONStream.parse('*'); // Converts file to JSON objects
        let chunkStream = new ChunkStream(1000); // Give the db a chunk of work instead of one item at a time
        let endStream = s3.getObject({ Bucket: bucket, Key: key }).createReadStream().pipe(parser).pipe(chunkStream);
        
        let totalProcessed = 0;
        async function processChunk(chunk) {
            let chunkString = JSON.stringify(chunk);
            console.log(`Upserting ${chunk.length} items (starting at index ${totalProcessed}) to the db.`);
            await updateDb(chunkString, pool, 1000); // updateDb and pool are part of missing code
            totalProcessed += chunk.length;
        }
        
        // Async iterator
        for await (const batch of endStream) {
            // console.log(`Processing batch (${batch.length})`, batch);
            await processChunk(batch);
        }
    } catch (ex) {
        context.fail("stream S3 file failed");
        throw ex;
    }
};

class ChunkStream extends stream.Transform {
    constructor(maxItems, options = {}) {
        options.objectMode = true;
        super(options);
        this.maxItems = maxItems;
        this.batch = [];
    }
    _transform(item, enc, cb) {
        this.batch.push(item);
        if (this.batch.length >= this.maxItems) {
            // console.log(`ChunkStream: Chunk ready (${this.batch.length} items)`);
            this.push(this.batch);
            // console.log('_transform - Restarting the batch');
            this.batch = [];
        }
        cb();
    }
    _flush(cb) {
        // console.log(`ChunkStream: Flushing stream (${this.batch.length} items)`);
        if (this.batch.length > 0) {
            this.push(this.batch);
            this.batch = [];
        }
        cb();
    }
}
