
Publishing more than actual messages in Google Pub/Sub using Node.js and csv-parse

Using Node.js, Google Pub/Sub, and csv-parse.

Use case - I have a large CSV file to process and import into my DB. Each row goes through a few third-party API calls that take about 1 second to process. So the process flow is below -

  1. User uploads the file.
  2. The Node server uploads the file to storage and sends a message to Pub/Sub topic No. 1 (a minimal sketch of this step follows the list).
  3. My listener listens to that topic and starts processing the messages: it downloads the file, breaks it into individual rows, and publishes each row to another Pub/Sub topic for further processing.
  4. In the end I process these smaller row messages in parallel and achieve faster processing.
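
For context, a minimal sketch of step 2 might look like the following, assuming @google-cloud/storage for the upload. The bucket name, topic name, environment variables, and message shape are placeholders for illustration, not the actual implementation:

const { Storage } = require('@google-cloud/storage');
const { PubSub } = require('@google-cloud/pubsub');

const storage = new Storage();
const pubsub = new PubSub({ projectId: process.env.PROJECT_ID });

// Illustrative only: upload the CSV to a bucket, then notify the listener
// via the first topic so it knows a new file is ready for processing.
async function uploadAndNotify(localPath, importId) {
  const bucketName = process.env.IMPORT_BUCKET; // assumed env var
  await storage.bucket(bucketName).upload(localPath);

  const notification = { importId, bucket: bucketName, file: localPath };
  return pubsub
    .topic(process.env.IMPORT_FILE_UPLOADED_TOPIC) // assumed env var
    .publish(Buffer.from(JSON.stringify(notification)));
}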

Problem - As soon as my listener downloads the file, it sends X row messages to the next topic, Pub/Sub No. 2, but when I check its subscription it shows more than X messages. For example, I upload a 6,000-record CSV and the subscriber shows 40K-50K messages.

Package.json

"dependencies": {
    "@google-cloud/pubsub": "1.5.0",
    "axios": "^0.19.2",
    "csv-parse": "^4.8.5",
    "dotenv": "^8.2.0",
    "google-gax": "1.14.1",
    "googleapis": "47.0.0",
    "moment": "^2.24.0",
    "path": "^0.12.7",
    "pg": "^7.18.1",
    "winston": "^3.0.0"
  }

Publisher Code

async processFile(filename) {
    let cnt = 0;
    let index = null;
    let rowCounter = 0;
    const handler = (resolve, reject) => {
      const parser = CsvParser({
          delimiter: ',',
        })
        .on('readable', () => {
          let row;
          let hello = 0;
          let busy = false;
          this.meta.totalRows = (parser.info.records - 1);
          while (row = parser.read()) {
            if (cnt++ === 0) {
              index = row;
              continue;
            }
            let messageObject = {
              customFieldsMap: this.customFieldsMap,
              importAttributes: this.jc.attrs,
              importColumnData: row,
              rowCount: cnt,
              importColumnList: index,
              authToken: this.token
            }
            let topicPublishResult = PubSubPublish.publishToTopic(process.env.IMPORT_CSV_ROW_PUBLISHING_TOPIC, messageObject);
            topicPublishResult.then((response) => {
              rowCounter += 1;
              const messageInfo = "Row " + rowCounter + " published" +
                " | MessageId = " + response +
                " | importId = " + this.data.importId +
                " | fileId = " + this.data.fileId +
                " | orgId = " + this.data.orgId;
              console.info(messageInfo);
            })
          }
        })
        .on('end', () => {
          console.log("File consumed!");
          resolve(this.setStatus("queued"))
        })
        .on('error', reject);
      fs.createReadStream(filename).pipe(parser);
    };
    await new Promise(handler);
  }

And the publish module code

const {
  PubSub
} = require('@google-cloud/pubsub');

const pubsub = new PubSub({
  projectId: process.env.PROJECT_ID
});
module.exports = {
  publishToTopic: function(topicName, data) {
    return pubsub.topic(topicName, {
      batching: {
        maxMessages: 500,
        maxMilliseconds: 5000,
      }
    }).publish(Buffer.from(JSON.stringify(data)));
  },
};

This works without any issues for files of 10, 100, 200, or 2,000 records, but gives trouble with larger files, such as 6K records. After I publish 6K records there is an UnhandledPromiseRejection error for all 6K records, e.g.

(node:49994) UnhandledPromiseRejectionWarning: Error: Retry total timeout exceeded before any response was received
    at repeat (/Users/tarungupta/office/import-processor/node_modules/google-gax/build/src/normalCalls/retries.js:65:31)
    at Timeout._onTimeout (/Users/tarungupta/office/import-processor/node_modules/google-gax/build/src/normalCalls/retries.js:100:25)
    at listOnTimeout (internal/timers.js:531:17)
    at processTimers (internal/timers.js:475:7)
(node:49994) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 6000)

Any help is appreciated!

It's possible that your publisher is getting overwhelmed when you have 6,000 messages to publish. The reason is that you create a new instance of the publisher for each message in your publishToTopic method. Consequently, you don't get to take advantage of any batching, and you wait up to 5 seconds to send every message. That's a lot of overhead per message. It could mean that callbacks are not getting processed in a timely fashion, resulting in timeouts and attempts to resend. You want to create your pubsub.topic object a single time and then reuse it across publish calls.
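
A minimal sketch of that change, keeping the same publishToTopic(topicName, data) interface but creating each topic publisher only once (this illustrates the suggestion; it is not the poster's actual code):

const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub({ projectId: process.env.PROJECT_ID });

// Cache one publisher per topic so the batching settings actually apply
// across publish calls, instead of creating a new topic object per message.
const topics = {};
function getTopic(topicName) {
  if (!topics[topicName]) {
    topics[topicName] = pubsub.topic(topicName, {
      batching: {
        maxMessages: 500,
        maxMilliseconds: 5000,
      },
    });
  }
  return topics[topicName];
}

module.exports = {
  publishToTopic: function(topicName, data) {
    return getTopic(topicName).publish(Buffer.from(JSON.stringify(data)));
  },
};

It also helps to keep the promises returned by publishToTopic (for example, push them into an array and await Promise.all or Promise.allSettled, or attach a .catch) so that a failed publish is handled instead of surfacing as an UnhandledPromiseRejectionWarning.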
