Publishing more than actual messages in Google Pub/Sub using Node.js and csv-parse
Using Node.js, Google Pub/Sub, csv-parse.
Use case - I have a large CSV file to process and import into my DB. Processing each row involves a few third-party API calls that take about 1 second per row. The process flow is below -
Problem - As soon as my listener downloads the file, it sends x row messages to the next topic, PubSubNo2, but when I check that topic's subscription it shows more than x messages. For example, I upload a 6,000-record CSV, and the subscriber shows 40K-50K messages.
package.json
"dependencies": {
"@google-cloud/pubsub": "1.5.0",
"axios": "^0.19.2",
"csv-parse": "^4.8.5",
"dotenv": "^8.2.0",
"google-gax": "1.14.1",
"googleapis": "47.0.0",
"moment": "^2.24.0",
"path": "^0.12.7",
"pg": "^7.18.1",
"winston": "^3.0.0"
}
Publisher Code
async processFile(filename) {
  let cnt = 0;
  let index = null;
  let rowCounter = 0;
  const handler = (resolve, reject) => {
    const parser = CsvParser({
      delimiter: ',',
    })
      .on('readable', () => {
        let row;
        let hello = 0;
        let busy = false;
        this.meta.totalRows = (parser.info.records - 1);
        while (row = parser.read()) {
          if (cnt++ === 0) {
            index = row;
            continue;
          }
          let messageObject = {
            customFieldsMap: this.customFieldsMap,
            importAttributes: this.jc.attrs,
            importColumnData: row,
            rowCount: cnt,
            importColumnList: index,
            authToken: this.token
          };
          let topicPublishResult = PubSubPublish.publishToTopic(process.env.IMPORT_CSV_ROW_PUBLISHING_TOPIC, messageObject);
          topicPublishResult.then((response) => {
            rowCounter += 1;
            const messageInfo = "Row " + rowCounter + " published" +
              " | MessageId = " + response +
              " | importId = " + this.data.importId +
              " | fileId = " + this.data.fileId +
              " | orgId = " + this.data.orgId;
            console.info(messageInfo);
          });
        }
      })
      .on('end', () => {
        console.log("File consumed!");
        resolve(this.setStatus("queued"));
      })
      .on('error', reject);
    fs.createReadStream(filename).pipe(parser);
  };
  await new Promise(handler);
}
And the publish module code
const { PubSub } = require('@google-cloud/pubsub');
const pubsub = new PubSub({
  projectId: process.env.PROJECT_ID
});
module.exports = {
  publishToTopic: function(topicName, data) {
    return pubsub.topic(topicName, {
      batching: {
        maxMessages: 500,
        maxMilliseconds: 5000,
      }
    }).publish(Buffer.from(JSON.stringify(data)));
  },
};
This works without any issues for files of 10, 100, 200, or 2,000 records, but it runs into trouble with larger files, e.g. 6K records. After I publish 6K records, there is an UnhandledPromiseRejection error for all 6K records, e.g.:
(node:49994) UnhandledPromiseRejectionWarning: Error: Retry total timeout exceeded before any response was received
at repeat (/Users/tarungupta/office/import-processor/node_modules/google-gax/build/src/normalCalls/retries.js:65:31)
at Timeout._onTimeout (/Users/tarungupta/office/import-processor/node_modules/google-gax/build/src/normalCalls/retries.js:100:25)
at listOnTimeout (internal/timers.js:531:17)
at processTimers (internal/timers.js:475:7)
(node:49994) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 6000)
Any help is appreciated!
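One note on the log output itself: the rejections surface as UnhandledPromiseRejectionWarning because the promise returned by publishToTopic only ever gets a .then in processFile, never a .catch, so every failed publish is reported as unhandled. A minimal sketch of the same publish call with a rejection handler attached (what you do in the handler is up to you; the logging here is illustrative):

let topicPublishResult = PubSubPublish.publishToTopic(process.env.IMPORT_CSV_ROW_PUBLISHING_TOPIC, messageObject);
topicPublishResult
  .then((response) => {
    rowCounter += 1;
    console.info("Row " + rowCounter + " published | MessageId = " + response);
  })
  .catch((err) => {
    // Without this handler, each failed publish is reported as an
    // unhandled promise rejection (one warning per message).
    console.error("Publish failed for row " + cnt + ": " + err.message);
  });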
It's possible that your publisher is getting overwhelmed when you have 6,000 messages to publish. The reason is that you create a new instance of the publisher for each message in your publishToTopic method. Consequently, you are not getting to take advantage of any batching, and you are waiting 5 seconds to send every message. That's a lot of overhead for each message. It could mean that callbacks are not getting processed in a timely fashion, resulting in timeouts and attempts to resend. You want to create your pubsub.topic object a single time and then reuse it across publish calls.
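Following that advice, here is a minimal sketch of the publish module with the topic object created once and then reused across publish calls (same topic options as in the question; the per-topic cache is one illustrative way to structure it, not the only one):

const { PubSub } = require('@google-cloud/pubsub');
const pubsub = new PubSub({
  projectId: process.env.PROJECT_ID
});

// Create each topic publisher once and cache it, so every publish
// call shares the same batching queue instead of starting a new one.
const topics = {};
function getTopic(topicName) {
  if (!topics[topicName]) {
    topics[topicName] = pubsub.topic(topicName, {
      batching: {
        maxMessages: 500,
        maxMilliseconds: 5000,
      }
    });
  }
  return topics[topicName];
}

module.exports = {
  publishToTopic: function(topicName, data) {
    return getTopic(topicName).publish(Buffer.from(JSON.stringify(data)));
  },
};

With a single topic object, 6,000 publishes are coalesced into batches of up to 500 messages (or whatever accumulates within 5 seconds), rather than 6,000 single-message batches that each sit out maxMilliseconds on their own.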