csv-parse error handling in pipe
As part of an application I am building, I am reading and manipulating large (approximately 5.5GB, 8 million rows) csv files using csv-parse. I have the process running relatively smoothly, but am stuck on one item - catching the errors thrown by an inconsistent number of columns.

I'm using the pipe function because it works well with the rest of the application, but my question is, how can I redirect errors thrown by the parser to a log and allow the process to continue?
I recognize that I could use the relax_column_count option to skip the records which have an inconsistent number of columns, and that option is almost sufficient. The catch is that for data quality assessment purposes I need to log those records so I can go back and review what caused the incorrect number of columns (the process is a feed with many potential fault points).
As a side note, I know the easiest way to solve this would be to clean up the data upstream of this process, but unfortunately I don't control the data source.
With the sample set below, for example, I get the following error:

events.js:141
      throw er; // Unhandled 'error' event

Error: Number of columns on line (line number) does not match header

Sample data (not actually my data, but demonstrating the same problem):
year, month, value1, value2
2012, 10, A, B
2012, 11, B, C,
2012, 11, C, D,
2013, 11, D, E,
2013, 11, E, F,
2013, 11, F,
2013, 11, G, G,
2013, 1, H, H,
2013, 11, I, I,
2013, 12, J, J,
2014, 11, K, K,
2014, 4, L, L,
2014, 11, M, M,
2014, 5, N,
2014, 11, O, N,
2014, 6, P, O,
2015, 11, Q, P,
2015, 11, R, Q,
2015, 11, S, R,
2015, 11, T, S,
Code:
const fs = require('fs');
const parse = require('csv-parse');
const stringify = require('csv-stringify');
const transform = require('stream-transform');

const paths = {
  input: './sample.csv',
  output: './output.csv',
  error: './errors.csv',
};

var input = fs.createReadStream(paths.input);
var output = fs.createWriteStream(paths.output);
var error = fs.createWriteStream(paths.error);

var stringifier = stringify({
  header: true,
  quotedString: true,
});
var parser = parse({
  relax: true,
  delimiter: ',',
  columns: true,
  //relax_column_count: true,
});
var transformer = transform((record, callback) => {
  callback(null, record);
}, {parallel: 10});

input.pipe(parser).pipe(transformer).pipe(stringifier).pipe(output);
Thoughts?
I developed a solution to this problem. It doesn't use the pipe API, but instead uses the CSV package's callback API. It's less elegant than I would have liked it to be, but it's functional and has the benefit of explicit error handling, which doesn't cause the process to come to a grinding halt on an inconsistent number of columns.
The process reads the file line by line, parses the line against a list of expected fields in the settings object (settings.mapping), and then transforms, stringifies, and writes the resulting line of output to the new csv.
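Stripped of the csv libraries, the per-line decision boils down to something like this (a simplified, dependency-free sketch with a naive split(','); the real code below uses csv-parse, which also handles quoting and escaping):

```javascript
// Compare each line's column count against the expected header width,
// routing bad rows to an error log while good rows continue on.
function processLines(lines, expectedColumns) {
  const good = [];
  const bad = [];
  lines.forEach((line, i) => {
    const fields = line.split(','); // naive; real parser handles quoted commas
    if (fields.length === expectedColumns) {
      good.push(fields);
    } else {
      bad.push({ row: i + 1, line }); // keep row number and raw line for review
    }
  });
  return { good, bad };
}

const result = processLines(
  ['2012, 10, A, B', '2013, 11, F', '2013, 11, G, G'],
  4
);
console.log(result.bad); // [ { row: 2, line: '2013, 11, F' } ]
```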
I set it up to log the errors caused by a number of columns inconsistent with the header, along with some extra data (the datetime of execution, the row number, and the full line as text) for diagnostic information. I didn't set up logging of the other error types, since they are all downstream of the csv structural errors, but you could modify the code to write those errors as well. (You could also probably write them to JSON or a MySQL database, but one thing at a time.)
The good news is that there doesn't appear to be a huge performance hit from using this approach over a straight pipe. I haven't done any formal performance testing, but on a 60MB file the performance is roughly the same between the two methods (assuming the file has no inconsistent rows). A definite next step is to look into bundling the writes to disk to reduce I/O.
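For the bundling idea, something along these lines should work (a rough sketch; the batch size is an arbitrary placeholder, and the fake stream object just stands in for the fs write stream):

```javascript
// Buffer formatted lines in memory and flush them to the underlying
// stream in batches, instead of one write() call per record.
function makeBatchedWriter(stream, batchSize) {
  let buffer = [];
  return {
    write(line) {
      buffer.push(line);
      if (buffer.length >= batchSize) this.flush();
    },
    flush() {
      if (buffer.length > 0) {
        stream.write(buffer.join(''));
        buffer = [];
      }
    },
  };
}

// Usage: wrap the output stream, then flush the remainder on 'end'
const written = [];
const fakeStream = { write: (chunk) => written.push(chunk) }; // stand-in for fs stream
const writer = makeBatchedWriter(fakeStream, 3);
['a\n', 'b\n', 'c\n', 'd\n'].forEach((l) => writer.write(l));
writer.flush();
console.log(written); // [ 'a\nb\nc\n', 'd\n' ]
```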
I'm still very interested in whether there's a better way to do this, so if you've got an idea, definitely post an answer! In the meantime, I figured I'd post this working answer in case it's useful for others struggling with the same types of inconsistently formatted sources.

Credit where credit is due, specifically to two questions/answers:
Sample code:
'use strict'

// Dependencies
const es = require('event-stream');
const fs = require('fs');
const parse = require('csv-parse');
const stringify = require('csv-stringify');
const transform = require('stream-transform');

// Reference objects
const paths = {
  input: 'path to input.csv',
  output: 'path to output.csv',
  error: 'path to error output.csv',
};

const settings = {
  mapping: {
    // Each field is an object with the field name as the key
    // and can have additional properties for use in the transform
    // component of this process
    // Example
    'year': {
      import: true,
    }
  }
};

const metadata = {
  records: 0,
  error: 0
};

// Set up streams
var input = fs.createReadStream(paths.input);
var errors = fs.createWriteStream(paths.error, {flags: 'ax'});
var output = fs.createWriteStream(paths.output, {flags: 'ax'});

// Begin process (can be refactored into a function, but simplified here)
input
  .pipe(es.split()) // split based on row, assumes \n row endings
  .pipe(es.mapSync(line => { // synchronously process each line
    // Remove headers, specified through settings
    if (metadata.records === 0) return metadata.records++;
    var id = metadata.records;
    // Parse csv by row
    parse(line, {
      relax: true,
      delimiter: ',',
      columns: Object.keys(settings.mapping),
    }, (error, record) => {
      // Write inconsistent column error
      if (error) {
        metadata.error++;
        errors.write(
          new Date() + ', Inconsistent Columns, ' +
          id + ', `' +
          line + '`\n'
        );
      }
      // Apply transform / reduce
      transform(record, (record) => {
        // Do stuff to record
        return record;
      }, (error, record) => {
        // Throw transform errors
        if (error) {
          throw error;
        }
        // Stringify results and write to new csv
        stringify(record, {
          header: false,
          quotedString: true,
        }, (error, record) => {
          // Log stringify errors
          if (error) {
            console.log(error);
          }
          // Write record to new csv file
          output.write(record);
        });
      });
    });
    // Increment record count
    metadata.records++;
  }))
  .on('end', () => {
    metadata.records--;
    console.log(metadata);
  });