csv-parse error handling in pipe

As part of an application I am building, I am reading and manipulating large (approximately 5.5 GB, 8 million rows) CSV files using csv-parse. I have the process running relatively smoothly, but am stuck on one item: catching the errors thrown by an inconsistent number of columns.

I'm using the pipe function because it works well with the rest of the application, but my question is, how can I redirect errors thrown by the parser to a log and allow the process to continue?

I recognize that I could use the relax_column_count option to skip the records which have an inconsistent number of columns, and that option is almost sufficient. The catch is that for data quality assessment purposes I need to log those records so I can go back and review what caused the incorrect number of columns (the process is a feed with many potential fault points).
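For reference, more recent releases of csv-parse document a skip_records_with_error option together with a 'skip' event, which is meant to cover exactly this case: bad records are dropped from the output but reported so they can be logged. I haven't verified this against the csv-parse version used below, so the following is only a sketch based on the current documentation, not a tested fix:

const fs = require('fs');
const { parse } = require('csv-parse'); // v5-style export; older versions export the function directly

const errorLog = fs.createWriteStream('./errors.log', { flags: 'a' });

const parser = parse({
    delimiter: ',',
    columns: true,
    skip_records_with_error: true, // keep going past records that fail to parse
});

// Each skipped record triggers a 'skip' event carrying the parse error
parser.on('skip', (err) => {
    errorLog.write(new Date() + ', ' + err.message + '\n');
});

fs.createReadStream('./sample.csv')
    .pipe(parser)
    .on('data', (record) => {
        // valid records keep flowing here
    });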

As a side note, I know the easiest way to solve this would be to clean up the data upstream of this process, but unfortunately I don't control the data source.


With the example data set below, I get the following error:

events.js:141
      throw er; // Unhandled 'error' event
Error: Number of columns on line (line number) does not match header


Sample data (not actually my data, but demonstrating the same problem):

year, month, value1, value2
2012, 10, A, B
2012, 11, B, C,
2012, 11, C, D,
2013, 11, D, E,
2013, 11, E, F,
2013, 11, F, 
2013, 11, G, G,
2013, 1, H, H,
2013, 11, I, I,
2013, 12, J, J,
2014, 11, K, K,
2014, 4, L, L,
2014, 11, M, M,
2014, 5, N, 
2014, 11, O, N,
2014, 6, P, O,
2015, 11, Q, P,
2015, 11, R, Q,
2015, 11, S, R,
2015, 11, T, S, 

Code:

const fs = require('fs');
const parse = require('csv-parse');
const stringify = require('csv-stringify');
const transform = require('stream-transform');

const paths = {
    input: './sample.csv',
    output: './output.csv',
    error: './errors.csv',
}

var input  = fs.createReadStream(paths.input);
var output = fs.createWriteStream(paths.output);
var error  = fs.createWriteStream(paths.error);

var stringifier = stringify({
    header: true,
    quotedString: true,
});
var parser = parse({
    relax: true,
    delimiter: ',', 
    columns: true, 
    //relax_column_count: true,
})
var transformer = transform((record, callback) => {
    callback(null, record);
}, {parallel: 10});

input.pipe(parser).pipe(transformer).pipe(stringifier).pipe(output);

Thoughts?

I developed a solution to this problem. It doesn't use the pipe API, but instead uses the CSV package's callback API. It's less elegant than I would have liked it to be, but it's functional and has the benefit of explicit error handling, which doesn't cause the process to come to a grinding halt on an inconsistent number of columns.

The process reads the file in line by line, parses each line against a list of expected fields in the settings object (settings.mapping), and then transforms, stringifies, and writes the resulting line of output to the new CSV.

I set it up to log the errors caused by a number of columns inconsistent with the header, along with some extra data for diagnostics (the datetime of execution, the row number, and the full line as text). I didn't set up logging of the other error types, since they are all downstream of the CSV structural errors, but you could modify the code to write those errors as well. (You could also probably write them to JSON or a MySQL database, but one thing at a time.)

The good news is that there doesn't appear to be a huge performance hit from using this approach over a straight pipe. I haven't done any formal performance testing, but on a 60 MB file the performance is roughly the same between the two methods (assuming the file has no inconsistent rows). A definite next step is to look into batching the writes to disk to reduce I/O.
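As a rough sketch of what that batching could look like (not part of the working code below; the threshold and helper names are made up for illustration), the error lines could be buffered in memory and flushed to the errors stream every N records:

const errorBuffer = [];
const FLUSH_EVERY = 1000; // arbitrary batch size, purely illustrative

function logError(line) {
    // `line` is the already-formatted error string, including its trailing newline
    errorBuffer.push(line);
    if (errorBuffer.length >= FLUSH_EVERY) flushErrors();
}

function flushErrors() {
    if (errorBuffer.length === 0) return;
    errors.write(errorBuffer.join('')); // `errors` is the write stream from the sample code below
    errorBuffer.length = 0;
}

The errors.write(...) call in the parse callback would become logError(...), and flushErrors() would need to run one final time in the 'end' handler so the tail of the buffer isn't lost.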

I'm still very interested in whether there's a better way to do this, so if you've got an idea, definitely post an answer! In the meantime, I figured I'd post this working answer in case it's useful for others struggling with the same types of inconsistently formatted sources.


Credit where credit is due, specifically to two questions/answers:

  • parsing huge logfiles in Node.js - read in line-by-line
    • This answer adapts some of the core code from the answers there, which split the file so that it can be read line by line. That prevents the csv-parse component from shutting down at a failed line (at the expense of the code overhead of splitting the file further upstream). I actually really recommend the use of iconv-lite as it's done in that post, but it wasn't germane to the minimal reproducible example, so I removed it here.
  • Error handling with node.js streams
    • This was generally helpful for better understanding the potential and limitations of pipes. It looks like there's theoretically a way to put what essentially amounts to a pipe splitter onto the outbound pipe from the parser, but given my current time constraints and the challenges associated with an async process that would be fairly unpredictable in terms of stream termination, I used the callback API instead (see the sketch after this list).
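For completeness, here is roughly what per-stage error handling on the original pipe chain would look like, reusing the stream variables from the question's code (a sketch only): .pipe() does not forward 'error' events, so each stage needs its own handler, and even then the parser stops emitting records after its first error, which is what pushed me toward the callback API.

input
    .on('error', (err) => console.error('read error:', err))
    .pipe(parser)
    .on('error', (err) => error.write(new Date() + ', ' + err.message + '\n'))
    .pipe(transformer)
    .on('error', (err) => console.error('transform error:', err))
    .pipe(stringifier)
    .on('error', (err) => console.error('stringify error:', err))
    .pipe(output)
    .on('error', (err) => console.error('write error:', err));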

Sample code:

'use strict'
// Dependencies
const es        = require('event-stream');
const fs        = require('fs');
const parse     = require('csv-parse');
const stringify = require('csv-stringify');
const transform = require('stream-transform');

// Reference objects
const paths = {
    input: 'path to input.csv',
    output: 'path to output.csv',
    error: 'path to error output.csv',
}
const settings = {
    mapping: {
        // Each field is an object with the field name as the key
        // and can have additional properties for use in the transform
        // component of this process
        // Example
        'year': {
            import: true,
        }
    }
}

const metadata = {
    records: 0,
    error: 0
}

// Set up streams
var input  = fs.createReadStream(paths.input);
var errors = fs.createWriteStream(paths.error,  {flags: 'ax'});
var output = fs.createWriteStream(paths.output, {flags: 'ax'});

// Begin process (can be refactored into a function, but simplified here)
input
  .pipe(es.split()) // split based on row, assumes \n row endings
  .pipe(es.mapSync(line => { // synchronously process each line

    // Skip the header row; column names come from settings.mapping instead
    if (metadata.records === 0) return metadata.records++;
    var id = metadata.records;

    // Parse csv by row
    parse(line, {
        relax: true,
        delimiter: ',',
        columns: Object.keys(settings.mapping),
    }, (error, record) => {

        // Log the inconsistent-column error and skip this line
        if (error) {
            metadata.error++;
            errors.write(
                new Date() + ', Inconsistent Columns, ' +
                id + ', `' +
                line + '`\n'
            );
            return;
        }

        // Apply transform / reduce
        transform(record, (record) => {
            // Do stuff to record
            return record;
        }, (error, record) => {

            // Throw transform errors
            if (error) {
                throw error;
            }

            // Stringify results and write to the new csv
            stringify(record, {
                header: false,
                quotedString: true,
            }, (error, record) => {

                // Log stringify errors
                if (error) {
                    console.log(error);
                }

                // Write record to the new csv file
                output.write(record);
            });
        });
    });

    // Increment record count
    metadata.records++;

  }))
  .on('end', () => {
    metadata.records--; // correct for the header row counted above
    console.log(metadata);
  })
