
csv-parse error handling in pipe

As part of an application I am building, I am reading and manipulating large (approximately 5.5GB, 8 million rows) csv files using csv-parse. I have the process running relatively smoothly, but am stuck on one item - catching the errors thrown by an inconsistent number of columns.

I'm using the pipe function because it works well with the rest of the application, but my question is, how can I redirect errors thrown by the parser to a log and allow the process to continue?

I recognize that I could use the relax_column_count option to skip the records which have an inconsistent number of columns, and that option is almost sufficient. The catch is that for data quality assessment purposes I need to log those records so I can go back and review what caused the incorrect number of columns (the process is a feed with many potential fault points).

As a side note, I know the easiest way to solve this would be to clean up the data upstream of this process, but unfortunately I don't control the data source.


With the sample data below, I get the following error:

events.js:141
throw er; // Unhandled 'error' event
Error: Number of columns on line (line number) does not match header
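
For what it's worth, I can stop the crash by attaching an 'error' listener to the parser (minimal sketch below, reusing the error write stream from my code further down), but as far as I can tell the parser still stops emitting records at the bad line, so that alone doesn't get me to "log and continue":

// Prevents the unhandled 'error' crash, but (as far as I can tell) the parser
// still stops producing records once it hits the inconsistent row.
parser.on('error', (err) => {
    error.write(new Date() + ', ' + err.message + '\n');
});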


Sample data (not actually my data, but demonstrating the same problem):

year, month, value1, value2
2012, 10, A, B
2012, 11, B, C,
2012, 11, C, D,
2013, 11, D, E,
2013, 11, E, F,
2013, 11, F, 
2013, 11, G, G,
2013, 1, H, H,
2013, 11, I, I,
2013, 12, J, J,
2014, 11, K, K,
2014, 4, L, L,
2014, 11, M, M,
2014, 5, N, 
2014, 11, O, N,
2014, 6, P, O,
2015, 11, Q, P,
2015, 11, R, Q,
2015, 11, S, R,
2015, 11, T, S, 

Code:

const fs = require('fs');
const parse = require('csv-parse');
const stringify = require('csv-stringify');
const transform = require('stream-transform');

const paths = {
    input: './sample.csv',
    output: './output.csv',
    error: './errors.csv',
}

var input  = fs.createReadStream(paths.input);
var output = fs.createWriteStream(paths.output);
var error  = fs.createWriteStream(paths.error);

var stringifier = stringify({
    header: true,
    quotedString: true,
});
var parser = parse({
    relax: true,
    delimiter: ',', 
    columns: true, 
    //relax_column_count: true,
})
var transformer = transform((record, callback) => {
    callback(null, record);
}, {parallel: 10});

input.pipe(parser).pipe(transformer).pipe(stringifier).pipe(output);

Thoughts?

I developed a solution to this problem. It doesn't use the pipe API, but instead uses the CSV package's callback API. It's less elegant than I would have liked, but it's functional and has the benefit of explicit error handling, which doesn't bring the process to a grinding halt on an inconsistent number of columns.

The process reads the file line by line, parses each line against a list of expected fields in the settings object (settings.mapping), and then transforms, stringifies, and writes the resulting line of output to the new csv.

I set it up to log the rows whose column count is inconsistent with the header to the error file, along with some extra data for diagnostics (the datetime of execution, the row number, and the full line as text). I didn't set up logging of the other error types, since they are all downstream of the csv structural errors, but you could modify the code to write those errors as well. (You could also probably write them to JSON or a MySQL database, but one thing at a time.)

The good news is that there doesn't appear to be a huge performance hit from this approach compared to the straight pipe approach. I haven't done any formal performance testing, but on a 60MB file performance is roughly the same between the two methods (assuming the file has no inconsistent rows). A definite next step is to look into bundling the writes to disk to reduce I/O, roughly along the lines of the sketch below.
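
What I have in mind is nothing fancier than buffering stringified rows in memory and flushing them in chunks. A rough, untested sketch (writeBatched and flushOutput are hypothetical helpers, and the batch size is arbitrary):

// Rough sketch of batched writes: collect output rows in memory and flush
// them to the write stream in chunks instead of one write() per record.
const BATCH_SIZE = 1000; // arbitrary; tune against memory use and I/O
let outputBuffer = [];

function writeBatched(line) {
    outputBuffer.push(line);
    if (outputBuffer.length >= BATCH_SIZE) flushOutput();
}

function flushOutput() {
    if (outputBuffer.length === 0) return;
    output.write(outputBuffer.join(''));
    outputBuffer = [];
}

// In the sample code below, output.write(record) would become writeBatched(record),
// and the 'end' handler would call flushOutput() to write out the final partial batch.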

I'm still very interested in whether there's a better way to do this, so if you've got an idea, definitely post an answer! In the meantime, I figured I'd post this working answer in case it's useful for others struggling with the same types of inconsistently formatted sources.


Credit where credit is due, specifically to two questions/answers:

  • parsing huge logfiles in Node.js - read in line-by-line
    • This answer adapts some of the core code from the answers there, which split the file so it can be read line by line; that prevents the csv-parse component from shutting down at a failed line (at the expense of the code overhead from splitting the file further upstream). I actually really recommend the use of iconv-lite as it's done in that post, but it wasn't germane to the minimal reproducible example, so I removed it for this post.
  • Error handling with node.js streams
    • This was generally helpful in better understanding the potential and limitations of pipes. It looks like there is, in theory, a way to put what essentially amounts to a pipe splitter onto the outbound pipe from the parser, but given my current time constraints and the challenges associated with an async process that would be fairly unpredictable in terms of stream termination, I used the callback API instead. (A rough, untested sketch of what that splitter might look like is just below, before the sample code.)
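
For anyone curious, here is that rough, untested sketch of a splitter. It assumes the parser runs with relax_column_count: true and without the columns option, so malformed rows come through as arrays of the "wrong" length instead of erroring out, and it reuses the errors write stream from the sample code; expectedColumns is hard-coded for the sample data:

const { Transform } = require('stream');

// Untested sketch of a "pipe splitter": rows with the expected number of
// columns continue down the pipe, malformed rows are diverted to the error log.
// Assumes parse({relax: true, delimiter: ',', relax_column_count: true}) with
// no `columns` option, so each record arrives as a plain array of values.
const expectedColumns = 4; // hard-coded to match the sample data's header

const splitter = new Transform({
    objectMode: true,
    transform(record, _encoding, callback) {
        if (record.length !== expectedColumns) {
            // Divert the malformed row to the error file and drop it from the stream
            errors.write(new Date() + ', Inconsistent Columns, `' + record.join(',') + '`\n');
            return callback();
        }
        callback(null, record);
    },
});

// Roughly: input.pipe(parser).pipe(splitter).pipe(stringifier).pipe(output);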

Sample code:

'use strict'
// Dependencies
const es     = require('event-stream');
const fs     = require('fs');
const parse = require('csv-parse');
const stringify = require('csv-stringify');
const transform = require('stream-transform');

// Reference objects
const paths = {
    input: 'path to input.csv',
    output: 'path to output.csv',
    error: 'path to error output.csv',
}
const settings = {
    mapping: {
        // Each field is an object with the field name as the key
        // and can have additional properties for use in the transform 
        // component of this process
        // Example
        'year' : {
            import: true,
        }
    }
}

const metadata = {
    records: 0,
    error: 0
}

// Set up streams
var input  = fs.createReadStream(paths.input);
var errors  = fs.createWriteStream(paths.error,  {flags: 'ax'});
var output = fs.createWriteStream(paths.output, {flags: 'ax'});

// Begin process (can be refactored into function, but simplified here)
input
  .pipe(es.split()) // split based on row, assumes \n row endings
  .pipe(es.mapSync(line => { // synchronously process each line

    // Remove headers, specified through settings
    if (metadata.records === 0) return metadata.records++;
    var id = metadata.records;

    // Parse csv by row 
    parse(line, {
        relax: true,
        delimiter: ',',
        columns: Object.keys(settings.mapping),
    }, (error, record) => {
        // Note: `record` here is an array containing the single parsed row

        // Write inconsistent column error, then skip this line entirely
        if (error) {
            metadata.error++;
            errors.write(
                new Date() + ', Inconsistent Columns, ' +
                id + ', `' +
                line + '`\n'
            );
            return;
        }

        // Apply transform / reduce
        transform(record, (record) => {
            // Do stuff to record
            return record;
        }, (error, record) => {

            // Throw transform errors
            if (error) {
                throw error;
            }

            // Stringify results and write to new csv
            stringify(record, {
                header: false,
                quotedString: true,
            }, (error, record) => {

                // Log stringify errors
                if (error) {
                    console.log(error);
                }

                // Write record to new csv file
                output.write(record);
            });
        });
    });

    // Increment record count
    metadata.records++;
  }))
  .on('end', () => {
    metadata.records--;
    console.log(metadata)
  })    
