
Can the LumenWorks CSV parser throw an error when there are too many columns in a row?

I am using LumenWorks.Framework.IO.Csv.CsvReader to read CSV files and would like to detect badly formed files. If a row has fewer columns than the header, it throws LumenWorks.Framework.IO.Csv.MissingFieldCsvException. However, if a row has more columns than the header, it silently truncates the row when parsing. Are there any properties I can set to make it throw? Or is there another CSV parser that is efficient, easy to use, and will detect this issue?

My test file looks like

Field 1,Field 2,Field 3,Field 4
This,data,looks,ok
But,this,has,too,many,fields

My integration test (NUnit) looks like

[Test, ExpectedException(typeof(MalformedCsvException))]
public void Row_cannot_have_more_fields_than_the_header()
{
    using (var stream = File.OpenText("MoreColumnsThanHeader.csv"))
        new CsvParser().ReadCsv(stream);
}

and my code to read the data

public DataSubmission ReadCsv(StreamReader streamReader)
{
    var data = new DataSubmission();
    using (var reader = new CsvReader(streamReader, true))
    {
        var items = new List<Row>();
        var fieldCount = reader.FieldCount; //this is 4 in the test
        var headers = reader.GetFieldHeaders();
        while (reader.ReadNextRecord()) //reader has a size 4 array for the 6 item row
            items.Add(ReadRow(fieldCount, headers, reader));
        data.Items = items;
    }
    return data;
}

private static Row ReadRow(int fieldCount, IList<string> headers, CsvReader reader)
{
    var item = new Row();
    var fields = new List<Field>();
    for (var index = 0; index < fieldCount; index++)
        fields.Add(ReadField(headers, reader, index));
    item.Fields = fields;
    return item;
}

private static Field ReadField(IList<string> headers, CsvReader reader, int index)
{
    return new Field {FieldName = headers[index], FieldValue = NullifyEmptyString(reader, index)};
}

private static string NullifyEmptyString(CsvReader reader, int index)
{
    return string.IsNullOrWhiteSpace(reader[index]) ? null : reader[index];
}

EDIT: Since creating this question I have switched my CSV parser to Microsoft.VisualBasic.FileIO.TextFieldParser. It is easy to use, performs well even with large files, and is more robust than the LumenWorks offering. I had issues with the LumenWorks parser when dealing with line breaks inside a quoted string; the Microsoft parser handles this well.
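
For reference, here is a minimal sketch of that approach. TextFieldParser only throws MalformedLineException for quoting problems, so the extra-field check below is my own addition; the CsvValidator/ReadAndValidate names are just for illustration:

using System;
using Microsoft.VisualBasic.FileIO; // reference Microsoft.VisualBasic.dll

public static class CsvValidator
{
    public static void ReadAndValidate(string path)
    {
        using (var parser = new TextFieldParser(path))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true; // copes with line breaks inside quoted fields

            string[] headers = parser.ReadFields();
            var record = 1;
            while (!parser.EndOfData)
            {
                record++;
                string[] fields = parser.ReadFields();
                if (fields.Length != headers.Length)
                    throw new InvalidOperationException(string.Format(
                        "Record {0}: expected {1} fields but found {2}.",
                        record, headers.Length, fields.Length));
                // ...map fields onto Row/Field objects as before
            }
        }
    }
}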

Try using the DataTable CSV reader (the CsvTools NuGet package) from Mike Stall.

If you set allowMismatch = false in any of the Read methods on DataTable.New, it will throw an exception when the number of columns in a given row does not equal the expected number of columns.
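
A hedged sketch of the call (I am assuming allowMismatch is exposed as a parameter on the Read overloads, per the description above; check the package's actual signatures):

using DataAccess; // from the CsvTools NuGet package

// Throws when a row's column count differs from the header's
// (allowMismatch parameter assumed from the answer above).
var table = DataTable.New.Read("MoreColumnsThanHeader.csv", allowMismatch: false);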

The approach I took was to use File.ReadAllLines() and then spin up a CsvReader for each line individually, comparing each line's column count to that of the header row. If a record has extra commas, its column count will be higher. Something like this:

var rawRecords = File.ReadAllLines(dataFileName);
foreach (string rawRecord in rawRecords)
{
    // Parse each raw line in isolation (no headers) so FieldCount
    // reflects that line alone rather than the header row.
    using (CsvReader csvRawRecord = new CsvReader(new StringReader(rawRecord), false))
    {
        if (csvRawRecord.FieldCount != fileColumnCount)
        {
            return false; // column count mismatch: reject the file
        }
    }
}
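
Here fileColumnCount is assumed to be the field count taken from the header row, which can be captured the same way:

int fileColumnCount;
using (var headerReader = new CsvReader(new StringReader(rawRecords[0]), false))
    fileColumnCount = headerReader.FieldCount;

Note that File.ReadAllLines splits on every newline, so this check will misfire on files with line breaks inside quoted fields (the same case called out in the edit above).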

Get the FieldCount inside ReadRow and check it against the fieldCount passed in from the header row; if it is greater, throw an exception.
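
As a sketch, the guard would sit at the top of ReadRow. Note the question reports that the reader exposed a size-4 array even for the 6-field row, so whether FieldCount actually grows past the header width depends on your LumenWorks version; verify before relying on it:

private static Row ReadRow(int fieldCount, IList<string> headers, CsvReader reader)
{
    // Fail fast if the current record is wider than the header row.
    if (reader.FieldCount > fieldCount)
        throw new MalformedCsvException(string.Format(
            "Expected {0} fields but found {1}.", fieldCount, reader.FieldCount));

    // ...rest of ReadRow as before
}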
