Can the LumenWorks CSV parser throw an error when a row has too many columns?

I am using LumenWorks.Framework.IO.Csv.CsvReader to read CSV files and would like to detect badly formed files. If a row has fewer columns than the header, it throws LumenWorks.Framework.IO.Csv.MissingFieldCsvException. However, if a row has more columns than the header, it silently truncates the row when parsing it. Are there any properties I can set to make it throw instead? Or is there another CSV parser that is efficient, easy to use, and will detect this issue?

My test file looks like this:

Field 1,Field 2,Field 3,Field 4
This,data,looks,ok
But,this,has,too,many,fields

My integration test (NUnit) looks like this:

[Test, ExpectedException(typeof(MalformedCsvException))]
public void Row_cannot_have_more_fields_than_the_header()
{
    using (var stream = File.OpenText("MoreColumnsThanHeader.csv"))
        new CsvParser().ReadCsv(stream);
}

And this is my code to read the data:

public DataSubmission ReadCsv(StreamReader streamReader)
{
    var data = new DataSubmission();
    using (var reader = new CsvReader(streamReader, true))
    {
        var items = new List<Row>();
        var fieldCount = reader.FieldCount; //this is 4 in the test
        var headers = reader.GetFieldHeaders();
        while (reader.ReadNextRecord()) //reader has a size 4 array for the 6 item row
            items.Add(ReadRow(fieldCount, headers, reader));
        data.Items = items;
    }
    return data;
}

private static Row ReadRow(int fieldCount, IList<string> headers, CsvReader reader)
{
    var item = new Row();
    var fields = new List<Field>();
    for (var index = 0; index < fieldCount; index++)
        fields.Add(ReadField(headers, reader, index));
    item.Fields = fields;
    return item;
}

private static Field ReadField(IList<string> headers, CsvReader reader, int index)
{
    return new Field {FieldName = headers[index], FieldValue = NullifyEmptyString(reader, index)};
}

private static string NullifyEmptyString(CsvReader reader, int index)
{
    return string.IsNullOrWhiteSpace(reader[index]) ? null : reader[index];
}

EDIT: Since creating this question I have changed my CSV parser to use Microsoft.VisualBasic.FileIO.TextFieldParser. It is easy to use, performs well even with large files, and is more robust than the LumenWorks offering. I had issues with the LumenWorks parser when dealing with line breaks inside a quoted string; the Microsoft parser handles this well.
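A minimal sketch of how the column-count check might look with TextFieldParser. The class name StrictCsv, the method name ReadRows, and the choice of InvalidDataException are assumptions made for illustration and are not from the original code; TextFieldParser does not enforce a fixed field count itself, so the comparison against the header is done by hand:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class StrictCsv
{
    // Hypothetical helper: returns the data rows (header excluded) and throws
    // if any row's field count differs from the header's field count.
    public static List<string[]> ReadRows(TextReader input)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(input))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true; // allows commas and line breaks inside quotes

            if (parser.EndOfData)
                return rows; // empty input: nothing to validate

            string[] headers = parser.ReadFields();

            while (!parser.EndOfData)
            {
                // Capture the line number before reading; it advances after ReadFields().
                long lineNumber = parser.LineNumber;
                string[] fields = parser.ReadFields();

                if (fields.Length != headers.Length)
                {
                    throw new InvalidDataException(string.Format(
                        "Line {0}: expected {1} fields but found {2}.",
                        lineNumber, headers.Length, fields.Length));
                }
                rows.Add(fields);
            }
        }
        return rows;
    }
}
```

With the test file above, a call such as StrictCsv.ReadRows(File.OpenText("MoreColumnsThanHeader.csv")) would be expected to throw on the third line, since that row has six fields against a four-field header. The NUnit test would then need to expect InvalidDataException (or whichever exception type is chosen) rather than MalformedCsvException.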

Try using the DataTable CSV reader (NuGet package csvtools) from Mike Stall.

If you set allowMismatch = false in any of the Read methods in DataTable.New, it will throw an exception whenever the number of columns in a row does not equal the expected number of columns.

The approach I took was to use File.ReadAllLines() and then spin up a CsvReader for each line individually, comparing its column count to that of the header row. If any record has extra commas, its column count will be higher. Something like this:

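// fileColumnCount is assumed to hold the number of columns in the header row.
var rawRecords = File.ReadAllLines(dataFileName);
foreach (string rawRecord in rawRecords)
{
    // Parse each physical line on its own (no header), so FieldCount reflects that line only.
    using (CsvReader csvRawRecord = new CsvReader(new StringReader(rawRecord), false))
    {
        if (csvRawRecord.FieldCount != fileColumnCount)
        {
            return false; // this line has too many or too few columns
        }
    }
}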

Get the FieldCount inside ReadRow and check it against the fieldCount passed in from the header row. If it's greater, then throw an exception.
