Java Opencsv: parsing a CSV file with double quotes inside a first name and commas inside a double-quoted first name

I have data as follows:

ID1,ID2,FIRST_NAME,LAST_NAME,BIRTH_DATE,HA1,HA2,HA3,STATUS,DT
99,13863926H,MAL"COLMHS,ABBOT,1997-04-09,AMKC,RR,RR  ,DE,
89,12973388H,"SAGAR,TARLE",ABDAT,1997-11-02,RNDC,RR,RR  ,DE,
71,88JunkTest,Howdy,Doody,1985-11-02,RNDC,HA,HACLASSTYPE  ,DE,2019-12-25

I am trying to parse this CSV with OpenCSV. In my file the first name can contain a stray double quote (MAL"COLMHS), can be double-quoted with a comma inside ("SAGAR,TARLE"), or can have no double quotes at all.

With .withIgnoreQuotations(true) I can parse the first row (MAL"COLMHS), but I cannot find a way to parse the second row.

I tried the solutions from several StackOverflow questions, but none of them solved the problem.

I know my CSV file is inconsistent, but the file from the client contains too many such records to fix by hand, so I am looking for an automated solution.

 List<Results> beans = new CsvToBeanBuilder<Results>(new FileReader(file.getAbsolutePath()))
         .withType(Results.class)
         .withIgnoreQuotations(true)
         .build()
         .parse();

Error:

java.lang.RuntimeException: Error parsing CSV line: 3. [3491903139,12973388H,SAGAR,TARLE,ABDAT,1997-11-02,RNDC,RR,RR  ,DE,]
    at com.opencsv.bean.CsvToBean.parse(CsvToBean.java:366)
    at com.apds.partner.nycdoc.main.NycDocApplication.main(NycDocApplication.java:81)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.boot.devtools.restart.RestartLauncher.run(RestartLauncher.java:49)
Caused by: com.opencsv.exceptions.CsvRequiredFieldEmptyException: Number of data fields does not match number of headers.
    at com.opencsv.bean.HeaderColumnNameMappingStrategy.verifyLineLength(HeaderColumnNameMappingStrategy.java:110)
    at com.opencsv.bean.AbstractMappingStrategy.populateNewBean(AbstractMappingStrategy.java:313)
    at com.opencsv.bean.concurrent.ProcessCsvLine.processLine(ProcessCsvLine.java:132)
    at com.opencsv.bean.concurrent.ProcessCsvLine.run(ProcessCsvLine.java:85)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
*****

Edit: I also tried SuperCSV, but ran into the same issue.

You just have a malformed CSV file. According to RFC 4180, section 2, rule 5:

If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.

and rule 7:

If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.

I've looked at this question. Try replacing every lone double quote inside a field with two double quotes, and don't forget to wrap the whole field in double quotes.

In your example, 99,13863926H,"MAL""COLMHS",ABBOT,1997-04-09,AMKC,RR,RR,DE, should work.

UPD: Well, if you do not want to edit the file manually to make it RFC-compliant, I suggest running this regex against your file to check how many malformed records there are: ^(?:\d*,[^,]*,)([^"]\w+(?:"\w+)+)(?:,)

You can then use its only capturing group to extract each malformed name, escape it correctly, write the changes back to the file, and re-read the file with the parser of your choice.
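The repair step described above could look roughly like the following. This is only a sketch: the class name MalformedNameFixer and the method fixLine are made up for illustration, and it assumes the name is always the third column and matches the regex above (which, for instance, needs at least two characters before the stray quote). Lines that do not match are returned unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MalformedNameFixer {
    // The regex from the answer: two leading columns, then an unquoted name
    // containing at least one stray double quote (capturing group 1).
    private static final Pattern MALFORMED =
            Pattern.compile("^(?:\\d*,[^,]*,)([^\"]\\w+(?:\"\\w+)+)(?:,)");

    /** Returns the line with the malformed name quoted and escaped, or the line as-is. */
    public static String fixLine(String line) {
        Matcher m = MALFORMED.matcher(line);
        if (!m.find()) {
            return line; // already well-formed, or not covered by the regex
        }
        String name = m.group(1);
        // RFC 4180 style: double each inner quote, then wrap the field in quotes.
        String escaped = "\"" + name.replace("\"", "\"\"") + "\"";
        return line.substring(0, m.start(1)) + escaped + line.substring(m.end(1));
    }

    public static void main(String[] args) {
        String bad = "99,13863926H,MAL\"COLMHS,ABBOT,1997-04-09,AMKC,RR,RR  ,DE,";
        // prints 99,13863926H,"MAL""COLMHS",ABBOT,1997-04-09,AMKC,RR,RR  ,DE,
        System.out.println(fixLine(bad));
    }
}
```

Applying fixLine to every line and writing the result to a new file would leave the "SAGAR,TARLE" row untouched, so the repaired file can then be read without withIgnoreQuotations(true).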

I think that the real problem here is that your CSV file is non-conformant.

The first data line has 10 fields, one of which contains an unbalanced double quote.

  • If you don't ignore double quotes, then the first data line is not parsable.

  • If you do ignore double quotes, then the second data line has 11 fields.
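The field counts can be checked with a plain split on commas, which is roughly what ignoring quotations amounts to. The class and method names here are made up for illustration.

```java
public class FieldCount {
    /** Counts fields when double quotes are treated as ordinary characters. */
    public static int countIgnoringQuotes(String line) {
        // A limit of -1 keeps the trailing empty field (the empty DT column).
        return line.split(",", -1).length;
    }

    public static void main(String[] args) {
        String first = "99,13863926H,MAL\"COLMHS,ABBOT,1997-04-09,AMKC,RR,RR  ,DE,";
        String second = "89,12973388H,\"SAGAR,TARLE\",ABDAT,1997-11-02,RNDC,RR,RR  ,DE,";
        System.out.println(countIgnoringQuotes(first));  // prints 10
        System.out.println(countIgnoringQuotes(second)); // prints 11
    }
}
```

The header has 10 columns, which is why the second data line fails with "Number of data fields does not match number of headers".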

Basically, the first data line is malformed. It should say this:

 99,13863926H,"MAL""COLMHS",ABBOT,1997-04-09,AMKC,RR,RR  ,DE,

I don't think there is a good way to fix this, apart from rejecting the malformed input:

  • If the problem is bad data, get a human being to fix the (hand-created) input file or the data source that the input file is extracted from.

  • If the problem is in the program that extracts the data and generates the CSV, then fix that program.

If you really want to parse this input as-is, you will need to implement your own CSV parser by hand. OpenCSV won't handle this input, and neither will any other standards-based parser.
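For what it's worth, a hand-written parser for this particular input does not need to be large. The sketch below is one possible set of lenient rules, not a general CSV parser, and all names in it are made up: a double quote opens a quoted field only when it is the first character of the field, "" inside a quoted field is an escaped quote, and a double quote anywhere else is kept as a literal character. It does not handle fields that span multiple lines, and a field that starts with a stray, unclosed quote would swallow the rest of the line.

```java
import java.util.ArrayList;
import java.util.List;

public class LenientCsvLineParser {
    /** Splits one line into fields using the lenient quoting rules described above. */
    public static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;   // inside a field that started with a quote
        boolean fieldStart = true;  // no character of the current field consumed yet

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    current.append(c);          // commas are data inside quotes
                } else if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"');        // "" is an escaped quote
                    i++;
                } else {
                    inQuotes = false;           // closing quote
                }
            } else if (c == ',') {
                fields.add(current.toString()); // field separator
                current.setLength(0);
                fieldStart = true;
                continue;
            } else if (c == '"' && fieldStart) {
                inQuotes = true;                // opening quote of a quoted field
            } else {
                current.append(c);              // includes stray quotes such as MAL"COLMHS
            }
            fieldStart = false;
        }
        fields.add(current.toString());         // last field (may be empty)
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("99,13863926H,MAL\"COLMHS,ABBOT,1997-04-09,AMKC,RR,RR  ,DE,"));
        System.out.println(parseLine("89,12973388H,\"SAGAR,TARLE\",ABDAT,1997-11-02,RNDC,RR,RR  ,DE,"));
    }
}
```

Both sample lines come out with 10 fields under these rules, with the third field being MAL"COLMHS and SAGAR,TARLE respectively. Mapping the resulting lists onto the Results bean would then have to be done by hand.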
