
Parsing file & regex

I have a CSV file that looks like this:

"2014", "2", "AMC-South", "inpatient", "complication", "1", "2", "2", "13,125.83", "6,562.95"

How can I remove all the quotes and the commas separating the items, so that it looks like this?

2014 2 AMC-South inpatient complication 1 2 2 13,125.83 6,562.95

I need this format so I can parse the CSV file more easily (using Java). Thanks.

Command-line one-liner, using Perl:

$ echo '"2014", "2", "AMC-South", "inpatient", "complication", "1", "2", "2", "13,125.83", "6,562.95"'
"2014", "2", "AMC-South", "inpatient", "complication", "1", "2", "2", "13,125.83", "6,562.95"


$ echo '"2014", "2", "AMC-South", "inpatient", "complication", "1", "2", "2", "13,125.83", "6,562.95"' | perl -pe 's/^"//; s/", "/ /g; s/"$//;'
2014 2 AMC-South inpatient complication 1 2 2 13,125.83 6,562.95

Please note that this only works correctly for CSV whose fields do not contain white space: in the space-separated output, a space inside a field is indistinguishable from a separator. That is why the CSV has those " around each field.

IMHO you should look for a Java CSV parser module. It will make life much easier in the long run.
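For readers who want to stay in Java, the three Perl substitutions above can be mirrored roughly as follows. This is only a sketch: the class and method names (StripQuotes, stripQuotes) are made up for illustration, and it assumes the separator is always exactly `", "` as in the sample line.

```java
public class StripQuotes {
    // Mirrors the three Perl substitutions: drop the leading quote,
    // turn each `", "` separator into one space, drop the trailing quote.
    static String stripQuotes(String line) {
        return line.replaceAll("^\"", "")      // s/^"//
                   .replace("\", \"", " ")     // s/", "/ /g (literal match)
                   .replaceAll("\"$", "");     // s/"$//
    }

    public static void main(String[] args) {
        String line = "\"2014\", \"2\", \"AMC-South\", \"13,125.83\", \"6,562.95\"";
        System.out.println(stripQuotes(line));
    }
}
```

The same caveat applies as for the Perl version: a field that itself contains a space cannot be told apart from two fields in the result.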

Here is an algorithm outline:

The Java String replace() method returns a string in which every occurrence of the old char or CharSequence is replaced by the new one. It matches its first argument literally, so a regex such as [\",]+ is not interpreted as a pattern:

String replaceString = your_string.replace("string_to_replace", "replacement");

Consider this instead:

replaceAll(String regex, String replacement)

Replaces each substring of this string that matches the given regular expression with the given replacement.
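To make the difference concrete, here is a small sketch that runs the same argument, [\",]+, through both methods. The class and method names are hypothetical. Note that this particular regex also removes the commas inside the numeric values, which is not what the question asks for, so treat it as an illustration of the two methods rather than a finished solution.

```java
public class ReplaceVsReplaceAll {
    // replace() treats its first argument as literal text.
    static String literal(String s) {
        return s.replace("[\",]+", "");
    }

    // replaceAll() treats its first argument as a regular expression:
    // here, one or more double quotes or commas.
    static String regex(String s) {
        return s.replaceAll("[\",]+", "");
    }

    public static void main(String[] args) {
        String line = "\"2\", \"13,125.83\"";
        System.out.println(literal(line)); // unchanged: the literal text [",]+ never occurs
        System.out.println(regex(line));   // prints: 2 13125.83
    }
}
```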

Possible regex

A workaround for the CSV issue: since several values contain commas, you could split on the separator sequence ", " (quote, comma, space, quote). Then all you need to do is remove the leading " from the first element and the trailing " from the last.

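// Assumes scanner is a java.util.Scanner over the CSV file
String[] data = scanner.nextLine().split("\", \"");

if (data.length > 0)
{
    // Strings are immutable, so the result must be assigned back
    data[0] = data[0].replaceAll("\"", "");
    data[data.length - 1] = data[data.length - 1].replaceAll("\"", "");
}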
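As a self-contained sketch of this split-based approach, the snippet can be wrapped in a method that takes one line and returns the cleaned fields. The names SplitOnSeparator and fields are invented for the example, and it assumes every separator is exactly `", "`.

```java
import java.util.Arrays;

public class SplitOnSeparator {
    // Splits on the literal sequence `", "` and then strips the one
    // remaining quote at the start of the first field and at the end of the last.
    static String[] fields(String line) {
        String[] data = line.split("\", \"");
        if (data.length > 0) {
            data[0] = data[0].replace("\"", "");
            int last = data.length - 1;
            data[last] = data[last].replace("\"", "");
        }
        return data;
    }

    public static void main(String[] args) {
        String line = "\"2014\", \"AMC-South\", \"13,125.83\"";
        System.out.println(Arrays.toString(fields(line)));
    }
}
```

Because the fields come back as an array rather than one space-separated string, values that contain spaces or commas stay intact.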

You could also split on the regex "[\\D+],[\\D+]" and, once the array is returned, remove every " from each string in it.
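A minimal sketch of this variant follows, with hypothetical names (SplitOnRegex, fields). The pattern matches a comma with one non-digit (or +) character on each side, so the commas inside numbers such as 13,125.83 are left alone.

```java
import java.util.Arrays;

public class SplitOnRegex {
    // Splits where a comma sits between two non-digit characters
    // (for the sample data: a quote before it and a space after it),
    // then removes every remaining double quote from each piece.
    static String[] fields(String line) {
        String[] data = line.split("[\\D+],[\\D+]");
        for (int i = 0; i < data.length; i++) {
            data[i] = data[i].replace("\"", "");
        }
        return data;
    }

    public static void main(String[] args) {
        String line = "\"2014\", \"AMC-South\", \"13,125.83\"";
        System.out.println(Arrays.toString(fields(line)));
    }
}
```

One limitation to keep in mind: a comma between two non-digit characters inside a field would also be treated as a separator, and the characters on either side of it would be consumed by the split.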

Have you considered using a library to parse the data? Apache Commons CSV is great for that: https://commons.apache.org/proper/commons-csv/

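File csvData = new File("/path/to/csv");
// CSVParser.parse(File, ...) also takes a Charset (java.nio.charset.StandardCharsets)
CSVParser parser = CSVParser.parse(csvData, StandardCharsets.UTF_8, CSVFormat.DEFAULT);
for (CSVRecord record : parser) {
     ...
}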

Regex: ",?

Details:

  • " matches a literal double quote
  • ,? matches a comma zero or one time

Java code:

String text = "\"2014\", \"2\", \"AMC-South\", \"inpatient\", \"complication\", \"1\", \"2\", \"2\", \"13,125.83\", \"6,562.95\"";
text = text.replaceAll("\",?", "");

System.out.println(text);

Output:

2014 2 AMC-South inpatient complication 1 2 2 13,125.83 6,562.95
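The result above is a single space-separated string, which becomes ambiguous as soon as a field contains a space. A possible variation, not taken from any of the answers, is to collect the text between each pair of quotes with a Matcher instead of deleting the quotes. The names QuotedFields and fields are hypothetical, and the sketch assumes fields never contain an escaped quote.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedFields {
    // An opening quote, a captured run of non-quote characters, a closing quote.
    private static final Pattern FIELD = Pattern.compile("\"([^\"]*)\"");

    // Returns the contents of every quoted field on the line, in order.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "\"2014\", \"AMC South\", \"13,125.83\"";
        System.out.println(fields(line));
    }
}
```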


 