
Parsing CSV input with a RegEx in Java

I know, now I have two problems. But I'm having fun!

I started with this advice not to try and split, but instead to match on what is an acceptable field, and expanded from there to this expression.

final Pattern pattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");

Without the annoying escaped quotes, the expression looks like this:

"([^"]*)"|(?<=,|^)([^,]*)(?=,|$)

This is working well for me - either it matches on "two quotes and whatever is between them", or "something between the start of the line or a comma and the end of the line or a comma". Iterating through the matches gets me all the fields, even if they are empty. For instance,

the quick,"brown, fox jumps",over,"the",,"lazy dog"

breaks down into

the quick
"brown, fox jumps"
over
"the"

"lazy dog"

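The breakdown above can be reproduced with a short loop over Matcher.find(). This is only a minimal sketch: the class and method names (FieldMatchDemo, matchFields) are made up for illustration, and the input is the sample line written without spaces after the commas, as in the final code further down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself comes from the question.
class FieldMatchDemo {

    private static final Pattern FIELD =
            Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");

    // Collects the full text of every match, surrounding quotes included.
    static List<String> matchFields(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            fields.add(m.group());
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        // Brackets make the empty field visible in the output.
        for (String field : matchFields(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Because m.group() returns the whole match, the quoted fields still carry their quotes at this stage.
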
Great! Now I want to drop the quotes, so I added lookahead and lookbehind non-capturing groups for them, like I was doing for the commas.

final Pattern pattern = Pattern.compile("(?<=\")([^\"]*)(?=\")|(?<=,|^)([^,]*)(?=,|$)");

Again, the expression is:

(?<=")([^"]*)(?=")|(?<=,|^)([^,]*)(?=,|$)

Instead of the desired result

the quick
brown, fox jumps
over
the

lazy dog

now I get this breakdown:

the quick
"brown
 fox jumps"
,over,
"the"
,,
"lazy dog"

What am I missing?
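A small harness makes the second breakdown above easier to inspect. The sketch below (LookaroundDemo and findAll are hypothetical names) prints each match together with its start offset. One thing it makes visible: a lookbehind such as (?<=") consumes nothing, so it is satisfied right after a closing quote as well as after an opening one, which is how ,over, and ,, can turn up as matches. The opening quote itself is preceded by a comma, so the second alternative starts a match there and runs to the next comma.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the regex is the lookaround version from the question.
class LookaroundDemo {

    private static final Pattern FIELD =
            Pattern.compile("(?<=\")([^\"]*)(?=\")|(?<=,|^)([^,]*)(?=,|$)");

    // Returns the text of every match in the order find() reports them.
    static List<String> findAll(String line) {
        List<String> matches = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // The start offset shows which character each match begins on.
            System.out.println(m.start() + ": [" + m.group() + "]");
        }
    }
}
```
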

Operator precedence. Basically there is none. It's all left to right. So the or (|) is applying to the closing quote lookahead and the comma lookahead.

Try:

(?:(?<=")([^"]*)(?="))|(?<=,|^)([^,]*)(?=,|$)

(?:^|,)\s*(?:(?:(?=")"([^"].*?)")|(?:(?!")(.*?)))(?=,|$)

This should do what you want.

Explanation:

(?:^|,)\s*

The pattern should start with a comma or the beginning of the string. Also, ignore all whitespace at the beginning.

Look ahead and see if the rest starts with a quote.

(?:(?=")"([^"].*?)")

If it does, then match non-greedily until the next quote.

(?:(?!")(.*?))

If it does not begin with a quote, then match non-greedily until the next comma or the end of the string.

(?=,|$)

The pattern should end with a comma or the end of the string.
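To see how the pieces explained above fit together, here is a minimal sketch that applies the full expression and reads the field out of whichever capturing group matched. The class and method names (GroupedCsvDemo, fields) are assumptions made for the example; only the regex is taken from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class wrapping the regex explained above.
class GroupedCsvDemo {

    private static final Pattern FIELD = Pattern.compile(
            "(?:^|,)\\s*(?:(?:(?=\")\"([^\"].*?)\")|(?:(?!\")(.*?)))(?=,|$)");

    // Group 1 holds a quoted field without its quotes; group 2 holds an unquoted field.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            String quoted = m.group(1);
            out.add(quoted != null ? quoted : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String field : fields(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Note that [^"] requires at least one character between the quotes, so an empty quoted field ("") is not matched by this expression and would need separate handling.
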

When I started to understand what I had done wrong, I also started to understand how convoluted the lookarounds were making this. I finally realized that I didn't want all the matched text, I wanted specific groups inside of it. I ended up using something very similar to my original RegEx, except that I didn't do a lookahead on the closing comma, which I think should be a little more efficient. Here is my final code.

package regex.parser;

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CSVParser {

    /*
     * This Pattern will match on either quoted text or text between commas, including
     * whitespace, and accounting for beginning and end of line.
     */
    private final Pattern csvPattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?:,|$)");
    private ArrayList<String> allMatches = null;
    private Matcher matcher = null;
    private int size;

    public CSVParser() {
        allMatches = new ArrayList<String>();
        matcher = null;
    }

    public String[] parse(String csvLine) {
        matcher = csvPattern.matcher(csvLine);
        allMatches.clear();
        String match;
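        while (matcher.find()) {
            // Group 1 is set when a quoted field matched (quotes excluded).
            match = matcher.group(1);
            if (match!=null) {
                allMatches.add(match);
            }
            else {
                // Otherwise group 2 holds the unquoted field, possibly empty.
                allMatches.add(matcher.group(2));
            }
        }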

        size = allMatches.size();       
        if (size > 0) {
            return allMatches.toArray(new String[size]);
        }
        else {
            return new String[0];
        }           
    }   

    public static void main(String[] args) {        
        String lineinput = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";

        CSVParser myCSV = new CSVParser();
        System.out.println("Testing CSVParser with: \n " + lineinput);
        for (String s : myCSV.parse(lineinput)) {
            System.out.println(s);
        }
    }

}

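I know this isn't what the OP wanted, but for other readers: one of the String.replace methods can be used to strip the quotes from each element of the result array produced by the OP's current regex.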
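As a rough illustration of that suggestion, the sketch below post-processes an array of matched fields with String.replace. The class and method names (QuoteStripDemo, stripQuotes) are invented for the example, and it assumes that removing every double quote is acceptable, including any that appear inside a field.

```java
import java.util.Arrays;

// Hypothetical helper; not part of the original parser.
class QuoteStripDemo {

    // Returns a new array in which every double-quote character has been removed.
    static String[] stripQuotes(String[] fields) {
        String[] out = new String[fields.length];
        for (int i = 0; i < fields.length; i++) {
            out[i] = fields[i].replace("\"", "");
        }
        return out;
    }

    public static void main(String[] args) {
        String[] matched = {"the quick", "\"brown, fox jumps\"", "over", "\"the\"", "", "\"lazy dog\""};
        System.out.println(Arrays.toString(stripQuotes(matched)));
    }
}
```
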

