Regex split a string using newline (unless it is between double quotes)

I'm doing some delimited file handling. The first thing I need to do is get all the "lines". After getting each line, I can split it on the specified delimiter. So, to get the lines, I need to split a string on the various line terminators (\r\n, \r, \n). The following worked until I encountered a newline inside a double-quoted field:

return content.Split(new string[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);

So consider the following text (my original text escaped double quotes within double quotes with \" instead of ""), where each line is delimited by one of the line terminators, and each field/column in the line is delimited by the pipe "|" character:

string s = "row1 col1|\"row1 \"\"col2a\"\"\r\nrow1 col2b\"|row1 col3\nrow2 col1|\"row2 \"\"col2a\"\"\rrow2 \"\"col2b\"\"\"|row2 col3\r\nrow3 col1|\"row3 col2a\nrow3 col2b\"|row3 col3";

Which equals the following string:

row1 col1|"row1 ""col2a""{CRLF}row1 col2b"|row1 col3{LF}row2 col1|"row2 ""col2a""{CR}row2 ""col2b"""|row2 col3{CRLF}row3 col1|"row3 col2a{LF}row3 col2b"|row3 col3

Splitting the above with my original method results in 6 lines, because every line break inside a quoted field also counts as a separator:

string[] result = s.Split(new string[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);
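
To make the problem concrete, here is a minimal sketch that runs the original split over the sample text and prints each piece in brackets. The class and method names (NaiveSplitDemo, NaiveSplit) are made up for illustration; only the Split call comes from the question.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Same sample text as in the question: three logical rows,
    // each with one line break inside a quoted field.
    public const string Sample =
        "row1 col1|\"row1 \"\"col2a\"\"\r\nrow1 col2b\"|row1 col3\n" +
        "row2 col1|\"row2 \"\"col2a\"\"\rrow2 \"\"col2b\"\"\"|row2 col3\r\n" +
        "row3 col1|\"row3 col2a\nrow3 col2b\"|row3 col3";

    public static string[] NaiveSplit(string content)
    {
        // Splits on every line break, whether or not it sits inside quotes.
        return content.Split(new string[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);
    }

    public static void Main()
    {
        foreach (string piece in NaiveSplit(Sample))
        {
            Console.WriteLine("[" + piece + "]");
        }
    }
}
```

Each quoted line break cuts a row in two, so some pieces start or end in the middle of a quoted field.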

But I would like splitting the above on a line delimiter (\r\n, \r, \n) to result in 3 lines:

result[0] == "row1 col1|\"row1 \"\"col2a\"\"\r\nrow1 col2b\"|row1 col3"
result[1] == "row2 col1|\"row2 \"\"col2a\"\"\rrow2 \"\"col2b\"\"\"|row2 col3"
result[2] == "row3 col1|\"row3 col2a\nrow3 col2b\"|row3 col3"

Has anyone had any luck coming up with a regex that splits on line breaks (except within quotes)?

Here is what I ended up with, thanks to Alan:

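// Requires: using System.Text.RegularExpressions;
public string[] GetLines (string fileContent) {
    // One match per logical line. The outer group repeats so a line may contain several quoted fields.
    Regex regex = new Regex(@"^[^""\r\n]*(?:(?:""[^""]*"")+[^""\r\n]*)*", RegexOptions.Multiline);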
    MatchCollection matchCollection = regex.Matches(fileContent);
    string[] result = new string[matchCollection.Count];
    for (int i = 0; i < matchCollection.Count; i++) {
        Match match = matchCollection[i];
        result[i] = match.Value;
    }
    return result;
}
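
Once the lines are available, each one still has to be split on the field delimiter without breaking quoted fields. The following is only a sketch of one way to do that, not part of the original solution: FieldSplitter and SplitFields are hypothetical names, and the code assumes quotes are balanced within a line and that the delimiter does not itself contain a double quote.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FieldSplitter
{
    // Hypothetical helper, not from the original post.
    public static string[] SplitFields(string line, string delimiter)
    {
        // Split on the delimiter only when an even number of quotes follows it,
        // i.e. when the delimiter itself is outside any quoted field.
        string pattern = Regex.Escape(delimiter) + @"(?=(?:[^""]*""[^""]*"")*[^""]*$)";
        return Regex.Split(line, pattern);
    }

    public static void Main()
    {
        string line = "row1 col1|\"row1 \"\"col2a\"\"\r\nrow1 col2b\"|row1 col3";
        foreach (string field in SplitFields(line, "|"))
        {
            Console.WriteLine("[" + field + "]");
        }
    }
}
```

The lookahead rescans the rest of the line for every delimiter, so this is quadratic in the line length; that is usually fine for ordinary rows but worth keeping in mind for very long ones.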

I would use Matches() instead of Split():

Regex r = new Regex(@"(?m)^[^""\r\n]*(?:(?:""[^""]*"")+[^""\r\n]*)*");
MatchCollection m = r.Matches(s);

The inner part, (?:"[^"]*")+ , matches a double-quoted string that may contain escaped quotes: a doubled quote simply ends one quoted run and immediately starts the next. The whole regex matches a line that may contain one or more double-quoted strings. Note that the inner character class ( [^"] ) can match \r and \n , whereas the outer ones ( [^"\r\n] ) explicitly exclude them. The line-start anchor ( ^ in multiline mode) prevents spurious empty matches between real matches.

Here's a demo. (It's in PCRE, but I've tested it in .NET, too.)
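
For readers who want to try this against the sample text from the question, here is a self-contained sketch. It uses the same pattern, only laid out with RegexOptions.IgnorePatternWhitespace so each part can carry a comment; the class and method names (QuotedLineSplitter, GetLines) are illustrative. One caveat: in .NET, multiline ^ only matches after \n, so rows separated by a bare \r outside quotes would not be picked up by this approach.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedLineSplitter
{
    // Same pattern as in the answer, spread over several lines for readability.
    private static readonly Regex LineRegex = new Regex(@"
        ^                      # start of a line (Multiline option)
        [^""\r\n]*             # unquoted text before the first quote
        (?:
            (?:""[^""]*"")+    # one or more adjacent quoted runs; these may span line breaks
            [^""\r\n]*         # unquoted text after them
        )*
        ", RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

    public static string[] GetLines(string content)
    {
        return LineRegex.Matches(content)
                        .Cast<Match>()
                        .Select(m => m.Value)
                        .ToArray();
    }

    public static void Main()
    {
        string s = "row1 col1|\"row1 \"\"col2a\"\"\r\nrow1 col2b\"|row1 col3\n" +
                   "row2 col1|\"row2 \"\"col2a\"\"\rrow2 \"\"col2b\"\"\"|row2 col3\r\n" +
                   "row3 col1|\"row3 col2a\nrow3 col2b\"|row3 col3";
        foreach (string line in GetLines(s))
        {
            Console.WriteLine("[" + line + "]");
        }
    }
}
```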

You can try the following regex:

var fieldSeparator = "|";
var strRx = $@"""[^""\r\n]*""{fieldSeparator}[^|]+(?:\s*)";

var rx = new Regex(strRx);
var data = "row1 col1|\"row1 \\\"col2a\\\"\r\nrow1 \\\"col2b\\\"\"|row1 col3\nrow2 col1|\"row2 col2a\rrow2 col2b\"";

var m = rx.Match(data);

while (m.Success)
{
    Console.WriteLine(m.Value);
    m = m.NextMatch();
}

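Replace the fieldSeparator value with whatever field delimiter you want to use. Note that the pattern also hardcodes | in [^|]+ , and that the interpolated separator is not escaped, so "|" acts as regex alternation here; a different delimiter needs both parts adjusted.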

The above code snippet produces the following output:

row1 col1
"row1 \"
col2a\"
row1 \"col2b\""
row1 col3
row2 col1
row2 col2b"
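
For comparison with the regex approaches in this thread, the same line splitting can be sketched without a regex as a single pass that tracks whether the current character is inside quotes. This is a swapped-in technique, not something proposed in the answers; QuoteAwareLineScanner and SplitLines are hypothetical names, and the sketch assumes quotes are escaped by doubling ("").

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuoteAwareLineScanner
{
    public static List<string> SplitLines(string content)
    {
        var lines = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < content.Length; i++)
        {
            char c = content[i];
            if (c == '"')
            {
                // A doubled quote toggles the flag twice, leaving the state unchanged.
                inQuotes = !inQuotes;
                current.Append(c);
            }
            else if (!inQuotes && (c == '\r' || c == '\n'))
            {
                // Treat \r\n as a single line break.
                if (c == '\r' && i + 1 < content.Length && content[i + 1] == '\n')
                {
                    i++;
                }
                lines.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        // The text after the last unquoted line break is the final line.
        lines.Add(current.ToString());
        return lines;
    }
}
```

Unlike multiline ^ in .NET, this treats a bare \r outside quotes as a line break too, and like Split with StringSplitOptions.None it yields an empty final entry when the input ends with a line break.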
