Validating a CSV file with a regular expression in Java

Question

The file is structured as follows:

"group","type","scope","name","attribute","value"
"c","","Probes Count","Counter","value","35"
"b","ProbeInformation","Probes Count","Gauge","value","0"

Quotes are always used. There is also a trailing newline.

Here is what I have:

^(\"[^,\"]*\")(,(\"[^,\"]*\"))*(.(\"[^,\"]*\")(,(\"[^,\"]*\")))*.$

This does not match correctly. I am using String.matches(regexp);

Answer 1

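Disclaimer: I haven't even tried to compile my code, but this pattern has worked for me before.

When I can't tell at a glance what a regex does, I break it up across lines so it is easier to figure out what is going on. Mismatched parens become more obvious, and you can even add comments. Also, let's put the Java code around it so the escaping weirdness becomes clear.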

^(\"[^,\"]*\")(,(\"[^,\"]*\"))*(.(\"[^,\"]*\")(,(\"[^,\"]*\")))*.$

becomes

String regex = "^" +
               "(\"[^,\"]*\")" +
               "(," +
                 "(\"[^,\"]*\")" +
               ")*" +
               "(." +
                 "(\"[^,\"]*\")" +
                 "(," +
                    "(\"[^,\"]*\")" +
                 ")" +
               ")*" +
               ".$";

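Much better. Now down to business: the first thing I see is your regex for the quoted values. It doesn't allow commas inside the strings, which probably isn't what you want, so let's fix that. Let's also put it in its own variable so we don't mistype it at some point. Finally, let's add comments so we can verify what the regex is doing.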

final String QUOTED_VALUE = "\"[^\"]*\""; // A double quote character, zero or more non-double quote characters, and another double quote
String regex = "^" +                           // The beginning of the string
               "(" + QUOTED_VALUE + ")" +      // Capture the first value
               "(," +                          // Start a group, a comma
                 "(" + QUOTED_VALUE + ")" +    // Capture the next value
               ")*" +                          // Close the group.  Allow zero or more of these
               "(." +                          // Start a group, any character
                 "(" + QUOTED_VALUE + ")" +      // Capture another value
                 "(," +                            // Started a nested group, a comma
                    "(" + QUOTED_VALUE + ")" +     // Capture the next value
                 ")" +                             // Close the nested group
               ")*" +                            // Close the group.  Allow zero or more
               ".$";                           // Any character, the end of the input

Things are getting clearer. I see two big things here:

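1) (I think) you are trying to match the newlines in your input string. I'll play along, but splitting the input on newlines is cleaner and easier than what you are doing (that's an exercise you can do yourself). You also need to watch out for the different newline conventions that different operating systems use (worth reading up on).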

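2) You're capturing too much. You'll want to use non-capturing groups, or parsing your output will be difficult and error-prone (also worth reading up on).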

final String QUOTED_VALUE = "\"[^\"]*\""; // A double quote character, zero or more non-double quote characters, and another double quote
final String NEWLINE = "(?:\r\n|\n|\r)";  // A newline for (almost) any OS: Windows, *NIX or Mac (non-capturing)
String regex = "^" +                           // The beginning of the string
               "(" + QUOTED_VALUE + ")" +   // Capture the first value
               "(?:," +                       // Start a group, a comma
                 "(" + QUOTED_VALUE + ")" + // Capture the next value
               ")*" +                       // Close the group.  Allow zero or more of these
               "(?:" + NEWLINE +            // Start a group, a newline
                 "(" + QUOTED_VALUE + ")" +   // Capture another value
                 "(?:," +                       // Started a nested group, a comma
                    "(" + QUOTED_VALUE + ")" +  // Capture the next value
                 ")" +                          // Close the nested group
               ")*" +                         // Close the group.  Allow zero or more
               NEWLINE + "$";                 // A trailing newline, the end of the input
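
A few of the claims so far are easy to check in isolation. The sketch below is only an illustration (the class name RegexBehaviorDemo and the helper groupCount are made up for this example): it shows that the original value pattern rejects a comma inside quotes, that '.' does not match a line terminator unless DOTALL is switched on, and that a plain group adds a capture where (?: ... ) does not.

```java
import java.util.regex.Pattern;

class RegexBehaviorDemo {
    // Counts the capturing groups a pattern defines (group 0 is not included).
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // The original value pattern rejects a comma inside the quotes; the relaxed one accepts it.
        System.out.println("\"Doe, Jane\"".matches("\"[^,\"]*\""));  // false
        System.out.println("\"Doe, Jane\"".matches("\"[^\"]*\""));   // true

        // By default '.' does not match a line terminator; the inline DOTALL flag (?s) changes that.
        System.out.println("\n".matches("."));      // false
        System.out.println("\n".matches("(?s)."));  // true

        // A plain group around the comma adds a capture; (?: ... ) groups without capturing.
        System.out.println(groupCount("(\"a\")(,(\"b\"))*"));    // 3
        System.out.println(groupCount("(\"a\")(?:,(\"b\"))*"));  // 2
    }
}
```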

From here, I see you are duplicating work again. Let's fix that. This also fixes a missing * in your original regex. See if you can find it.

final String QUOTED_VALUE = "\"[^\"]*\""; // A double quote character, zero or more non-double quote characters, and another double quote
final String NEWLINE = "(?:\r\n|\n|\r)";  // A newline for (almost) any OS: Windows, *NIX or Mac (non-capturing)
final String LINE = "(" + QUOTED_VALUE + ")" +   // Capture the first value
                    "(?:," +                       // Start a group, a comma
                      "(" + QUOTED_VALUE + ")" + // Capture the next value
                    ")*";                        // Close the group.  Allow zero or more of these
String regex = "^" +             // The beginning of the string
               LINE +            // Read the first line, capture its values
               "(?:" + NEWLINE + // Start a group for the remaining lines
                 LINE +            // Read more lines, capture their values
               ")*" +            // Close the group.  Allow zero or more
               NEWLINE + "$";    // A trailing newline, the end of the input
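
As a quick sanity check, the assembled pattern can be run against the sample from the question. This is a sketch under a few assumptions: the class name CsvRegexDemo and its helpers are invented for the example, and NEWLINE is written as a non-capturing group so that only values are numbered. One thing it makes visible: a capturing group inside a repetition keeps only its last match, so this pattern is fine for validating the file but cannot hand back every value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvRegexDemo {
    static final String QUOTED_VALUE = "\"[^\"]*\"";
    static final String NEWLINE = "(?:\r\n|\n|\r)";  // assumption: non-capturing variant
    static final String LINE = "(" + QUOTED_VALUE + ")" + "(?:,(" + QUOTED_VALUE + "))*";
    static final Pattern FILE = Pattern.compile(
            "^" + LINE + "(?:" + NEWLINE + LINE + ")*" + NEWLINE + "$");

    // The three lines from the question, each terminated by a newline.
    static final String SAMPLE =
            "\"group\",\"type\",\"scope\",\"name\",\"attribute\",\"value\"\n"
          + "\"c\",\"\",\"Probes Count\",\"Counter\",\"value\",\"35\"\n"
          + "\"b\",\"ProbeInformation\",\"Probes Count\",\"Gauge\",\"value\",\"0\"\n";

    // True if the whole content is quoted lines followed by a trailing newline.
    static boolean isValid(String content) {
        return FILE.matcher(content).matches();
    }

    // Returns group 4: the last non-first value of the last non-first line, or null.
    static String lastCaptured(String content) {
        Matcher m = FILE.matcher(content);
        return m.matches() ? m.group(4) : null;
    }

    public static void main(String[] args) {
        System.out.println(isValid(SAMPLE));         // true
        System.out.println(isValid(SAMPLE.trim()));  // false: trailing newline is missing
        System.out.println(isValid("\"a\",b\n"));    // false: b is not quoted
        System.out.println(lastCaptured(SAMPLE));    // "0" (with its quotes); earlier values are gone
    }
}
```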

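That's a bit easier to read, isn't it? Now you can test your big, nasty regex piece by piece if it doesn't work.

You can now compile the regex, get a matcher, and get the groups from it. You still have a few problems, though: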

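1) I said earlier that it would be easier to split on newlines. One reason: how do you determine how many values there are per line? Hard-coding it will work, but it will break as soon as your input changes. Maybe that's not a problem for you, but it's still bad practice. Another reason: the regex is still too complicated for my liking. You could really get away with stopping at LINE.

2) CSV files allow lines like this: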

"some text","123",456,"some more text"

To handle this, you might want to add another mini-regex that matches either a quoted value or a run of digits.
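
Both remaining points can be sketched together: split the input on newlines, validate each line on its own, and let a value be either quoted or a bare number. Everything named here (LineByLineCsv, VALUE, parseLine, parse) is hypothetical, the number syntax is a guess at what the data needs, and the split assumes no quoted value contains a line break.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineByLineCsv {
    // Assumed value syntax: a double-quoted string, or a bare integer/decimal number.
    static final String VALUE = "(?:\"[^\"]*\"|-?\\d+(?:\\.\\d+)?)";
    static final Pattern LINE = Pattern.compile(VALUE + "(?:," + VALUE + ")*");
    static final Pattern NEXT_VALUE = Pattern.compile(VALUE);

    // Validates one line, then collects its values (quotes are kept as-is).
    static List<String> parseLine(String line) {
        if (!LINE.matcher(line).matches()) {
            throw new IllegalArgumentException("Malformed line: " + line);
        }
        List<String> values = new ArrayList<>();
        Matcher m = NEXT_VALUE.matcher(line);
        while (m.find()) {
            values.add(m.group());
        }
        return values;
    }

    // Splits on any common line terminator and requires every row to have as many values as the first.
    static List<List<String>> parse(String content) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : content.split("\r\n|\n|\r")) {
            List<String> row = parseLine(line);
            if (!rows.isEmpty() && row.size() != rows.get(0).size()) {
                throw new IllegalArgumentException("Wrong number of values: " + line);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> row = parseLine("\"some text\",\"123\",456,\"some more text\"");
        System.out.println(row.size());  // 4
        System.out.println(row.get(2));  // 456
    }
}
```

Because the column count is taken from the first row, nothing about the width of the file is hard-coded.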

Answer 2

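This question: CSV parsing in Java points to an Apache library for parsing CSV.

If your format really is CSV, a regex will have a hard time parsing the data into records.

I know this doesn't answer your question directly, but you will probably have more success, with less effort, by using a CSV library.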
