Regular expression for parsing formatted numbers

Question

I am parsing documents that contain large amounts of formatted numbers, for example:

 Frc consts  --     1.4362                 1.4362                 5.4100
 IR Inten    --     0.0000                 0.0000                 0.0000
 Atom AN      X      Y      Z        X      Y      Z        X      Y      Z
    1   6     0.00   0.00   0.00     0.00   0.00   0.00     0.00   0.00   0.00
    2   1     0.40  -0.20   0.23    -0.30  -0.18   0.36     0.06   0.42   0.26

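These are individual lines, all with significant leading whitespace, and they may or may not have significant trailing whitespace. They consist of 72, 72, 78, 78, and 78 characters respectively. I can deduce the boundaries between fields. The lines can be described using Fortran format notation (nx = n spaces, an = n alphanumeric characters, in = an integer in n columns, fm.n = a float of m characters with n digits after the decimal point):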

 (1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
 (1x,a14,1x,f10.4,13x,f10.4,13x,f10.4)
 (1x,a4,a4,3(2x,3a7))
 (1x,2i4,3(2x,3f7.2))
 (1x,2i4,3(2x,3f7.2))

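I potentially have several thousand different formats (which I can auto-generate or derive) and want to describe each of them by regexes built from regexes for the components. So, if regf10_4 stands for a regex matching any string that satisfies the f10.4 constraint, I could create a regex of the form: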

COMMENTS 
      (\s
      .{14}
      \s
      regf10_4,
      \s{13}
      regf10_4,
      \s{13}
      regf10_4,
)
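The composition idea above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class `LinePatternBuilder` and its methods are hypothetical names, and the sketch captures each field purely by column width (`.{n}`) without checking that the captured text is a number, so a separate per-field check would still be needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: turns a list of column widths into one line regex. */
final class LinePatternBuilder {

    /**
     * widths[i] is the column count of item i; capture[i] says whether the
     * item is kept as a group (false models a Fortran "nx" skip).
     */
    static Pattern build(int[] widths, boolean[] capture) {
        StringBuilder sb = new StringBuilder("^");
        for (int i = 0; i < widths.length; i++) {
            String cols = ".{" + widths[i] + "}";
            sb.append(capture[i] ? "(" + cols + ")" : cols);
        }
        return Pattern.compile(sb.toString());
    }

    /** Returns the captured fields, or null if the line is too short. */
    static String[] split(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        if (!m.lookingAt()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int g = 1; g <= m.groupCount(); g++) {
            fields[g - 1] = m.group(g);
        }
        return fields;
    }

    public static void main(String[] args) {
        // (a1,f5.2,a1), keeping only the numeric field
        Pattern p = build(new int[] {1, 5, 1}, new boolean[] {false, true, false});
        String[] fields = split(p, "a-1.23b");
        System.out.println(p.pattern() + " -> [" + fields[0] + "]");
    }
}
```

Because the generated pattern only counts columns, the same builder could be driven by any of the format descriptions, with the numeric validation applied to the captured strings afterwards.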

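I would like to know whether there are regexes that satisfy this kind of re-use. Computers and humans create numbers compatible with f10.4 in a wide variety of ways. I believe the following are all legal Fortran input and/or output (I do not need suffixes of the form f or d, as in 12.4f). [The formatting in SO should be read as no leading blanks for the first line, one for the second, etc.]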

-1234.5678
 1234.5678
            // missing number
 12345678.
 1.
 1.0000000
    1.0000
        1.
 0.
        0.
     .1234
    -.1234
    1E2
    1.E2
    1.E02
  -1.0E-02
**********  // number over/underflow

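The regexes must be robust against the content of neighbouring fields, i.e. they must examine exactly 10 characters at an exact position and nothing else. Thus the following are legal for (a1,f5.2,a1):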

a-1.23b   // -1.23
- 1.23.   // 1.23
3 1.23-   // 1.23

I am using Java, so I need regex constructs compatible with Java 1.6 (e.g. no Perl-only extensions).
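As a rough illustration of the requirement, here is one way a single f-style field could be checked in Java 1.6-compatible code. It is a sketch, not a full treatment of Fortran input rules: the class `FortranFloatField` is a hypothetical name, the accepted shapes are taken only from the examples listed in the question (blank field, row of asterisks, optional sign, optional exponent), and the fixed width is enforced by slicing the line rather than inside the regex.

```java
import java.util.regex.Pattern;

/** Hypothetical checker for one fixed-width floating-point field. */
final class FortranFloatField {

    // Blanks, then an optional number, then blanks; or a field of asterisks.
    private static final Pattern FIELD = Pattern.compile(
        " *(?:[-+]?(?:[0-9]+\\.?[0-9]*|\\.[0-9]+)(?:[eE][-+]?[0-9]+)?)? *|\\*+");

    /** True if exactly `width` characters starting at `start` form a legal field. */
    static boolean isValid(String line, int start, int width) {
        if (start < 0 || start + width > line.length()) {
            return false;
        }
        return FIELD.matcher(line.substring(start, start + width)).matches();
    }

    /** Returns the value, or null for a blank or overflowed (asterisk) field. */
    static Double parse(String line, int start, int width) {
        if (!isValid(line, start, width)) {
            throw new IllegalArgumentException("not a numeric field at column " + start);
        }
        String text = line.substring(start, start + width).trim();
        if (text.length() == 0 || text.charAt(0) == '*') {
            return null;
        }
        return Double.valueOf(text);
    }

    public static void main(String[] args) {
        // The three (a1,f5.2,a1) examples: only columns 1..5 are examined.
        System.out.println(parse("a-1.23b", 1, 5));
        System.out.println(parse("- 1.23.", 1, 5));
        System.out.println(parse("3 1.23-", 1, 5));
    }
}
```

Slicing first keeps neighbouring characters such as the trailing `-` or `.` out of the match, which is the robustness property asked for above.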

Answer 1

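As I understand it, each line comprises one or more fixed-width fields, which may contain labels, spaces, or data of different kinds. If you know the widths and types of the fields, extracting their data is a simple matter of substring(), trim() and (optionally) Whatever.parseWhatever(). Regexes cannot make that job any easier; in fact, all they can do is make it a lot harder.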
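A minimal sketch of that substring(), trim() and parse approach, applied to the (1x,2i4,3(2x,3f7.2)) lines from the question. The class and method names are made up for illustration, and the sketch assumes every numeric field is filled in (a blank field would make the parse calls throw NumberFormatException).

```java
import java.util.Arrays;

/** Hypothetical parser for lines laid out as (1x,2i4,3(2x,3f7.2)). */
final class AtomLineParser {

    static final int LINE_WIDTH = 78; // 1 + 2*4 + 3*(2 + 3*7)

    /** Cuts out one fixed-width field and strips its padding. */
    static String field(String line, int start, int width) {
        return line.substring(start, start + width).trim();
    }

    /** Reads the two i4 columns that follow the single leading blank. */
    static int[] parseIndices(String line) {
        return new int[] {
            Integer.parseInt(field(line, 1, 4)),
            Integer.parseInt(field(line, 5, 4))
        };
    }

    /** Reads the nine f7.2 columns, arranged as three groups of three. */
    static double[] parseDisplacements(String line) {
        if (line.length() < LINE_WIDTH) {
            throw new IllegalArgumentException("expected " + LINE_WIDTH + " columns");
        }
        double[] values = new double[9];
        int pos = 1 + 4 + 4; // skip 1x and the two i4 fields
        int k = 0;
        for (int group = 0; group < 3; group++) {
            pos += 2; // the 2x gap in front of each group
            for (int i = 0; i < 3; i++) {
                values[k++] = Double.parseDouble(field(line, pos, 7));
                pos += 7;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String line =
            "    2   1     0.40  -0.20   0.23    -0.30  -0.18   0.36     0.06   0.42   0.26";
        System.out.println(Arrays.toString(parseIndices(line)));
        System.out.println(Arrays.toString(parseDisplacements(line)));
    }
}
```

The column arithmetic mirrors the format description directly, so a table of (start, width, type) entries derived from each format could drive the same loop.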

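Scanner does not really help, either. True, it has predefined regexes for various value types and does the conversions for you, but it still needs to be told which type to look for each time, and it needs the fields to be separated by a delimiter it can recognize. Fixed-width data, by definition, does not require delimiters. You might be able to fake the delimiters with a lookahead for however many characters should be left in the line, but that is just another way of making the job harder than it needs to be.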

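It also sounds like performance is going to be a major concern. Even if you could make a regex solution work, it might be too slow: not because regexes are inherently slow, but because of the contortions you would have to go through to make them fit the problem. I suggest you forget about regexes for this job.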

Answer 2

You could start with this and go from there.

This regex matches all of the numbers you have provided.
Unfortunately, it also matches the 3 in 3 1.23-

// [-+]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+)?
// 
// Match a single character present in the list “-+” «[-+]?»
//    Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match the regular expression below «(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)»
//    Match either the regular expression below (attempting the next alternative only if this one fails) «[0-9]+(?:\.[0-9]*)?»
//       Match a single character in the range between “0” and “9” «[0-9]+»
//          Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
//       Match the regular expression below «(?:\.[0-9]*)?»
//          Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
//          Match the character “.” literally «\.»
//          Match a single character in the range between “0” and “9” «[0-9]*»
//             Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//    Or match regular expression number 2 below (the entire group fails if this one fails to match) «\.[0-9]+»
//       Match the character “.” literally «\.»
//       Match a single character in the range between “0” and “9” «[0-9]+»
//          Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
// Match the regular expression below «(?:[eE][-+]?[0-9]+)?»
//    Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
//    Match a single character present in the list “eE” «[eE]»
//    Match a single character present in the list “-+” «[-+]?»
//       Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
//    Match a single character in the range between “0” and “9” «[0-9]+»
//       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
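// Needs java.util.regex.Matcher and java.util.regex.Pattern.
// `document` is the CharSequence (e.g. a String) holding the text to scan.
Pattern regex = Pattern.compile("[-+]?(?:[0-9]+(?:\\.[0-9]*)?|\\.[0-9]+)(?:[eE][-+]?[0-9]+)?");
Matcher matcher = regex.matcher(document);
while (matcher.find()) {
    // matched text: matcher.group()
    // match start: matcher.start()
    // match end: matcher.end()
}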
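The caveat about the stray 3 can be reproduced, and one possible way around it shown, with the sketch below. It assumes the column offsets of the field are already known and uses `Matcher.region` to confine the search to those columns; the class `RegionMatchDemo` and its method names are hypothetical. Note that `find()` inside a region still accepts a field that contains other junk next to the number, so this is looser than a whole-field `matches()` check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo: whole-text find() versus a search limited to one field. */
final class RegionMatchDemo {

    static final Pattern NUMBER = Pattern.compile(
        "[-+]?(?:[0-9]+(?:\\.[0-9]*)?|\\.[0-9]+)(?:[eE][-+]?[0-9]+)?");

    /** Every number found anywhere in the text, in order. */
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<String>();
        Matcher m = NUMBER.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    /** First number inside columns [start, start + width), or null if none. */
    static String findInField(String text, int start, int width) {
        Matcher m = NUMBER.matcher(text);
        m.region(start, start + width); // the search cannot leave these columns
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "3 1.23-"; // layout (a1,f5.2,a1)
        System.out.println(findAll(line));           // includes the neighbouring 3
        System.out.println(findInField(line, 1, 5)); // only the f5.2 columns
    }
}
```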

Answer 3

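This is only a partial answer, but I was alerted to Scanner in Java 1.5, which can scan text and interpret numbers, and whose documentation gives a BNF for the numbers it can scan and interpret. In principle, I imagine that the BNF could be used to construct a regex.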

3 solutions

Solution 1: 2, accepted, 2009-08-26 23:20:21

Solution 2: 1, 2009-08-26 11:23:08

Solution 3: 0, 2009-08-26 17:08:11
