Regex: Repeated capturing groups

I have to parse some tables from an ASCII text file. Here's a partial sample:

QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212
RECKITTBEN  192.50  209.00  192.50  201.80    5.21      34      2850     5.707
RUPALIINS   150.00  159.00  150.00  156.25    6.29       4        80      .125
SALAMCRST   164.00  164.75  163.00  163.25    -.45      80      8250    13.505
SINGERBD    779.75  779.75  770.00  773.00    -.89       8        95      .735
SONARBAINS   68.00   69.00   67.50   68.00     .74      11      3050     2.077

The table consists of 1 column of text and 8 columns of floating point numbers. I'd like to capture each column via regex.

I'm pretty new to regular expressions. Here's the faulty regex pattern I came up with:

(\S+)\s+(\s+[\d\.\-]+){8}

But the pattern captures only the first and the last columns. RegexBuddy also emits the following warning:

You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations.

I've consulted their help file, but I don't have a clue as to how to solve this.

How can I capture each column separately?

In C# (modified from this example):

// Requires: using System; using System.Text.RegularExpressions;
string input = "QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212";
string pattern = @"^(\S+)\s+(\s+[\d.-]+){8}$";
Match match = Regex.Match(input, pattern, RegexOptions.Multiline);
if (match.Success) {
   Console.WriteLine("Matched text: {0}", match.Value);
   for (int ctr = 1; ctr < match.Groups.Count; ctr++) {
      Console.WriteLine("   Group {0}:  {1}", ctr, match.Groups[ctr].Value);
      int captureCtr = 0;
      foreach (Capture capture in match.Groups[ctr].Captures) {
         Console.WriteLine("      Capture {0}: {1}", 
                           captureCtr, capture.Value);
         captureCtr++; 
      }
   }
}

Output:

Matched text: QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212
...
    Group 2:      1.212
         Capture 0:  11.00
         Capture 1:    11.10
         Capture 2:    11.00
...etc.

Unfortunately, you need to repeat the (…) 8 times to get each column separately:

^(\S+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)$
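
Writing the group out eight times by hand is error-prone, so here is a minimal Python sketch that assembles the same pattern from a repeated string fragment. The names `NUM`, `pattern`, and `columns` are illustrative assumptions, not part of the original answer. Because the fragment is repeated as text before compilation, the regex engine sees eight distinct numbered groups rather than one group with a quantifier.

```python
import re

# A single numeric column, as in the pattern above.
NUM = r"([-.\d]+)"

# One group for the name, then eight separately numbered numeric groups.
pattern = re.compile(r"^(\S+)" + (r"\s+" + NUM) * 8 + r"$")

line = "QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212"
match = pattern.match(line)

# groups() returns the name followed by the eight numeric columns.
columns = match.groups() if match else ()
print(columns)
```

Each column is then available by index, for example `columns[0]` for the name and `columns[8]` for the last number.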

If you can use code, you can first match the numeric columns as a whole:

>>> import re
>>> rx1 = re.compile(r'^(\S+)\s+((?:[-.\d]+\s+){7}[-.\d]+)$', re.M)
>>> allres = rx1.findall(theAsciiText)

then split the columns by spaces:

>>> [[p] + q.split() for p, q in allres]
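
For reference, the two steps can be put together as a self-contained script. This is only a sketch: `theAsciiText` here is a two-line stand-in for the real file contents, and the conversion to `float` is an added assumption about what you want to do with the values.

```python
import re

# Stand-in for the file contents; in practice this would be read from disk.
theAsciiText = """\
QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212
RECKITTBEN  192.50  209.00  192.50  201.80    5.21      34      2850     5.707
"""

# Group 1 is the name; group 2 is the whole run of eight numbers.
rx1 = re.compile(r'^(\S+)\s+((?:[-.\d]+\s+){7}[-.\d]+)$', re.M)
allres = rx1.findall(theAsciiText)

# Split the numeric run on whitespace to recover the individual columns.
rows = [[p] + q.split() for p, q in allres]

# Optional: convert the numeric columns; float() accepts forms like "-.90".
typed_rows = [(row[0], [float(x) for x in row[1:]]) for row in rows]

for row in rows:
    print(row)
```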

If you want to know why the warning appears, it's because your capture group matches multiple times (8, as you specified) but the capture variable can only have one value. It is assigned the last value matched.
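
The behaviour is easy to reproduce. The following sketch runs the pattern from the question through Python's standard `re` module (an assumption, since the question does not name a language); the variable names are illustrative.

```python
import re

line = "QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212"

# The pattern from the question: group 2 is repeated 8 times.
match = re.match(r"(\S+)\s+(\s+[\d\.\-]+){8}", line)

# Only two groups exist, however many times group 2 iterated.
print(match.re.groups)

# Group 2 holds the text of its final iteration only (with leading spaces).
last_iteration = match.group(2)
print(repr(last_iteration))
```

The first seven iterations did match, but each new iteration overwrote the value stored for group 2.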

As described in question 1313332, retrieving these multiple matches is generally not possible with a regular expression, although .NET and Perl 6 have some support for it.

The warning suggests that you could put another group around the whole set, like this:

(\S+)\s+((\s+[\d\.\-]+){8})

You would then be able to see all the columns, but of course they would not be separated. Because it's generally not possible to capture them separately, the more common intention is to capture all of it, and the warning helps remind you of this.
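
As a rough illustration in Python (an assumed choice of language, with illustrative variable names), the outer group returns the whole numeric run as one string, and a plain whitespace split in code recovers the individual columns:

```python
import re

line = "QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212"

# Group 2 wraps the repeated group; group 3 is the repeated group itself.
match = re.match(r"(\S+)\s+((\s+[\d\.\-]+){8})", line)

all_numbers = match.group(2)   # every numeric column, as one unseparated string
last_only = match.group(3)     # still just the final iteration

# The separation happens in code, not in the regex.
columns = all_numbers.split()
print(columns)
```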
