Regex grouping and optional matches

First, a disclaimer: I'm not strong with regex. I am building a regex that uses named groups and optional components. The problem is that I need to match a number that can appear in two different places and give both captures the same group name. This does not appear to work.

The specifics: I am analyzing a garbage collection log from a JVM. The two lines in question are a full GC and a regular GC.

I broke these up to make them readable.

Full line:

229980.058: [Full GC 229980.058: 
            [CMS: 2796543K->2796543K(2796544K), **13.3050667** secs]
            2983863K->2872464K(4067264K), 
            [CMS Perm : 325367K->325242K(1048576K)], 13.3054416 secs] 
            [Times: user=13.27 sys=0.03, real=13.31 secs] 

Regular line:

2.752: [GC 2.752: 
       [ParNew: 1143680K->4938K(1270720K), **0.0243534** secs] 
       1143686K->4945K(4067264K), 0.0245283 secs] 
       [Times: user=0.05 sys=0.02, real=0.03 secs] 

As you can see, the Full GC line has a CMS/tenured generation section as its first field area. The second one does not have this, as it's just the regular collection.

To capture both correctly, I've made the "CMS:" and "ParNew:" sections optional. However, I want to pull the time out of each under one group name (the values I put ** around).

I'm using this regex:

\\d+.\\d+: [(Full\\s)?GC\\s\\d+.\\d+: [(CMS:\\s(?<JVM_TenuredGenHeapUsedBeforeGC>\\d+)+K->(?<JVM_TenuredHeapUsedAfterGC>\\d+)K(\\d+K),\\s(?<JVM_GCTimeTaken>\\d+.\\d+)\\ssecs)? (ParNew:\\s(?\\d+)+K->(?<JVM_NewGenHeapUsedAfterGC>\\d+)K((?<JVM_NewGenHeapSize>\\d+)K),\\s(?<JVM_GCTimeTaken>\\d+.\\d+)\\ssecs)?] .. [edited for brevity]

In short: is it possible to use the same group name on different optional matches? They will never both appear on the same line, so I don't see why I can't pull this off.

Testing this with regexr also seems to fail. Thanks!

"The issue I have, is that I need to match a certain number in two different areas, and give them the same group name."

I'd say that's the problem. I haven't tried this, but I saw the change list that introduced named groups in Java, and a name is just a label on a numbered group. Two groups have two different numbers, so one name can't refer to both.
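That a name is only a label on a numbered group can be checked directly. The following is a minimal sketch, assuming java.util.regex; the class name, the pattern and the group name secs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupIsNumbered {
    // Returns true when looking the group up by name and by number
    // yields the same captured text.
    static boolean sameCapture(String input) {
        Pattern p = Pattern.compile("(?<secs>\\d+\\.\\d+) secs");
        Matcher m = p.matcher(input);
        if (!m.find()) {
            return false;
        }
        // "secs" is the first (and only) capturing group, so it is also group 1.
        return m.group("secs").equals(m.group(1));
    }

    public static void main(String[] args) {
        System.out.println(sameCapture("13.3050667 secs")); // prints true
    }
}
```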

Give them different names and use something like Guava's

MoreObjects.firstNonNull(m.group("foo"), m.group("bar"))

(older Guava versions had this as Objects.firstNonNull) if you're sure that at least one of them is non-null; otherwise you get an NPE. Or write your own null-accepting one-liner.
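A null-accepting helper could look roughly like this. It is only a sketch: the class FirstGroupDemo, the method firstGroup and the toy pattern are hypothetical, and foo and bar are the placeholder group names from above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstGroupDemo {
    // Hypothetical helper: returns the first named group that took part
    // in the match, or null if none of them did.
    static String firstGroup(Matcher m, String... names) {
        for (String name : names) {
            String value = m.group(name);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    // Matches the whole input against a toy pattern with two alternatives.
    static String firstOf(String input) {
        Matcher m = Pattern.compile("(?<foo>a+)|(?<bar>b+)").matcher(input);
        return m.matches() ? firstGroup(m, "foo", "bar") : null;
    }

    public static void main(String[] args) {
        System.out.println(firstOf("aa"));  // prints aa
        System.out.println(firstOf("bbb")); // prints bbb
    }
}
```

Unlike firstNonNull, this returns null instead of throwing when no group matched, so the caller has to handle that case.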

A little experimentation shows that Java does not allow you to define the same capturing group name twice within a regex. The following code throws the exception shown below it:

import java.util.regex.Pattern;

public class NamedCapturingGroupMain {
    public static void main(String[] args) {
        // Compiling fails: the name "mygroup" is defined twice.
        Pattern p = Pattern.compile("(?<mygroup>a)|(?<mygroup>b)");
    }
}

Exception:

Exception in thread "main" java.util.regex.PatternSyntaxException: Named capturing group <mygroup> is already defined near index 24

The easiest thing to do here would probably be to define two different capturing group names, and use the second one if the first one is null. For example, you could name them "JVM_GCTimeTakenFull" and "JVM_GCTimeTakenPartial" and then do something like:

String gcTimeTaken = matcher.group("JVM_GCTimeTakenFull");
if (gcTimeTaken == null) {
    gcTimeTaken = matcher.group("JVM_GCTimeTakenPartial");
}
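Put together against the two log lines from the question, the idea might look like the sketch below. Two things are assumptions: the pattern is cut down to the first bracketed section only, and the group names are shortened to gcTimeFull and gcTimePartial, because java.util.regex accepts only ASCII letters and digits in group names (no underscores). The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GcTimeTwoGroups {
    static final String FULL_LINE =
        "229980.058: [Full GC 229980.058: [CMS: 2796543K->2796543K(2796544K), 13.3050667 secs] "
        + "2983863K->2872464K(4067264K), [CMS Perm : 325367K->325242K(1048576K)], 13.3054416 secs] "
        + "[Times: user=13.27 sys=0.03, real=13.31 secs]";
    static final String REGULAR_LINE =
        "2.752: [GC 2.752: [ParNew: 1143680K->4938K(1270720K), 0.0243534 secs] "
        + "1143686K->4945K(4067264K), 0.0245283 secs] [Times: user=0.05 sys=0.02, real=0.03 secs]";

    // One alternative per collector, each with its own group name for the time.
    private static final Pattern FIRST_SECTION = Pattern.compile(
        "\\[(?:CMS:\\s*\\d+K->\\d+K\\(\\d+K\\),\\s*(?<gcTimeFull>\\d+\\.\\d+)"
        + "|ParNew:\\s*\\d+K->\\d+K\\(\\d+K\\),\\s*(?<gcTimePartial>\\d+\\.\\d+))\\s*secs\\]");

    // Returns the time from whichever alternative matched, or null if neither did.
    static String gcTime(String line) {
        Matcher matcher = FIRST_SECTION.matcher(line);
        if (!matcher.find()) {
            return null;
        }
        String gcTimeTaken = matcher.group("gcTimeFull");
        if (gcTimeTaken == null) {
            gcTimeTaken = matcher.group("gcTimePartial");
        }
        return gcTimeTaken;
    }

    public static void main(String[] args) {
        System.out.println(gcTime(FULL_LINE));    // prints 13.3050667
        System.out.println(gcTime(REGULAR_LINE)); // prints 0.0243534
    }
}
```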

Edit: I missed the Java tag. If Java doesn't allow duplicate names (and I know it doesn't support branch reset), you could do this instead, then test which of Full_GC or GC, and which of CMS or ParNew, matched. That tells you how to interpret the groups that follow.

Either way, you only need one JVM_GCTimeTaken group.

 # "\\d+\\.\\d+:\\s*\\[(?:(?<Full_GC>Full\\s*GC)|(?<GC>GC))\\s*(?<GC_Val>\\d+\\.\\d+):\\s*\\[(?:(?<CMS>CMS)|(?<ParNew>ParNew)):\\s*(?<HeapUsedBefore>\\d+)K->(?<HeapUsedAfter>\\d+)K\\((?<NewHeapSize>\\d+)K\\),\\s*(?<JVM_GCTimeTaken>\\d+\\.\\d+)\\s*secs\\]"


 \d+ \. \d+ : \s* 
 \[
     (?:
          (?<Full_GC> Full \s* GC )     # (1)
       |  (?<GC> GC )               # (2)
     )
     \s* 
     (?<GC_Val> \d+ \. \d+ )            # (3)
     : 
     \s* 
 \[
     (?:
          (?<CMS> CMS )                 # (4)
       |  (?<ParNew> ParNew )           # (5)
     )
     : \s* 
     (?<HeapUsedBefore> \d+ )           # (6)
     K->
     (?<HeapUsedAfter> \d+ )            # (7)
     K
     \(
     (?<NewHeapSize> \d+ )              # (8)
     K
     \)
     , \s* 
     (?<JVM_GCTimeTaken> \d+ \. \d+ )   # (9)
     \s* 
     secs
 \]
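Used from Java, this could look like the sketch below. One adjustment is needed: java.util.regex accepts only ASCII letters and digits in group names, so the underscores are dropped here (fullGc, gcVal, gcTimeTaken and so on). The class name, the summarize method and its output format are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GcLineParser {
    static final String FULL_LINE =
        "229980.058: [Full GC 229980.058: [CMS: 2796543K->2796543K(2796544K), 13.3050667 secs] "
        + "2983863K->2872464K(4067264K), [CMS Perm : 325367K->325242K(1048576K)], 13.3054416 secs] "
        + "[Times: user=13.27 sys=0.03, real=13.31 secs]";
    static final String REGULAR_LINE =
        "2.752: [GC 2.752: [ParNew: 1143680K->4938K(1270720K), 0.0243534 secs] "
        + "1143686K->4945K(4067264K), 0.0245283 secs] [Times: user=0.05 sys=0.02, real=0.03 secs]";

    // Same structure as the regex above, with underscore-free group names.
    private static final Pattern GC_LINE = Pattern.compile(
        "\\d+\\.\\d+:\\s*\\[(?:(?<fullGc>Full\\s*GC)|(?<gc>GC))\\s*(?<gcVal>\\d+\\.\\d+):\\s*"
        + "\\[(?:(?<cms>CMS)|(?<parNew>ParNew)):\\s*"
        + "(?<heapUsedBefore>\\d+)K->(?<heapUsedAfter>\\d+)K\\((?<heapSize>\\d+)K\\),\\s*"
        + "(?<gcTimeTaken>\\d+\\.\\d+)\\s*secs\\]");

    // Reports which alternatives matched plus the single time group.
    static String summarize(String line) {
        Matcher m = GC_LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        boolean full = m.group("fullGc") != null;
        String collector = m.group("cms") != null ? "CMS" : "ParNew";
        return "full=" + full + " collector=" + collector + " time=" + m.group("gcTimeTaken");
    }

    public static void main(String[] args) {
        System.out.println(summarize(FULL_LINE));    // prints full=true collector=CMS time=13.3050667
        System.out.println(summarize(REGULAR_LINE)); // prints full=false collector=ParNew time=0.0243534
    }
}
```

Because the collector name is captured by its own pair of alternatives, the time needs only one group, and the null checks on fullGc and cms say which kind of line was read.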

