
Parsing - Adding a capturing group

I am attempting to use a fairly complex regular expression (see the regex101 demos below), which I adapted slightly from one written by an expert on this site. It parses specific patterns of log events:

  • 1 EXE_IN 1 EXE_CO 2 CONTENT_ACCESS 3 CONTENT_ACCESS

These log sequences always begin with a run of EXE_IN or EXE_CO events, each preceded by a sequence number. The run can be of any length and in any order: here there are just two EXE events, but there may be 200, or just 1. Note that each event has a sequence number, which we need to capture.

The second part of the sequence is always a series of digit-prefixed CONTENT_ACCESS events, again one or more with no upper limit.

The following demo shows a working example and probably conveys the concept better than I can: Demo 1

It nicely captures a full match, sequence number, and event in separate groups.

I need to add a timestamp to the pattern (after the sequence number, preceded by an underscore), and then parse an event log such as:

  • 1_11/08/2014 23:03EXE_IN1_11/08/2014 23:03EXE_CO2_12/08/2014 09:17CONTENT_ACCESS3_13/08/2014 09:17CONTENT_ACCESS

I need to capture the timestamps as well.

I attempted to adjust the regex, with mixed results. Please see this demo: demo2

Ideally I'd like to see something like this for each event:

Match n
Full match  266-308 `2_12/08/2014 09:17CONTENT_ACCESS`
Group 1. 266-267    `2`
Group 2. 268-284    `12/08/2014 09:17`
Group 3. 284-308    `CONTENT_ACCESS`

I hope you can help me. Testing with PCRE on regex101 is sufficient (for the record, I am using the Perl-compatible str_match_all_perl function in R).

Many thanks in advance.

(\\d+)_(.*?)(EXE_CO|EXE_IN|CONTENT_ACCESS)

https://regex101.com/r/EHHcKm/1
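As a quick check of what the three groups capture, here is a minimal sketch in Python (an assumption for illustration; the question itself uses R). It applies the same pattern, with single backslashes in a raw string, to the sample log line from the question. The names `log`, `EVENT` and `events` are made up for this example.

```python
import re

# Sample log line taken from the question.
log = ("1_11/08/2014 23:03EXE_IN1_11/08/2014 23:03EXE_CO"
       "2_12/08/2014 09:17CONTENT_ACCESS3_13/08/2014 09:17CONTENT_ACCESS")

# Group 1: sequence number, group 2: timestamp (lazy), group 3: event name.
EVENT = re.compile(r"(\d+)_(.*?)(EXE_CO|EXE_IN|CONTENT_ACCESS)")

# One (sequence, timestamp, event) tuple per event in the line.
events = [m.groups() for m in EVENT.finditer(log)]

for seq, ts, name in events:
    print(seq, ts, name)
```

Note that this simple pattern does not check that the EXE events come before the CONTENT_ACCESS events; it only splits each event into its parts.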

Following the comments, it was changed to (?:\\G(?!^)(?(?=\\d+_\\d{2}\\/\\d{2}\\/\\d{4}\\s\\d{2}\\:\\d{2}(?:EXE_CO|EXE_IN))(?<!\\d_\\d{2}\\/\\d{2}\\/\\d{4}\\s\\d{2}\\:\\d{2}CONTENT_ACCESS))|(?=(?:\\d+_\\d{2}\\/\\d{2}\\/\\d{4}\\s\\d{2}\\:\\d{2}(?:EXE_CO|EXE_IN))+(?:\\d+_\\d{2}\\/\\d{2}\\/\\d{4}\\s\\d{2}\\:\\d{2}CONTENT_ACCESS)+))(\\d+)_(\\d{2}\\/\\d{2}\\/\\d{4}\\s\\d{2}\\:\\d{2})(EXE_CO|EXE_IN|CONTENT_ACCESS)

https://regex101.com/r/EHHcKm/3
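The lookahead in the pattern above appears to allow matching to start only where one or more EXE events are followed by one or more CONTENT_ACCESS events. Python's built-in `re` module has no `\G` anchor and no lookahead conditionals, so that pattern does not port directly. The following is a rough two-pass sketch of the same idea, not an equivalent of the PCRE version: first find whole runs of the required shape, then split each run into events. All names here (`TS`, `EXE`, `CONTENT`, `RUN`, `EVENT`, `parse_runs`) are hypothetical.

```python
import re

# Timestamp shape used in the question: dd/mm/yyyy hh:mm
TS = r"\d{2}/\d{2}/\d{4}\s\d{2}:\d{2}"
EXE = rf"\d+_{TS}(?:EXE_CO|EXE_IN)"
CONTENT = rf"\d+_{TS}CONTENT_ACCESS"

# Pass 1: a run is one or more EXE events followed by one or more CONTENT_ACCESS events.
RUN = re.compile(rf"(?:{EXE})+(?:{CONTENT})+")
# Pass 2: split a run into (sequence number, timestamp, event name).
EVENT = re.compile(rf"(\d+)_({TS})(EXE_CO|EXE_IN|CONTENT_ACCESS)")


def parse_runs(text):
    """Return one list of (seq, timestamp, event) tuples per valid run in text."""
    return [[m.groups() for m in EVENT.finditer(run.group(0))]
            for run in RUN.finditer(text)]


log = ("1_11/08/2014 23:03EXE_IN1_11/08/2014 23:03EXE_CO"
       "2_12/08/2014 09:17CONTENT_ACCESS3_13/08/2014 09:17CONTENT_ACCESS")

# A leading CONTENT_ACCESS event with no EXE event before it is skipped.
runs = parse_runs("9_01/01/2015 10:00CONTENT_ACCESS" + log)
```

Keeping the run check separate from the per-event split trades one long pattern for two shorter ones, which may be easier to adjust if the event names or timestamp format change.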

And another, shorter version: (?:\\G(?!^)(?(?=\\d+_.{16}(?:EXE_CO|EXE_IN))(?<!\\d_.{16}CONTENT_ACCESS))|(?=(?:\\d+_.{16}(?:EXE_CO|EXE_IN))+(?:\\d+_.{16}CONTENT_ACCESS)+))(\\d+)_(.{16})(EXE_CO|EXE_IN|CONTENT_ACCESS)

https://regex101.com/r/EHHcKm/4

An even shorter one: (?:\\G(?!^)(?(?=\\d+_.{16}E)(?<!S))|(?=(?:\\d+_.{16}(?:EXE_CO|EXE_IN))+\\d+_.{16}C))(\\d+)_(.{16})(EXE_CO|EXE_IN|CONTENT_ACCESS)

https://regex101.com/r/EHHcKm/5

And super short (?:\\G|(?=\\d+_.{16}E.*CON))(\\d+)_(.*?)(EXE_CO|EXE_IN|CONTENT_ACCESS)

https://regex101.com/r/EHHcKm/8
