简体   繁体   中英

How to match over multiple lines in Perl when reading input from the pipe

I want to extract data from log files. That data span over multiple lines. The starting line contains a timestamp, a thread id, and some other relevant attributes. The ending line contains the thread id and a elapsed time how long that thread worked was busy. The results should be written into a CVS file for statistical analysis. Each result row should look like: timestamp of the starting line, thread id, the other relevant attributes and the elapsed time.

A 'rearranged' log snipped looks like follows:

2015-08-01 12:23:21.123 | DEBUG | Thread-10 | Received message with id 1234 - com.example.product.module.Receiver#123
2015-08-01 12:23:21.123 | DEBUG | Thread-10 | some log message #1 - com.example.product.module.Helper1#123
2015-08-01 12:23:21.123 | DEBUG | Thread-10 | some log message #2 - com.example.product.module.Helper2#123
2015-08-01 12:23:21.123 | DEBUG | Thread-10 | some log message #3 - com.example.product.module.Helper3#123
2015-08-01 12:23:21.123 | DEBUG | Thread-10 | some log message #4 - com.example.product.module.Helper4#123
2015-08-01 12:23:21.224 | DEBUG | Thread-10 | Message processed in 101ms - com.example.product.module.Receiver#130

2015-08-01 12:24:21.123 | DEBUG | Thread-11 | Received message with id 2345 - com.example.product.module.Receiver#123
2015-08-01 12:24:21.123 | DEBUG | Thread-11 | some log message #1 - com.example.product.module.Helper1#123
2015-08-01 12:24:21.123 | DEBUG | Thread-11 | some log message #2 - com.example.product.module.Helper2#123
2015-08-01 12:24:21.123 | DEBUG | Thread-11 | some log message #3 - com.example.product.module.Helper3#123
2015-08-01 12:24:21.123 | DEBUG | Thread-11 | some log message #4 - com.example.product.module.Helper4#123
2015-08-01 12:24:21.225 | DEBUG | Thread-11 | Message processed in 102ms - com.example.product.module.Receiver#130

2015-08-01 12:25:21.123 | DEBUG | Thread-12 | Received message with id 3456 - com.example.product.module.Receiver#123
2015-08-01 12:25:21.123 | DEBUG | Thread-12 | some log message #1 - com.example.product.module.Helper1#123
2015-08-01 12:25:21.123 | DEBUG | Thread-12 | some log message #2 - com.example.product.module.Helper2#123
2015-08-01 12:25:21.123 | DEBUG | Thread-12 | some log message #3 - com.example.product.module.Helper3#123
2015-08-01 12:25:21.123 | DEBUG | Thread-12 | some log message #4 - com.example.product.module.Helper4#123
2015-08-01 12:25:21.226 | DEBUG | Thread-12 | Message processed in 103ms - com.example.product.module.Receiver#130

But in reality those log messages are mixed, as the application runns multiple threads at the same time. So the real log looks like:

2015-08-01 12:23:21.123 | DEBUG | Thread-10 | Received message with id 1234 - com.example.product.module.Receiver#123
2015-08-01 12:23:21.123 | DEBUG | Thread-10 | some log message #1 - com.example.product.module.Helper1#123
2015-08-01 12:23:21.123 | DEBUG | Thread-10 | some log message #2 - com.example.product.module.Helper2#123
2015-08-01 12:24:21.123 | DEBUG | Thread-11 | Received message with id 2345 - com.example.product.module.Receiver#123
2015-08-01 12:23:21.123 | DEBUG | Thread-10 | some log message #3 - com.example.product.module.Helper3#123
2015-08-01 12:24:21.123 | DEBUG | Thread-11 | some log message #1 - com.example.product.module.Helper1#123
2015-08-01 12:25:21.123 | DEBUG | Thread-12 | Received message with id 3456 - com.example.product.module.Receiver#123
2015-08-01 12:25:21.123 | DEBUG | Thread-12 | some log message #1 - com.example.product.module.Helper1#123
2015-08-01 12:24:21.123 | DEBUG | Thread-11 | some log message #2 - com.example.product.module.Helper2#123
2015-08-01 12:25:21.123 | DEBUG | Thread-12 | some log message #2 - com.example.product.module.Helper2#123
2015-08-01 12:23:21.123 | DEBUG | Thread-10 | some log message #4 - com.example.product.module.Helper4#123
2015-08-01 12:24:21.123 | DEBUG | Thread-11 | some log message #3 - com.example.product.module.Helper3#123
2015-08-01 12:23:21.224 | DEBUG | Thread-10 | Message processed in 101ms - com.example.product.module.Receiver#130
2015-08-01 12:24:21.224 | DEBUG | Thread-11 | some log message #4 - com.example.product.module.Helper4#123
2015-08-01 12:25:21.224 | DEBUG | Thread-12 | some log message #3 - com.example.product.module.Helper3#123
2015-08-01 12:25:21.224 | DEBUG | Thread-12 | some log message #4 - com.example.product.module.Helper4#123
2015-08-01 12:24:21.225 | DEBUG | Thread-11 | Message processed in 102ms - com.example.product.module.Receiver#130
2015-08-01 12:25:21.226 | DEBUG | Thread-12 | Message processed in 103ms - com.example.product.module.Receiver#130

I crafted a regular expression which is able to perform matches on the 'rearranged' log:

/(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}).*(Thread-\d+).*Received message with id (\d+) [\s\S]+?\2.*Message processed in (\d+)ms/g

The desired result when printing the captured groups ( print "$1;$2;$3;$4\\n"; ) is:

2015-08-01 12:23:21.123;Thread-10;1234;101
2015-08-01 12:24:21.123;Thread-11;2345;102
2015-08-01 12:25:21.123;Thread-12;3456;103

When i use www.regexr.com to try those examples, the run on the rearranged log snipped gives three matches.

My first problem now is: when i want to use the regex in an Perl one-liner i am not able to perform the match over the multiple lines. I think it has something to do with the -n switch which causes to add a loop around the perl code, and the perl code being executed on each line separately.

cat files.log | perl -ne 'next LINE unless /(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}).*(Thread-\\d+).*Received message with id (\\d+) [\\s\\S]+?\\2.*Message processed in (\\d+)ms/gm; print "$1;$2;$3;$4\\n";'

The second problem i face is, that on the real log files where the log files are not so nicely arranged i can not extract all possible matches. In the given snipped it is only possible to match one result, not all three which are there.

I tried things like setting the record seperator in the Perl command to undef $\\=undef; , remove the next LINE unless ...

Can anybody point me in a direction which might help?

The logfiles can get rather big (~200mb), so combining all line into one huge String does not seem to be a good approach, although i haven't tried it yet.

Hashes are usually used for grouping.

my %buf;
while (<>) {
   chomp;
   my @fields = split / \| /;
   my $id = $fields[2];
   push @{ $buf{$id} }, \@fields;
   process(@{ delete($buf{$id}) })
      if $fields[3] =~ /^Message processed in /;
}

process is called for each group of rows. Each of process 's parameters is a row, as a reference to an array of fields.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM