Optimizing log regex in Powershell

We have 2 SMTP gates that spew out text .log files (usually around 10-30MB a pop) for about a week's worth of data. In total, the logs from both are usually around 1.2GB in size.

I have two read-only shares set up to the log directories and am trying to parse log entries using Select-String (e.g. say I wanted to see if an email by "bdole" came in). If all I wanted was to simply get hits on line numbers, it's not that bad.

However, I want to get the entire "log entry". My initial research says I need to read the entire log's contents at once and then do a regex against that. So, that's what I'm doing, for nearly 200 files.

However, I don't think it's the I/O that is the real issue. I'm queuing ~200 threads (one for each file), capped at 20 running at a time. The initial 20 threads take some time to run. I put in some debugging code and went back to single-thread; it seems that simply regexing the contents of one 10-20MB file takes a LONG time.

I suspect that the regex I have written is very inefficient (it works, in the sense that if I let it run overnight I get correct results). Plus, network I/O is pretty low (peaking at 0.6% of a 2 Gbps connection) while CPU/RAM usage is extremely high.

Ideal log entries look like this:

---- SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

The only reliable delimiter is the starting ---- (an entry sometimes ends with a trailing ---- and sometimes doesn't).

The contents of the "log entries" can be extremely variable, including notices of blocked connections, etc.

The regex I am using:

(?sm)----((?!----).*?)(log entry)((?!----).*?)(#USERINPUT#)((?!----).*?)----

where #USERINPUT# is being replaced by what is passed to the script.
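
Because the search term is spliced straight into the pattern, any regex metacharacters in it (for example the dot in an email address) would be read as pattern syntax. A minimal sketch of escaping it first; the variable names `$template` and `$userInput` and the sample address are assumptions for illustration, not part of the original script:

```powershell
# Hypothetical names: $template holds the pattern with its placeholder,
# $userInput is whatever was passed to the script.
$template  = '(?sm)----((?!----).*?)(log entry)((?!----).*?)(#USERINPUT#)((?!----).*?)----'
$userInput = 'bdole@foo.com'

# [regex]::Escape turns metacharacters such as '.' into literals ('\.').
$escaped = [regex]::Escape($userInput)

# String.Replace is a literal (non-regex) replacement, so the backslash survives.
$regex = $template.Replace('#USERINPUT#', $escaped)
$regex
```

Without the escape, `bdole@foo.com` would also match `bdole@fooXcom`, which is probably harmless here but can matter for inputs containing `(`, `[` or `*`.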

Parsing code, run for each path after getting a list of file paths using gci:

if ( !(Test-Path $path) ) {
    Write-Error "issue accessing $path"
} else {
    try {
        # Read the whole file at once so the regex can match across lines.
        $buffer = [io.file]::ReadAllText($path)
    }
    catch {
        $errArray += $path
        $_
    }
    [string[]]$matchBuffer = @()
    $matchBuffer += $entrySeperator
    $matchBuffer += $_
    $matchBuffer += $entrySeperator
    $matchBuffer += $buffer | Select-String $regex -AllMatches |
        % {$_.Matches} |
        % {$_.Value; $entrySeperator}

    if ($errArray) {
        Write-Warning "There were errors, probably in accessing files. "
        $errArray
    }

    # Save the matches per file, then also emit them as text.
    $fileName = (gi $path).Name
    sc -path $tmpDir\$fileName -value $matchBuffer
    $matchBuffer | Out-String
}

I'm almost wondering if parsing the "hits" (eg XXXX.LOG on LINE 21) and working backwards reconstructing the log entry from context would be faster/better.
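
A rough sketch of what that "work backwards from the hit" idea could look like, assuming the file's lines are already in an array (for a real file they might come from something like `[IO.File]::ReadAllLines`). The sample lines, `$needle`, and the other variable names are made up for illustration:

```powershell
# Stand-in for the lines of one log file (made-up sample content).
$lines = @(
    '---- SMTPRS log entry made at 01/01/2024 10:00:00'
    'Incoming SMTP call from 10.0.0.1 at 10:00:00.'
    '>>> MAIL FROM:<bdole@foo.com>'
    '---- SMTPRS log entry made at 01/01/2024 10:00:05'
    'Incoming SMTP call from 10.0.0.2 at 10:00:05.'
    '>>> QUIT'
)
$needle    = 'bdole'
$ordinal   = [System.StringComparison]::Ordinal
$lastBegin = -1

$entries = @(for ($i = 0; $i -lt $lines.Count; $i++) {
    # Plain substring test; no regex is needed to find the hit itself.
    if ($lines[$i].IndexOf($needle, [System.StringComparison]::OrdinalIgnoreCase) -lt 0) { continue }

    # Walk backwards to the ---- header that opens this entry.
    $begin = $i
    while (($begin -gt 0) -and -not $lines[$begin].StartsWith('----', $ordinal)) { $begin-- }

    # Several hits inside one entry should only emit it once.
    if ($begin -eq $lastBegin) { continue }
    $lastBegin = $begin

    # Walk forwards to the last line before the next header (or end of file).
    $end = $i
    while (($end + 1 -lt $lines.Count) -and -not $lines[$end + 1].StartsWith('----', $ordinal)) { $end++ }

    $lines[$begin..$end] -join "`n"
})
$entries
```

With the sample data this yields only the first entry. The expensive multi-line regex is avoided entirely; the cost is one linear scan plus a short walk around each hit.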

Description

You have a couple of problems with your expression:

  • By including the ---- at both the start and the end of your match, each match consumes the ---- that opens the following entry, so you can miss the next entry in the log, and you'll miss the last entry if the log doesn't end with ----.
  • With your construct ((?!----).*?) it looks like you're trying to stop .*? from matching across a ----. However, the lookahead only checks once that the next 4 characters aren't ----, and then .*? is free to match anything. You would be better off replacing this construct with ((?:(?!----).)*). Since this construct is self-terminating, you don't need the ? to prevent greediness. The bad news is that it is slightly less efficient than simply using ([^\r\n]*?) to match known fields in the first line and (.*?)(?=^----|\Z) to match the body of the log entry.
  • Assuming that the reliable text ---- will always be at the start of a line, you can also include the start-of-line anchor ^.

(?m)^----\s(.*?)\s(log\sentry)\s(.*?)\s(mm\/dd\/yyyy\sHH:mm:ss)(?sm).*?^(.*?)(?=^----|\Z)
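
To make the second point concrete, here is a small sketch comparing the two lookahead constructs on a deliberately tiny string. The test string and variable names are invented for the demonstration:

```powershell
$text = '----A----B log entry'

# (?!----) is checked once, right after the opening ----;
# the lazy .*? that follows is then free to run across the second ----.
$once = [regex]::Match($text, '(?s)----((?!----).*?)log entry')

# Here the lookahead is re-checked before every single character,
# so the captured group can never contain ----.
$each = [regex]::Match($text, '(?s)----((?:(?!----).)*)log entry')

"once: [$($once.Groups[1].Value)]"   # once: [A----B ]
"each: [$($each.Groups[1].Value)]"   # each: [B ]
```

The first pattern happily captures across the delimiter, while the second is forced to start at the second ---- in order to match at all.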


Example

Powershell Example

$String = '---- 1 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.
---- 2 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss ----
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.
---- 3 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. ----
---- 4 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.
---- 5 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. 
---- 6 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss ----
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.
'
clear

[regex]$Regex = '(?m)^----\s(.*?)\s(log\sentry)\s(.*?)\s(mm\/dd\/yyyy\sHH:mm:ss)(?sm).*?^(.*?)(?=^----|\Z)'
# [regex]$Regex = '(?sm)----((?!----).*?)(log\sentry)((?!----).*?)(mm\/dd\/yyyy\sHH:mm:ss)((?!----).*?)'

# cycle through all matches
$intCount = 0
Measure-Command {
    $Regex.matches($String) | foreach {
            $intCount += 1
            Write-Host "[$intCount][0]=" $_.Groups[0].Value
            Write-Host "[$intCount][1]=" $_.Groups[1].Value
            Write-Host "[$intCount][2]=" $_.Groups[2].Value
            Write-Host "[$intCount][3]=" $_.Groups[3].Value
            Write-Host "[$intCount][4]=" $_.Groups[4].Value
            Write-Host "[$intCount][5]=" $_.Groups[5].Value

        } # next match
    } | select Milliseconds

Output

[1][0]= ---- 1 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[1][1]= 1 SMTPRS
[1][2]= log entry
[1][3]= made at
[1][4]= mm/dd/yyyy HH:mm:ss
[1][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[2][0]= ---- 2 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss ----
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[2][1]= 2 SMTPRS
[2][2]= log entry
[2][3]= made at
[2][4]= mm/dd/yyyy HH:mm:ss
[2][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[3][0]= ---- 3 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. ----

[3][1]= 3 SMTPRS
[3][2]= log entry
[3][3]= made at
[3][4]= mm/dd/yyyy HH:mm:ss
[3][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. ----

[4][0]= ---- 4 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[4][1]= 4 SMTPRS
[4][2]= log entry
[4][3]= made at
[4][4]= mm/dd/yyyy HH:mm:ss
[4][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[5][0]= ---- 5 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. 

[5][1]= 5 SMTPRS
[5][2]= log entry
[5][3]= made at
[5][4]= mm/dd/yyyy HH:mm:ss
[5][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. 

[6][0]= ---- 6 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss ----
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. [6][1]= 6 SMTPRS
[6][2]= log entry
[6][3]= made at
[6][4]= mm/dd/yyyy HH:mm:ss
[6][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. 


Milliseconds
------------
16

Unfortunately, on my system this expression runs slightly slower, but I'm not using real data, so I'm curious whether you see any improvement with it.

You don't necessarily need regular expressions for parsing logs like that. Something like this should work as well:

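$userInput = "..."

$logfile = 'C:\path\to\your.log'

# Collect lines into $entry; emit the finished entry whenever a new ---- header starts.
# @(...) keeps $log an array even when only one entry is emitted.
$entry = $null
$log = @(Get-Content $logfile | % {
  $len = [Math]::Min(4, $_.Length)
  if ($_.SubString(0, $len) -eq '----' -and $null -ne $entry) {
    "$entry"
    $entry = $null
  }
  $entry += "$_`n"
})
# The last entry has no following header to flush it, so append it here.
$log += $entry

# Keep only the entries that contain the literal user input.
$log | ? { $_ -match [regex]::Escape($userInput) }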
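
On 10-30MB files the Get-Content pipeline and repeated string `+=` may themselves become noticeable. A possible variation on the same idea, streaming the file with `[System.IO.File]::ReadLines` and buffering each entry in a `StringBuilder`, is sketched below. The function name `Get-LogEntry`, its parameters, and the sample data are hypothetical, and whether it is actually faster would need to be measured on real logs:

```powershell
# Hypothetical helper; the name and parameters are illustrative only.
function Get-LogEntry {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,   # full path to one log file
        [Parameter(Mandatory = $true)] [string] $Text    # literal text to look for
    )
    $ignoreCase = [System.StringComparison]::OrdinalIgnoreCase
    $sb = New-Object System.Text.StringBuilder

    foreach ($line in [System.IO.File]::ReadLines($Path)) {
        # A new ---- header closes the entry collected so far.
        if ($line.StartsWith('----', [System.StringComparison]::Ordinal) -and $sb.Length -gt 0) {
            $entry = $sb.ToString()
            if ($entry.IndexOf($Text, $ignoreCase) -ge 0) { $entry }
            [void]$sb.Clear()
        }
        [void]$sb.AppendLine($line)
    }

    # The last entry has no following header to trigger the check.
    $entry = $sb.ToString()
    if ($entry.IndexOf($Text, $ignoreCase) -ge 0) { $entry }
}

# Usage against a throwaway file with made-up content.
$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tmp -Value @(
    '---- SMTPRS log entry made at 01/01/2024 10:00:00'
    '>>> MAIL FROM:<bdole@foo.com>'
    '---- SMTPRS log entry made at 01/01/2024 10:00:05'
    'Incoming SMTP call from 10.0.0.2 at 10:00:05.'
)
$hits = @(Get-LogEntry -Path $tmp -Text 'bdole')
Remove-Item $tmp
$hits
```

The .NET file APIs resolve relative paths against the process working directory rather than the PowerShell location, so passing a full path is the safer choice.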
