
Retain carriage returns in text filtered through a regular expression

I need to search through a folder of logs and retrieve the most recent ones. Then I need to filter each log, pull out the relevant information, and save it to another file.

The problem is the regular expression I use to filter the log is dropping the carriage return and the line feed so the new file just contains a jumble of text.

$Reg = "(?ms)\*{6}\sBEGIN(.|\n){98}13.06.2015(.|\n){104}00000003.*(?!\*\*)+"
get-childitem "logfolder" -filter *.log |
  where-object {$_.LastAccessTime -gt [datetime]$Test.StartTime} | 
  foreach {
     $a=get-content $_;
     [regex]::matches($a,$reg) | foreach {$_.groups[0].value > "MyOutFile"}
  }

Log structure:

******* BEGIN MESSAGE *******

<Info line 1>
Date   18.03.2010 15:07:37   18.03.2010
<Info line 2>      
File Number:  00000003
<Info line 3>   

*Variable number of lines*
******* END MESSAGE *******

Basically, capture everything between BEGIN and END where the date and file number have certain values. Does anyone know how I can do this without losing the line feeds? I also tried using Out-File | Select-String -Pattern $reg, but I've never had success using Select-String on a multiline record.

I wanted to see whether I could improve that regex, but for now: since you are using those regex modes, you should read your text file in as a single string, which helps a lot.

$a=get-content $_ -Raw

or if you don't have PowerShell 3.0

$a=(get-content $_) -join "`r`n"

As @Matt pointed out, you need to read the entire file as a single string if you want to do multiline matches. Otherwise your (multiline) regular expression would be applied to single lines one after the other. There are several ways to get the content of a file as a single string:

  • (Get-Content 'C:\path\to\file.txt') -join "`r`n"
  • Get-Content 'C:\path\to\file.txt' | Out-String
  • Get-Content 'C:\path\to\file.txt' -Raw (requires PowerShell v3 or newer)
  • [IO.File]::ReadAllText('C:\path\to\file.txt')
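
To illustrate why the array form loses the line breaks, here is a minimal sketch. The three-element array is a hypothetical stand-in for what Get-Content returns without -Raw, and it assumes $OFS has not been changed from its default:

```powershell
# Hypothetical stand-in for Get-Content output: one array element per line.
$lines = 'BEGIN', 'payload', 'END'

# Passing the array where a string is expected makes PowerShell join the
# elements with a space (the default separator), so the line breaks are gone.
$fromArray = [regex]::Match($lines, '(?s)BEGIN.*END').Value

# Joining explicitly with CRLF keeps the line structure in the match.
$text = $lines -join "`r`n"
$fromString = [regex]::Match($text, '(?s)BEGIN.*END').Value
```

In this sketch $fromArray is a single space-separated line, while $fromString still contains the CRLF sequences between the three lines.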

Also, I'd modify the regular expression a little. Log messages usually vary in length, so matching fixed lengths may fail as soon as a message changes. It's better to match on the invariant parts of the string and leave the rest as variable-length matches. And personally I find it a lot easier to do this kind of content extraction in several steps (it makes for simpler regular expressions). In your case I would first separate the log entries from each other, and then filter the content:

$date = [regex]::Escape('13.06.2015')
$fnum = '00000003'

$re1 = "(?ms)\*{7} BEGIN MESSAGE \*{7}\s*([\s\S]*?)\*{7} END MESSAGE \*{7}"
$re2 = "(?ms)[\s\S]*?Date\s+$date[\s\S]*?File Number:\s+$fnum[\s\S]*"

Get-ChildItem 'C:\log\folder' -Filter '*.log' | ? {
  $_.LastAccessTime -gt [DateTime]$Test.StartTime
} | % {
  Get-Content $_.FullName -Raw |
    Select-String -Pattern $re1 -AllMatches |
    select -Expand Matches |
    % {
      $_.Groups[1].Value |
        Select-String -Pattern $re2 |
        select -Expand Matches |
        select -Expand Groups |
        select -Expand Value
    }
} | Set-Content 'C:\path\to\output.txt'
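
If you want to try the two-step idea without touching any files, the following sketch runs the same two patterns against an in-memory string. The sample entries ('Info A', 'Info B') are made up for illustration, and the second step uses Where-Object with -match as a simplification swapped in for the second Select-String call:

```powershell
# Hypothetical sample: two log entries joined with CRLF, as -Raw would return them.
$sample = @(
  '******* BEGIN MESSAGE *******'
  ''
  'Info A'
  'Date   13.06.2015 15:07:37   13.06.2015'
  'File Number:  00000003'
  '******* END MESSAGE *******'
  '******* BEGIN MESSAGE *******'
  ''
  'Info B'
  'Date   18.03.2010 15:07:37   18.03.2010'
  'File Number:  00000003'
  '******* END MESSAGE *******'
) -join "`r`n"

$date = [regex]::Escape('13.06.2015')
$fnum = '00000003'

# Same patterns as in the pipeline above.
$re1 = "(?ms)\*{7} BEGIN MESSAGE \*{7}\s*([\s\S]*?)\*{7} END MESSAGE \*{7}"
$re2 = "(?ms)[\s\S]*?Date\s+$date[\s\S]*?File Number:\s+$fnum[\s\S]*"

# Step 1: split into entries. Step 2: keep entries with the wanted date and number.
$hits = [regex]::Matches($sample, $re1) |
  ForEach-Object { $_.Groups[1].Value } |
  Where-Object { $_ -match $re2 }
```

Only the first entry passes the filter here, and its text still contains the original line breaks.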

BTW, don't use the redirection operator (>) inside a loop. It would overwrite the output file's content with each iteration. If you must write to a file inside a loop, use the append redirection operator (>>) instead. However, performance-wise it's usually better to put writing to output files at the end of the pipeline (see above).
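
A small sketch of the difference, writing to a throwaway file in the temp folder (the file name and the 'entry N' strings are made up for this example):

```powershell
# Hypothetical scratch file; the name is only for this demonstration.
$tmp = Join-Path ([IO.Path]::GetTempPath()) ("redirect-demo-{0}.txt" -f [guid]::NewGuid())

# '>' inside the loop: each iteration replaces the file, so only the last write survives.
foreach ($i in 1..3) { "entry $i" > $tmp }
$overwritten = @(Get-Content $tmp)
Remove-Item $tmp

# '>>' inside the loop: each iteration appends, so all three writes are kept.
foreach ($i in 1..3) { "entry $i" >> $tmp }
$appended = @(Get-Content $tmp)
Remove-Item $tmp

# Writing once at the end of the pipeline also keeps all three lines.
1..3 | ForEach-Object { "entry $_" } | Set-Content $tmp
$piped = @(Get-Content $tmp)
Remove-Item $tmp
```

After running this, $overwritten holds a single line, while $appended and $piped each hold three.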

I had to solve the problem of disappearing newlines in a completely different context. What you get when you run Get-Content on a text file is an array of records, where each record is a line of text.

The only way I found to put the newlines back in after some transformation was to use the automatic variable $OFS (output field separator), which PowerShell uses as the separator when it converts an array to a string. The default value is a space, but if you set it to carriage return plus line feed, you get separate records on separate lines.

So try this (it might work):

$OFS = "`r`n"
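
A quick way to see the effect, as a sketch that assumes $OFS is not already set in your session (the three strings are made-up sample records):

```powershell
# Hypothetical records, as Get-Content would return them (one per line).
$lines = 'alpha', 'beta', 'gamma'

# With $OFS unset, expanding the array in a string joins it with spaces.
$spaceJoined = "$lines"

# With $OFS set to CRLF, the same expansion puts each record on its own line.
$OFS = "`r`n"
$crlfJoined = "$lines"

# Remove the variable again so later array-to-string conversions use the default.
Remove-Variable OFS
```

Since $OFS affects every array-to-string conversion while it is set, it is probably safer to limit it to the spot where you need it, as done here with Remove-Variable.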


 