Retain carriage returns in text filtered through a regular expression

I need to search through a folder of logs and retrieve the most recent ones. Then I need to filter each log, pull out the relevant information, and save it to another file.

The problem is that the regular expression I use to filter the log drops the carriage returns and line feeds, so the new file just contains a jumble of text.

$Reg = "(?ms)\*{6}\sBEGIN(.|\n){98}13.06.2015(.|\n){104}00000003.*(?!\*\*)+"
get-childitem "logfolder" -filter *.log |
  where-object {$_.LastAccessTime -gt [datetime]$Test.StartTime} | 
  foreach {
     $a=get-content $_;
     [regex]::matches($a,$reg) | foreach {$_.groups[0].value > "MyOutFile"}
  }

Log structure:

******* BEGIN MESSAGE *******

<Info line 1>
Date   18.03.2010 15:07:37   18.03.2010
<Info line 2>      
File Number:  00000003
<Info line 3>   

*Variable number of lines*
******* END MESSAGE *******

Basically, I want to capture everything between BEGIN and END where the date and file number have certain values. Does anyone know how I can do this without losing the line feeds? I also tried using Out-File | Select-String -Pattern $reg, but I've never had success using Select-String on a multiline record.

I wanted to see if I could make that regex better, but for now: if you are using those regex modes, you should read your text file in as a single string, which helps a lot.

$a=get-content $_ -Raw

or, if you don't have PowerShell 3.0:

$a=(get-content $_) -join "`r`n"
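To see why the line breaks vanish in the first place, here is a minimal sketch. The scratch file name `ofs-demo.log` is a placeholder chosen for illustration, and the sketch assumes `$OFS` has not been set in the session. `Get-Content` without `-Raw` returns an array of lines, and converting that array to a string (which is what happens when it is handed to `[regex]::Matches`) joins the elements with a space:

```powershell
# Hypothetical scratch file, used only for this demonstration.
$path = Join-Path ([IO.Path]::GetTempPath()) 'ofs-demo.log'
Set-Content -Path $path -Value 'line one', 'line two', 'line three'

$lines = Get-Content $path        # array of strings, one element per line
$text  = Get-Content $path -Raw   # one string, line breaks intact (PowerShell 3.0+)

$lines.GetType().Name             # Object[]
$text.GetType().Name              # String

# Converting the array to a string joins its elements with $OFS,
# which acts as a single space when it has not been set.
$flattened = [string]$lines
$flattened -match "`n"            # False: the line breaks are gone
$text -match "`n"                 # True

Remove-Item $path
```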

As @Matt pointed out, you need to read the entire file as a single string if you want to do multiline matches. Otherwise your (multiline) regular expression would be applied to single lines one after the other. There are several ways to get the content of a file as a single string:

  • (Get-Content 'C:\path\to\file.txt') -join "`r`n"
  • Get-Content 'C:\path\to\file.txt' | Out-String
  • Get-Content 'C:\path\to\file.txt' -Raw (requires PowerShell v3 or newer)
  • [IO.File]::ReadAllText('C:\path\to\file.txt')
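The four approaches are close but not identical, mainly in how they treat the final line break. The sketch below compares them on a hypothetical scratch file (`single-string-demo.txt` is a placeholder name). It joins with `[Environment]::NewLine` rather than a hard-coded "`r`n" so the comparison also holds on non-Windows hosts.

```powershell
# Hypothetical scratch file, used only for this demonstration.
$path = Join-Path ([IO.Path]::GetTempPath()) 'single-string-demo.txt'
Set-Content -Path $path -Value 'alpha', 'beta'

$nl = [Environment]::NewLine

$joined    = (Get-Content $path) -join $nl    # no line break after the last line
$outString = Get-Content $path | Out-String   # line break after every line
$raw       = Get-Content $path -Raw           # file content exactly as stored
$readAll   = [IO.File]::ReadAllText($path)    # same, via .NET

$raw -eq $readAll               # True
$joined -eq $raw.TrimEnd()      # True: only the trailing line break differs
$outString.EndsWith($nl)        # True

Remove-Item $path
```

One practical difference: `[IO.File]::ReadAllText` resolves relative paths against the process working directory, not the current PowerShell location, so it is safest to pass it a full path.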

Also, I'd modify the regular expression a little. Log messages often vary in length, so matching fixed lengths may fail if the log message changes. It's better to match on invariant parts of the string and leave the rest as variable-length matches. Personally, I find it a lot easier to do this kind of content extraction in several steps (it makes for simpler regular expressions). In your case I would first separate the log entries from each other, and then filter the content:

$date = [regex]::Escape('13.06.2015')
$fnum = '00000003'

$re1 = "(?ms)\*{7} BEGIN MESSAGE \*{7}\s*([\s\S]*?)\*{7} END MESSAGE \*{7}"
$re2 = "(?ms)[\s\S]*?Date\s+$date[\s\S]*?File Number:\s+$fnum[\s\S]*"

Get-ChildItem 'C:\log\folder' -Filter '*.log' | ? {
  $_.LastAccessTime -gt [DateTime]$Test.StartTime
} | % {
  Get-Content $_.FullName -Raw |
    Select-String -Pattern $re1 -AllMatches |
    select -Expand Matches |
    % {
      $_.Groups[1].Value |
        Select-String -Pattern $re2 |
        select -Expand Matches |
        select -Expand Groups |
        select -Expand Value
    }
} | Set-Content 'C:\path\to\output.txt'
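The same split-then-filter idea can be tried without any log files. The following is a minimal sketch on an in-memory string: the sample log text and the variable names `$sample`, `$reEntry`, `$reFilter`, `$bodies`, and `$kept` are made up for illustration, and it calls `[regex]::Matches` and `-match` directly instead of `Select-String`.

```powershell
# Made-up sample data in the shape of the log structure from the question.
$sample = @'
******* BEGIN MESSAGE *******

Info A
Date   13.06.2015 15:07:37   13.06.2015
Info B
File Number:  00000003
Info C
******* END MESSAGE *******
******* BEGIN MESSAGE *******

Info X
Date   14.06.2015 09:00:00   14.06.2015
Info Y
File Number:  00000003
Info Z
******* END MESSAGE *******
'@

$date = [regex]::Escape('13.06.2015')
$fnum = '00000003'

# Step 1: split the text into message bodies (capture group 1).
$reEntry = '\*{7} BEGIN MESSAGE \*{7}\s*([\s\S]*?)\*{7} END MESSAGE \*{7}'
# Step 2: keep only bodies with the wanted date and file number.
$reFilter = "Date\s+$date[\s\S]*?File Number:\s+$fnum"

$bodies = [regex]::Matches($sample, $reEntry) | ForEach-Object { $_.Groups[1].Value }
$kept   = @($bodies | Where-Object { $_ -match $reFilter })

$kept.Count    # 1: only the entry dated 13.06.2015 passes the filter
$kept[0]       # the matching body, still split across several lines
```

Because each body is filtered on its own, the second pattern can never match across two adjacent entries, which is one of the reasons the two-step approach is easier to get right.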

BTW, don't use the redirection operator ( > ) inside a loop. It would overwrite the output file's content with each iteration. If you must write to a file inside a loop, use the append redirection operator ( >> ) instead. However, performance-wise it's usually better to put the write to the output file at the end of the pipeline (see above).
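A small sketch of the three behaviours, using a hypothetical scratch file (`redirect-demo.txt` is a placeholder name):

```powershell
# Hypothetical scratch file, used only for this demonstration.
$out = Join-Path ([IO.Path]::GetTempPath()) 'redirect-demo.txt'

# '>' recreates the file on every pass, so only the last item survives.
foreach ($item in 'first', 'second', 'third') { $item > $out }
$overwritten = @(Get-Content $out)
$overwritten.Count    # 1

# '>>' appends; delete the file first so stale content is not carried over.
Remove-Item $out
foreach ($item in 'first', 'second', 'third') { $item >> $out }
$appended = @(Get-Content $out)
$appended.Count       # 3

# Writing once at the end of the pipeline opens the file a single time.
'first', 'second', 'third' | Set-Content $out
$piped = @(Get-Content $out)
$piped.Count          # 3

Remove-Item $out
```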

I had to solve the problem of disappearing newlines in a completely different context. What you get when you do a get-content of a text file is an array of records, where each record is a line of text.

The only way I found to put the newline back in after some transformation was to use the automatic variable $OFS (output field separator). The default value is a space, but if you set it to carriage return plus line feed, then you get separate records on separate lines.

So try this (it might work):

$OFS = "`r`n"
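As a rough illustration of what that setting changes, the sketch below expands an array inside a double-quoted string with and without `$OFS` set. The names `$records`, `$default`, and `$joined` are placeholders, and the first result assumes `$OFS` has not already been set in the session. The assignment is wrapped in a script block so the change stays local to it:

```powershell
$records = 'alpha', 'beta', 'gamma'

# Without $OFS set, array elements are joined with a single space.
$default = "$records"
$default                                 # alpha beta gamma

# Inside the script block, $OFS only affects this child scope.
$joined = & {
    $OFS = "`r`n"
    "$records"
}
$joined -eq ($records -join "`r`n")      # True
```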
