Using PowerShell to output characters (not lines) after a match in a large file

Question

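I use PowerShell to parse large files and easily view the small part of a file where a certain string occurs, like this: Select-String P120300420059211107104259.txt -Pattern "<ID>9671510841" -Context 0,300

This gives me the 300 lines of the file that follow the occurrence of that ID number.

But I've run into a file that has no carriage returns. Now I want to do the same thing, but instead of returning lines, I suppose I need characters. How would I do that? I have never written a script in PowerShell; I've only run simple commands like the one above.

I'd like to see maybe 1000 characters after the matched string in a huge file. Thanks!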

Answer 1

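The problem with using Select-String or [Regex]::Matches() (or -match) to test for the presence of a substring in a single-line file is that you first need to read the entire file into memory at once.

The good news is that you don't need regular expressions to find a substring in a huge single-line text file. Instead, you can read the file contents into memory in smaller chunks and search those, so the whole file never has to be held in memory at once.

Reading buffered text from a file is fairly straightforward:

Open a readable file stream
Create a StreamReader to read from the file stream
Start reading!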
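
As a rough illustration of those three steps, here is a minimal sketch. The function name Read-TextChunk and its parameters are made up for this example and are not part of the answer's script; it assumes PowerShell 5 or later and a file whose encoding StreamReader can detect (UTF-8 unless a byte-order mark says otherwise).

```powershell
# Hypothetical helper (not from the original answer): emits a text file
# as a sequence of strings of at most $ChunkSize characters each.
function Read-TextChunk {
  param(
    [Parameter(Mandatory = $true)]
    [string]$LiteralPath,

    [int]$ChunkSize = 2000
  )

  # resolve against PowerShell's current location, which .NET does not track
  $resolved = Convert-Path -LiteralPath $LiteralPath

  # opens the file and wraps it in a reader; $true enables BOM detection
  $reader = [System.IO.StreamReader]::new($resolved, $true)
  try {
    $buffer = [char[]]::new($ChunkSize)
    # ReadBlock returns 0 once the end of the file has been reached
    while (($count = $reader.ReadBlock($buffer, 0, $ChunkSize)) -gt 0) {
      # emit only the characters that were actually read
      [string]::new($buffer, 0, $count)
    }
  }
  finally {
    # disposing the reader also closes the underlying file stream
    $reader.Dispose()
  }
}
```

Only one buffer of $ChunkSize characters is alive at a time, which is what keeps memory use flat regardless of file size.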

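Then you just need to check whether:

The target substring is found in the chunk, or
The beginning of the target substring sits at the tail end of the current chunk

Then repeat until the substring is found, at which point you read the following 1000 characters.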
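
The second check, a target that is cut in half by a chunk boundary, can be sketched in isolation. Get-TailOverlap is a hypothetical name used only for this illustration, and the ordinal (exact, case-sensitive) comparison is an assumption of the sketch.

```powershell
# Hypothetical helper (not from the original answer): returns how many
# characters at the end of $Chunk could be the start of $Target, or 0.
function Get-TailOverlap {
  param(
    [Parameter(Mandatory = $true)]
    [AllowEmptyString()]
    [string]$Chunk,

    [Parameter(Mandatory = $true)]
    [string]$Target
  )

  # a complete match would already have been found by IndexOf, so at most
  # Target.Length - 1 trailing characters can belong to a clipped match
  $max = [Math]::Min($Chunk.Length, $Target.Length - 1)

  # try the longest candidate first so the longest overlap is returned
  for ($len = $max; $len -gt 0; $len--) {
    $tail = $Chunk.Substring($Chunk.Length - $len)
    if ($Target.StartsWith($tail, [StringComparison]::Ordinal)) {
      return $len
    }
  }
  return 0
}

Get-TailOverlap -Chunk 'xxxx<ID>96' -Target '<ID>9671510841'   # 6
Get-TailOverlap -Chunk 'no overlap' -Target '<ID>9671510841'   # 0
```

Whatever number comes back is how many trailing characters must be looked at again together with the next chunk, so a clipped match is not missed.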

Here's an example of how you could implement this as a script function (I've tried to explain the code in more detail in inline comments):

function Find-SubstringWithPostContext {
  [CmdletBinding(DefaultParameterSetName = 'wp')]
  param(
    [Alias('PSPath')]
    [Parameter(Mandatory = $true, ParameterSetName = 'lp', ValueFromPipelineByPropertyName = $true, ValueFromPipeline = $true)]
    [string[]]$LiteralPath,
  
    [Parameter(Mandatory = $true, ParameterSetName = 'wp', Position = 0)]
    [string[]]$Path,
  
    [Parameter(Mandatory = $true)]
    [ValidateLength(1, 5000)]
    [string]$Substring,

    [ValidateRange(2, 25000)]
    [int]$PostContext = 1000,

    [switch]$All,

    [System.Text.Encoding]
    $Encoding
  )

  begin {
    # start by ensuring we'll be using a buffer that's at least 4 times larger than the
    # target substring to avoid too many tail searches
    $bufferSize = 2000
    while ($Substring.Length -gt $bufferSize / 4) {
      $bufferSize *= 2
    }
    $buffer = [char[]]::new($bufferSize)
  }

  process {
    if ($PSCmdlet.ParameterSetName -eq 'wp') {
      # resolve input paths if necessary
      $LiteralPath = $Path | Convert-Path
    }
    
    # the label has to sit directly in front of the loop keyword
    :fileLoop foreach ($lp in $LiteralPath) {
      $file = Get-Item -LiteralPath $lp

      # skip directories
      if ($file -isnot [System.IO.FileInfo]) { continue }
        
      try {
        $fileStream = $file.OpenRead()
        # honour -Encoding when given, otherwise rely on byte-order-mark detection
        $scanner = if ($PSBoundParameters.ContainsKey('Encoding')) {
          [System.IO.StreamReader]::new($fileStream, $Encoding, $true)
        }
        else {
          [System.IO.StreamReader]::new($fileStream, $true)
        }

        # StreamReader reads ahead of what it hands back, so seeking the underlying stream
        # is unreliable; instead we carry the tail of each chunk over and count characters ourselves
        $carry = ''    # tail of the previous chunk
        $offset = 0L   # offset of $string[0] in the file, in decoded characters (not bytes)
        do {
          # read a chunk from the file, convert to string, prepend the carried-over tail
          $readCount = $scanner.ReadBlock($buffer, 0, $bufferSize)
          $eof = $readCount -lt $bufferSize
          $string = $carry + [string]::new($buffer, 0, $readCount)

          # test if target substring is found in the text we have so far
          $searchFrom = 0
          while (($indexOfTarget = $string.IndexOf($Substring, $searchFrom, [StringComparison]::Ordinal)) -ge 0) {
            Write-Verbose "Substring found in chunk at local index ${indexOfTarget}"
            # we found a match, ensure we've read enough post-context beyond the given index
            $missing = $PostContext - ($string.Length - $indexOfTarget)
            if ($missing -gt 0 -and -not $eof) {
              # just like above, we read more from the file and append it to the current string
              $tailBuffer = [char[]]::new($missing)
              $tailCount = $scanner.ReadBlock($tailBuffer, 0, $missing)
              $eof = $tailCount -lt $missing
              $string += [string]::new($tailBuffer, 0, $tailCount)
            }

            # construct and output the match followed by its post-context, capped at $PostContext characters
            Write-Verbose "Writing output object ..."
            Write-Output $([PSCustomObject]@{
              FilePath = $file.FullName
              Offset = $offset + $indexOfTarget
              Value = $string.Substring($indexOfTarget, [Math]::Min($PostContext, $string.Length - $indexOfTarget))
            })

            if (-not $All) {
              # no need to search this file any further unless `-All` was specified
              continue fileLoop
            }
            # with `-All`, keep searching right after the start of this match
            $searchFrom = $indexOfTarget + 1
          }

          # the target may have been "clipped" in half by the chunk boundary, so keep the last
          # (Substring.Length - 1) characters and re-test them together with the next chunk
          $keep = [Math]::Min($Substring.Length - 1, $string.Length)
          $offset += $string.Length - $keep
          $carry = $string.Substring($string.Length - $keep)
        } until ($eof)
      }
      finally {
        # remember to clean up after searching each file
        $scanner, $fileStream |Where-Object { $_ -is [System.IDisposable] } |ForEach-Object Dispose
      }
    }
  }
}

Now you can extract exactly 1000 characters from the point where the substring is found, with minimal memory allocation:

Get-ChildItem P*.txt |Find-SubstringWithPostContext -Substring '<ID>9671510841'

Answer 2

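I haven't tested this enough to tell you whether it works properly, but it was definitely fun to code. You can give it a try and let me know if it works :)

Usage: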

Get-ChildItem *.txt | Find-String -Pattern 'mypattern'
Get-ChildItem *.txt | Find-String -Pattern 'mypattern' -Context 20, 20
Get-ChildItem *.txt | Find-String -Pattern 'mypattern' -AllMatches

using namespace System.Text.RegularExpressions
using namespace System.IO

function Find-String {
param(
    [parameter(ValueFromPipeline,Mandatory)]
    [Alias('PSPath')]
    [FileInfo]$Path,
    [parameter(Mandatory, Position = 0)]
    [string]$Pattern,
    [RegexOptions[]]$Options = 'IgnoreCase',
    [switch]$AllMatches,
    [int[]]$Context = (0, 0)
)

    process
    {
        $re = [regex]::new($Pattern, $Options)

        # pass the full path: a FileInfo converted to a string may yield only the file name
        $content = [File]::ReadAllText($Path.FullName)
        $match = if($AllMatches.IsPresent)
        {
            $re.Matches($content)
        }
        else
        {
            $re.Match($content)
        }
        
        if($match.Success -notcontains $true) { return }

        foreach($m in $match)
        {
            $out = [ordered]@{
                Path = $path.FullName
                Value = $m.Value
                Index = $m.Index
            }

            if($PSBoundParameters.ContainsKey('Context'))
            {
                # first value: characters before the match; second value: characters after it.
                # Both ends are clamped so the window never leaves the content.
                $start = [Math]::Max(0, $m.Index - $Context[0])
                $end = [Math]::Min($content.Length, $m.Index + $m.Length + $Context[1])
                $out.Context = $content.Substring($start, $end - $start)
            }

            [pscustomobject]$out
        }
    }
}
