Using PowerShell to output characters (not lines) after a match in a large file
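I use PowerShell to parse huge files and easily view the small part of a file where a certain string occurs, like this: Select-String P120300420059211107104259.txt -Pattern "<ID>9671510841" -Context 0,300
This gives me the 300 lines of the file that follow the occurrence of that ID number.
But I have come across a file that has no carriage returns. Now I would like to do the same thing, but instead of lines being returned, I guess I need characters. How would I do this? I have never created a script in PowerShell; I have only run simple commands like the one above.
I would like to see maybe 1000 characters after the matched string within a huge file. Thanks!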
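The problem with using Select-String or [Regex]::Matches() (or -match) to test for the presence of a substring in a single-line file is that you first need to read the entire file into memory at once.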
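For comparison, when the file does fit in memory, the in-memory approach might look roughly like the sketch below. Get-TextAfterMatch is a hypothetical helper name introduced here for illustration (it is not a built-in cmdlet), and the whole text is assumed to be available as a single string already.

```powershell
# Sketch only: Get-TextAfterMatch is a hypothetical helper, not a built-in cmdlet.
# It returns the literal match plus up to $Count characters that follow it.
function Get-TextAfterMatch {
    param(
        [Parameter(Mandatory)][string]$Text,
        [Parameter(Mandatory)][string]$Literal,
        [int]$Count = 1000
    )
    # Escape the literal so characters such as '<' or '.' are not treated as regex syntax;
    # [\s\S] matches any character, including line breaks.
    $pattern = [regex]::Escape($Literal) + "[\s\S]{0,$Count}"
    $m = [regex]::Match($Text, $pattern)
    if ($m.Success) { $m.Value }
}
```

It could be called as Get-TextAfterMatch -Text (Get-Content -Raw P120300420059211107104259.txt) -Literal '<ID>9671510841', which still loads the entire file into memory first.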
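The good news is that you don't need regex to find a substring in a huge single-line text file. Instead, you can read the file contents into memory in smaller chunks and search those, so you never need to hold the whole file in memory at once.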
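Reading buffered text from a file is fairly simple: open a file stream and create a StreamReader to read from it.
For each chunk you read, you then just need to check whether:
the chunk contains the target substring, or
the end of the chunk contains the beginning of the target substring (a match that was split across two chunks).
Then repeat until you find the substring, at which point you read the following 1000 characters.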
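The chunked read described above can be sketched in isolation as follows. Read-TextChunk is a hypothetical helper name and the tiny chunk size is only there to make the boundary problem visible; it is a minimal sketch, not the full search.

```powershell
# Sketch only: Read-TextChunk is a hypothetical helper that emits a file's text
# in fixed-size chunks of characters without loading the whole file.
function Read-TextChunk {
    param(
        [Parameter(Mandatory)][string]$LiteralPath,
        [int]$ChunkSize = 2000
    )
    $reader = [System.IO.StreamReader]::new((Convert-Path -LiteralPath $LiteralPath))
    try {
        $buffer = [char[]]::new($ChunkSize)
        # ReadBlock returns fewer than $ChunkSize characters only at the end of the file
        while (($count = $reader.ReadBlock($buffer, 0, $ChunkSize)) -gt 0) {
            [string]::new($buffer, 0, $count)
        }
    }
    finally {
        $reader.Dispose()
    }
}

# Demo on a throwaway file: 'def' straddles the first two 4-character chunks
$tmp = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllText($tmp, 'abcdefghij')
$chunks = @(Read-TextChunk -LiteralPath $tmp -ChunkSize 4)
$foundInAnyChunk = [bool]($chunks | Where-Object { $_.Contains('def') })
Remove-Item -LiteralPath $tmp
```

Searching each chunk on its own misses 'def' here, which is why the end of every chunk has to be carried over into, or re-tested with, the next one.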
Here is an example of how you could implement this as a script function (I have tried to explain the code in more detail in the inline comments):
function Find-SubstringWithPostContext {
    [CmdletBinding(DefaultParameterSetName = 'wp')]
    param(
        [Alias('PSPath')]
        [Parameter(Mandatory = $true, ParameterSetName = 'lp', ValueFromPipelineByPropertyName = $true, ValueFromPipeline = $true)]
        [string[]]$LiteralPath,

        [Parameter(Mandatory = $true, ParameterSetName = 'wp', Position = 0)]
        [string[]]$Path,

        [Parameter(Mandatory = $true)]
        [ValidateLength(1, 5000)]
        [string]$Substring,

        [ValidateRange(2, 25000)]
        [int]$PostContext = 1000,

        [switch]$All,

        # optional; when omitted, the reader detects the encoding from a byte order mark
        [System.Text.Encoding]$Encoding
    )

    begin {
        # start by ensuring we'll be using a buffer that's at least 4 times larger than
        # the target substring, so that most matches fall inside a single chunk
        $bufferSize = 2000
        while ($Substring.Length -gt $bufferSize / 4) {
            $bufferSize *= 2
        }
        $buffer = [char[]]::new($bufferSize)
    }

    process {
        if ($PSCmdlet.ParameterSetName -eq 'wp') {
            # resolve input paths if necessary
            $LiteralPath = $Path | Convert-Path
        }

        :fileLoop foreach ($lp in $LiteralPath) {
            $file = Get-Item -LiteralPath $lp
            # skip directories
            if ($file -isnot [System.IO.FileInfo]) { continue }

            $fileStream = $scanner = $null
            try {
                $fileStream = $file.OpenRead()
                $scanner = if ($Encoding) {
                    [System.IO.StreamReader]::new($fileStream, $Encoding, $true)
                }
                else {
                    [System.IO.StreamReader]::new($fileStream, $true)
                }

                # $window holds text that has been read but not yet ruled out. The reader
                # buffers internally, so instead of seeking the underlying stream we keep
                # the unsearched tail of each chunk in memory.
                $window = ''
                [long]$windowOffset = 0   # character offset of $window[0] in the decoded text
                $eof = $false
                do {
                    # top up the window with another chunk from the file
                    if (-not $eof -and $window.Length -lt $bufferSize) {
                        $readCount = $scanner.ReadBlock($buffer, 0, $bufferSize)
                        $window += [string]::new($buffer, 0, $readCount)
                        $eof = $readCount -lt $bufferSize
                    }

                    # test if the target substring is found (case-sensitive, ordinal comparison)
                    $indexOfTarget = $window.IndexOf($Substring, [System.StringComparison]::Ordinal)
                    if ($indexOfTarget -ge 0) {
                        Write-Verbose "Substring found in window at local index ${indexOfTarget}"
                        # we found a match, ensure we've read enough post-context after the given index
                        $missing = $PostContext - ($window.Length - $indexOfTarget)
                        if ($missing -gt 0 -and -not $eof) {
                            $tailBuffer = [char[]]::new($missing)
                            $tailCount = $scanner.ReadBlock($tailBuffer, 0, $missing)
                            $window += [string]::new($tailBuffer, 0, $tailCount)
                            $eof = $tailCount -lt $missing
                        }

                        # construct and output the match plus post-context, $PostContext characters at most
                        $substringWithPostContext = $window.Substring($indexOfTarget)
                        if ($substringWithPostContext.Length -gt $PostContext) {
                            $substringWithPostContext = $substringWithPostContext.Remove($PostContext)
                        }
                        Write-Verbose "Writing output object ..."
                        [PSCustomObject]@{
                            FilePath = $file.FullName
                            # counted in characters, not bytes
                            Offset   = $windowOffset + $indexOfTarget
                            Value    = $substringWithPostContext
                        }

                        if (-not $All) {
                            # no need to search this file any further unless `-All` was specified
                            continue fileLoop
                        }
                        # resume the search right after this match
                        $consumed = $indexOfTarget + $Substring.Length
                    }
                    else {
                        # target was not found, but we may have "clipped" it in half, so keep
                        # the last (length - 1) characters to re-test them with the next chunk
                        $consumed = [Math]::Max(0, $window.Length - $Substring.Length + 1)
                    }
                    $window = $window.Substring($consumed)
                    $windowOffset += $consumed
                } until ($eof -and $window.Length -lt $Substring.Length)
            }
            finally {
                # remember to clean up after searching each file
                $scanner, $fileStream | Where-Object { $_ -is [System.IDisposable] } | ForEach-Object Dispose
            }
        }
    }
}
Now you can extract exactly 1000 characters, starting at the matched substring, with minimal memory allocation:
Get-ChildItem P*.txt |Find-SubstringWithPostContext -Substring '<ID>9671510841'
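I haven't tested the function below enough to tell you whether it works correctly, but it was definitely fun to code. Give it a try and let me know if it works :)
Usage: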
Get-ChildItem *.txt | Find-String -Pattern 'mypattern'
Get-ChildItem *.txt | Find-String -Pattern 'mypattern' -Context 20, 20
Get-ChildItem *.txt | Find-String -Pattern 'mypattern' -AllMatches
# `using namespace` statements must be the first statements in the script file
using namespace System.Text.RegularExpressions
using namespace System.IO

function Find-String {
    param(
        [Parameter(ValueFromPipeline, Mandatory)]
        [Alias('PSPath')]
        [FileInfo]$Path,

        [Parameter(Mandatory, Position = 0)]
        [string]$Pattern,

        [RegexOptions[]]$Options = 'IgnoreCase',

        [switch]$AllMatches,

        # number of characters to include before and after each match
        [int[]]$Context = (0, 0)
    )

    process {
        $re = [regex]::new($Pattern, $Options)
        # note: this reads the whole file into memory; use the full path so the result
        # doesn't depend on how the FileInfo converts to a string
        $content = [File]::ReadAllText($Path.FullName)

        $match = if ($AllMatches.IsPresent) {
            $re.Matches($content)
        }
        else {
            $re.Match($content)
        }

        # nothing to output if there was no successful match
        if ($match.Success -notcontains $true) { return }

        foreach ($m in $match) {
            $out = [ordered]@{
                Path  = $Path.FullName
                Value = $m.Value
                Index = $m.Index
            }

            if ($PSBoundParameters.ContainsKey('Context')) {
                # clamp the context window to the bounds of the file content;
                # the "after" context is counted from the end of the match
                $start = [Math]::Max(0, $m.Index - $Context[0])
                $end = [Math]::Min($content.Length, $m.Index + $m.Length + $Context[1])
                $out.Context = $content.Substring($start, $end - $start)
            }

            [pscustomobject]$out
        }
    }
}