PowerShell: remove any lines from a big text file that contain any of a large number of strings

We have a large (~100MB) text file. We need to remove any lines that contain certain phrases. I would like to use PowerShell to replace the current method, which is a .bat file that uses Windows grep.

The problem is that there are about 95 key phrases. Any line containing any of these phrases must be removed.

The list of key phrases is stored in "badPhrases.txt", one phrase per line, like a regular text file. There are around 100 of them, so I don't want to include them in a hard-coded list, but I will if I have to.

I have tried a few comparisons, but my output is always LARGER than my original input file! Or it is 0 KB (empty). What am I doing wrong? I suspect the problem is in the Where-Object filter, but I could be wrong.

[string[]]$arrayFromFile = Get-Content -Path '.\badPhrases.txt'
Get-Content ".\inputfile.txt" | Where-Object {$_ -notlike $arrayFromFile} | Out-File ".\clean_data.txt" -Force

I've tried -notlike, -notin, -notmatch and -notcontains (while flipping the array and the input object around in ways that seemed logical), such as...

Where-Object {$arrayFromFile -notin $_}
....
Where-Object {$_ -notcontains $arrayFromFile}
....
Where-Object {$_ -notlike $arrayFromFile}

I have searched Stack Overflow and googled around, but I can't find any links that address this exact use case and aren't dead. There was a "Hey, Scripting Guy!" reference, but the link was dead.

Use Select-String, which supports multiple search criteria via an array of strings passed to its -Pattern parameter:

Select-String -NotMatch -SimpleMatch -Pattern (Get-Content -Path .\badPhrases.txt) .\inputfile.txt |
  Select-Object -ExpandProperty Line |
    Out-File .\clean_data.txt -Force

Character-encoding caveat: In Windows PowerShell, Out-File creates "Unicode" (UTF-16LE) files by default, where each character is represented by (at least) 2 bytes; in PowerShell [Core] 6+, the default is more sensibly BOM-less UTF-8; use the -Encoding parameter to control the character encoding explicitly.
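To make the encoding caveat concrete, here is a minimal sketch that requests UTF-8 explicitly and then checks whether a byte-order mark was written. The temp folder name and the sample lines are hypothetical, chosen only so the snippet runs on its own.

```powershell
# Hypothetical scratch folder so the sketch is self-contained.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) 'encoding-demo'
New-Item -ItemType Directory -Path $dir -Force | Out-Null
$outFile = Join-Path $dir 'clean_data.txt'

# Request UTF-8 explicitly instead of relying on the edition-specific default.
'keep this line', 'and this one' | Out-File -FilePath $outFile -Encoding utf8 -Force

# Inspect the first bytes: with -Encoding utf8, Windows PowerShell writes a
# UTF-8 BOM (EF BB BF), whereas PowerShell 6+ writes no BOM.
$bytes = [System.IO.File]::ReadAllBytes($outFile)
$hasBom = $bytes.Length -ge 3 -and
          $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF
"UTF-8 BOM present: $hasBom"

# Read the file back to confirm the content survived the round trip.
$roundTrip = @(Get-Content -Path $outFile)
```

Either way, the file takes roughly one byte per ASCII character rather than two, which matters for an input of around 100MB.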

  • -NotMatch negates the matching, so that only lines not matching any of the pattern strings are output.

  • -SimpleMatch ensures that the patterns are matched literally against the lines of the input file; by default, they're interpreted as regular expressions.

  • Note that matching is case-insensitive by default; use -CaseSensitive, if needed.

  • Since Select-String outputs Microsoft.PowerShell.Commands.MatchInfo instances by default, Select-Object -ExpandProperty Line is needed to extract the lines themselves.

    • Note: In PowerShell 7+, you can use Select-String's -Raw switch instead.
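The switches described above can be tried on a small scale with the following sketch. The file names mirror the ones in the question, but the temp folder, the two sample phrases and the five sample lines are made up for illustration.

```powershell
# Hypothetical scratch folder and sample data so the sketch is self-contained.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) 'select-string-demo'
New-Item -ItemType Directory -Path $dir -Force | Out-Null
$phraseFile = Join-Path $dir 'badPhrases.txt'
$inputFile  = Join-Path $dir 'inputfile.txt'
$outputFile = Join-Path $dir 'clean_data.txt'

Set-Content -Path $phraseFile -Value 'foo.bar', 'SECRET'
Set-Content -Path $inputFile -Value @(
    'keep me'
    'has foo.bar inside'   # contains the literal phrase 'foo.bar'
    'fooXbar is fine'      # would match only if '.' were treated as a regex wildcard
    'a secret line'        # matches 'SECRET' because matching is case-insensitive
    'keep me too'
)

# -SimpleMatch: literal matching; -NotMatch: keep lines that match none of the phrases.
$clean = @(
    Select-String -Path $inputFile -Pattern (Get-Content -Path $phraseFile) -SimpleMatch -NotMatch |
        Select-Object -ExpandProperty Line
)

$clean | Out-File -FilePath $outputFile -Encoding utf8 -Force
$clean
```

With this sample data, the two lines containing a phrase are dropped and the other three are kept, including 'fooXbar is fine', which shows the effect of -SimpleMatch.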

As for what you tried:

$_ -notlike $arrayFromFile

You cannot use an array as the RHS of string-comparison operators such as -like, -match and -eq; you can only match against one string at a time.

(Apart from that, -like / -notlike match against the entire LHS by default; to match a substring of the LHS, you'd have to put * on either end of the RHS.)
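If you would rather keep a Where-Object filter, one possible workaround (not part of the answer above, just a sketch) is to fold all phrases into a single regular expression, so that the RHS is one string again. The phrase list and sample lines here are hypothetical; [regex]::Escape is used so that the phrases are still matched literally.

```powershell
# Hypothetical phrases and lines, standing in for badPhrases.txt and inputfile.txt.
$badPhrases = 'foo.bar', 'secret'
$lines = 'keep me', 'has foo.bar inside', 'fooXbar is fine', 'a SECRET line'

# -like compares against the whole string unless the pattern has wildcards.
$wholeMatch = 'has foo.bar inside' -like 'foo.bar'      # no wildcards: must equal the entire line
$subMatch   = 'has foo.bar inside' -like '*foo.bar*'    # wildcards on both ends: substring match

# Escape each phrase, then join with '|' (regex alternation) into ONE pattern string.
$pattern = ($badPhrases | ForEach-Object { [regex]::Escape($_) }) -join '|'

# -notmatch takes a single regex on the RHS and is case-insensitive by default.
$kept = @($lines | Where-Object { $_ -notmatch $pattern })
$kept
```

For a file of around 100MB this pipeline is likely to be slower than Select-String, so treat it as an illustration of the operator semantics rather than a recommendation.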

See this answer for more information.

$arrayFromFile -notin $_

$_ -notcontains $arrayFromFile

In principle, you'd have to reverse the operands for the containment operators -in and -contains and their negations: the syntax is <array> -contains <value> and <value> -in <array>. The problem, again, is that entire strings are compared either way, so this approach would only work if $arrayFromFile contained full lines present in the input (-in and -contains implicitly perform per-element -eq comparisons).
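A short sketch of the containment semantics, using a made-up two-element phrase list, shows why these operators cannot find a phrase inside a longer line:

```powershell
# Hypothetical phrase list standing in for $arrayFromFile.
$badPhrases = 'foo', 'bar'

# Correct operand order: <array> -contains <value> and <value> -in <array>.
$exactContains = $badPhrases -contains 'foo'          # an element equals 'foo'
$exactIn       = 'foo' -in $badPhrases                # same test, operands reversed

# Each element is compared with -eq against the whole value, so a line that
# merely CONTAINS a phrase is not considered a member of the array.
$lineContains  = $badPhrases -contains 'a foo line'
$lineIn        = 'a foo line' -in $badPhrases

"exact: $exactContains / $exactIn; whole line: $lineContains / $lineIn"
```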
