
Fastest way to parse thousands of small files in PowerShell

I have over 16,000 inventory log files, each 3-5 KB, on a network share. A sample file looks like this:

## System Info
SystemManufacturer:=:Dell Inc.                
SystemModel:=:OptiPlex GX620               
SystemType:=:X86-based PC
ChassisType:=:6 (Mini Tower)

## System Type
isLaptop=No

I need to load them into a database, so I started parsing them and creating a custom object for each entry that I can later use to check for duplicates, normalize values, etc.

The initial parse, using the snippet below, took about 7.5 minutes:

Foreach ($invlog in $invlogs) {
    $content = Get-Content $invlog.FullName -ReadCount 0
    foreach ($line in $content) {
        # Skip comment headers and blank lines
        if ($line -match '^#|^\s*$') { continue }
        $invitem,$value = $line -split ':=:'
        [PSCustomObject]@{Name=$invitem;Value=$value}
    }
}

I started optimizing it, and after several rounds of trial and error I ended up with this, which takes 2 minutes and 4 seconds:

Foreach ($invlog in $invlogs) {
    foreach ($line in ([System.IO.File]::ReadLines("$($invlog.FullName)") -match '^\w')) {
        $invitem,$value = $line -split ':=:'
        [PSCustomObject]@{Name=$invitem;Value=$value}  # 2 min 4 sec
    }
}

I also tried using a hashtable instead of a PSCustomObject, but to my surprise it took much longer (5 minutes 26 seconds):

Foreach ($invlog in $invlogs) {
    $hash = @{}
    foreach ($line in ([System.IO.File]::ReadLines("$($invlog.FullName)") -match '^\w')) {
        $invitem,$value = $line -split ':=:'
        $hash[$invitem] = $value  # 5 min 26 sec
    }
}

What would be the fastest method to use here?

See if this is any faster:

Foreach ($invlog in $invlogs) {
    @(Get-Content $invlog.FullName -ReadCount 0) -notmatch '^#|^\s*$' |
        foreach {
            $invitem,$value = $_ -split ':=:'
            [PSCustomObject]@{Name=$invitem;Value=$value}
        }
}

The -match and -notmatch operators, when applied to an array, return all the elements that satisfy the match, so you can eliminate having to test every line individually for the lines to exclude.
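
For example (a minimal illustration of my own, using sample lines from the question):

$lines = '## System Info', 'SystemModel:=:OptiPlex GX620', ''
# Applied to an array, -notmatch returns every element that does NOT match
$lines -notmatch '^#|^\s*$'    # -> SystemModel:=:OptiPlex GX620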

Do you really want to create a PS object for every line, or just one per file?

If you want one object per file, see if this is any quicker: the multi-line regex eliminates the line array, and a filter is used in place of ForEach-Object to create the hash entries.

$regex = [regex]'(?ms)^(\w+):=:([^\r]+)'
filter make-hash { @{$_.Groups[1].Value = $_.Groups[2].Value} }

Foreach ($invlog in $invlogs) {
    $regex.Matches([IO.File]::ReadAllText($invlog.FullName)) | make-hash
}

The objective of switching to the multi-line regex and [io.file]::ReadAllText() is to simplify what PowerShell is doing with the file input internally. The result of [io.file]::ReadAllText() is a string, which is a much simpler object than the array of strings that [io.file]::ReadAllLines() produces, and requires less overhead to construct internally.

A filter is essentially just the Process block of a function: it runs once for every object that comes to it from the pipeline, so it emulates the action of ForEach-Object but runs slightly faster (I don't know the internals well enough to tell you exactly why).

Both of these changes require more coding and only produce a marginal increase in performance. In my testing, switching to the multi-line regex gained about .1 ms per file, and changing from ForEach-Object to the filter another .1 ms. You probably don't see these techniques used very often because of the low return compared to the additional coding work required, but the savings become significant once you multiply those fractions of a millisecond by 160K iterations.
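
To make the filter/function equivalence concrete, here is a minimal sketch of my own (the Get-Double names are made up for illustration); the two definitions behave identically on pipeline input:

filter Get-Double { $_ * 2 }        # the filter body IS the Process block

function Get-Double2 {
    process { $_ * 2 }              # equivalent function, written out in full
}

1..3 | Get-Double    # -> 2 4 6 (Get-Double2 gives the same output)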

Try this:

Foreach ($invlog in $invlogs) {
    $output = @{}
    foreach ($line in ([IO.File]::ReadLines("$($invlog.FullName)") -ne '')) {
        if ($line.Contains(":=:")) {
            # Split on the ':=:' delimiter; -ne '' drops any empty tokens it produces
            $item, $value = $line.Split(":=:") -ne ''
            $output[$item] = $value
        }
    }
    New-Object PSObject -Property $output
}

As a general rule, regex is sometimes cool but almost always slower than plain string operations.
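
If you want to verify that on your own data, Measure-Command makes the comparison easy (a minimal sketch of my own; the file path is hypothetical):

$lines = [IO.File]::ReadLines('C:\logs\sample.log') -ne ''    # hypothetical sample file

(Measure-Command {
    foreach ($l in $lines) { $null = $l -split ':=:' }          # regex-based split
}).TotalMilliseconds

(Measure-Command {
    foreach ($l in $lines) { $null = $l.Split(':=:') -ne '' }   # plain string split
}).TotalMilliseconds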

Wouldn't you want an object per system, and not per key-value pair? Like this. By replacing Get-Content with the .NET method you could probably save some more time (a sketch of that swap follows the example output below).

Get-ChildItem -Filter *.txt -Path <path to files> | ForEach-Object {
    $ht = @{}
    # Use $_.FullName so Get-Content works regardless of the current directory
    Get-Content $_.FullName | Where-Object { $_ -match ':=:' } | ForEach-Object {
        $ht[($_ -split ':=:')[0].Trim()] = ($_ -split ':=:')[1].Trim()
    }
    [pscustomobject]$ht
}

ChassisType                          SystemManufacturer                   SystemType                          SystemModel
-----------                          ------------------                   ----------                          -----------
6 (Mini Tower)                       Dell Inc.                            X86-based PC                        OptiPlex GX620
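
For reference, here is a minimal sketch of that Get-Content-to-.NET swap, combining the [IO.File]::ReadLines approach from the question with the per-file object built above (my own combination, untimed):

Get-ChildItem -Filter *.txt -Path <path to files> | ForEach-Object {
    $ht = @{}
    # ReadLines streams the file without cmdlet overhead; -match keeps only data lines
    foreach ($line in [IO.File]::ReadLines($_.FullName) -match ':=:') {
        $name, $value = $line -split ':=:'
        $ht[$name.Trim()] = $value.Trim()
    }
    [pscustomobject]$ht
}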
