
Get-ChildItem massive memory consumption and too slow - how to speed this up?

I want to get the total number of bytes of the 32 largest files in a folder, but it's working very slowly. We have about 10 TB of data.

Command:

$big32 = Get-ChildItem c:\temp -Recurse | Sort-Object Length -Descending | Select-Object -First 32 | Measure-Object -Property Length -Sum

$big32.sum /1gb

I can think of some improvements, especially to memory usage, but the following should be considerably faster than Get-ChildItem:

[System.IO.Directory]::EnumerateFiles('c:\temp', '*.*', [System.IO.SearchOption]::AllDirectories) | 
    Foreach-Object {
        [PSCustomObject]@{
            filename = $_
            length = [System.IO.FileInfo]::New($_).Length
        }
    } | 
    Sort-Object length -Descending | 
    Select-Object -First 32

Edit

I would look at trying to implement an implicit heap to reduce memory usage without hurting performance (it might even improve it... to be tested).
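Something along these lines would be one way to realize that idea; a minimal sketch only, assuming PowerShell 7.2+ where .NET 6's PriorityQueue type (an array-backed, i.e. implicit, binary heap) is available, and reusing the path and count from the question:

$heap = [System.Collections.Generic.PriorityQueue[long, long]]::new()

[System.IO.Directory]::EnumerateFiles('c:\temp', '*.*', [System.IO.SearchOption]::AllDirectories) |
    Foreach-Object {
        $length = [System.IO.FileInfo]::New($_).Length
        # Priority = length, so Dequeue() always removes the smallest kept length
        $heap.Enqueue($length, $length)
        if ($heap.Count -gt 32) {
            # Evict the current minimum; only 32 lengths are ever held in memory
            $null = $heap.Dequeue()
        }
    }

# Sum the surviving 32 largest lengths and convert to GB, as in the question
$sum = 0L
while ($heap.Count -gt 0) { $sum += $heap.Dequeue() }
$sum / 1GB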

Edit 2

If the filenames are not required, the easiest memory gain is to not include them in the results.

[System.IO.Directory]::EnumerateFiles('c:\temp', '*.*', [System.IO.SearchOption]::AllDirectories) | 
    Foreach-Object {
        [System.IO.FileInfo]::New($_).Length
    } | 
    Sort-Object -Descending | 
    Select-Object -First 32

The following implements improvements using only PowerShell cmdlets. Using System.IO.Directory.EnumerateFiles() as a basis, as suggested by this answer, might give another performance improvement, but you should do your own measurements to compare.

(Get-ChildItem c:\temp -Recurse -File).ForEach('Length') | 
    Sort-Object -Descending -Top 32 | 
    Measure-Object -Sum

This should reduce memory consumption considerably as it only sorts an array of numbers instead of an array of FileInfo objects. It may also be somewhat faster due to better caching (an array of numbers is stored in a contiguous, cache-friendly block of memory, whereas an array of objects only stores the references contiguously; the objects themselves can be scattered all over memory).

Note the use of .ForEach('Length') instead of just .Length because of member enumeration ambiguity.
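For illustration (assuming the recursion returns more than one file, so the result is an array):

$files = Get-ChildItem c:\temp -Recurse -File
$files.Length             # the array's own Length property wins: this is the file count
$files.ForEach('Length')  # member enumeration: one Length value per file, which is what we want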

By using the Sort-Object parameter -Top we can get rid of the Select-Object cmdlet, further reducing pipeline overhead.

Firstly, if you're going to use Get-ChildItem then you should pass the -File switch parameter so that [System.IO.DirectoryInfo] instances never enter the pipeline.

Secondly, you're not passing the -Force switch parameter to Get-ChildItem, so any hidden files in that directory structure won't be retrieved.

Thirdly, note that your code is retrieving the 32 largest files, not the files with the 32 largest lengths. That is, if files 31, 32, and 33 are all the same length, then file 33 will be arbitrarily excluded from the final count. If that distinction is important to you, you could rewrite your code like this...

$filesByLength = Get-ChildItem -File -Force -Recurse -Path 'C:\Temp\' |
    Group-Object -AsHashTable -Property Length
$big32 = $filesByLength.Keys |
    Sort-Object -Descending |
    Select-Object -First 32 |
    ForEach-Object -Process { $filesByLength[$_] } |
    Measure-Object -Property Length -Sum

$filesByLength is a [Hashtable] that maps from a length to the file(s) with that length. The Keys property contains all of the unique lengths of all of the retrieved files, so we get the 32 largest keys/lengths and use each one to send all the files of that length down the pipeline.
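As in the question's original command, the total in gigabytes then comes from the Sum property of the Measure-Object result:

$big32.Sum / 1GB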

Most importantly, sorting the retrieved files to find the largest ones is problematic for several reasons:

  • Sorting cannot start until all of the input data is available, meaning that at that point in time all 1.4 million [System.IO.FileInfo] instances will be present in memory.
    • I'm not sure how Sort-Object buffers the incoming pipeline data, but I imagine it would be some kind of list that doubles in size every time it needs more capacity, leading to further garbage in memory to be cleaned up (see the small sketch after this list).
  • Each of the 1.4 million [System.IO.FileInfo] instances will be accessed a second time to get its Length property, all the while whatever sorting manipulations are occurring too (depending on what algorithm Sort-Object uses).
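The growth pattern speculated about above can be observed directly with a generic List; a small sketch for illustration only (this is not necessarily what Sort-Object uses internally):

$list = [System.Collections.Generic.List[object]]::new()
$previousCapacity = $list.Capacity
foreach ($i in 1..10000)
{
    $list.Add($i)
    if ($list.Capacity -ne $previousCapacity)
    {
        # Each growth allocates a new, larger backing array; the old one becomes garbage
        Write-Host "Count = $($list.Count): capacity grew from $previousCapacity to $($list.Capacity)"
        $previousCapacity = $list.Capacity
    }
}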

Since we only care about a mere 32 largest files/lengths out of 1.4 million files, what if we only kept track of those 32 instead of all 1.4 million? Consider if we only wanted to find the single largest file...

$largestFileLength = 0
$largestFile = $null

foreach ($file in Get-ChildItem -File -Force -Recurse -Path 'C:\Temp\')
{
    # Track the largest length in a separate variable to avoid two comparisons...
    #     if ($largestFile -eq $null -or $file.Length -gt $largestFile.Length)
    if ($file.Length -gt $largestFileLength)
    {
        $largestFileLength = $file.Length
        $largestFile = $file
    }
}

Write-Host "The largest file is named ""$($largestFile.Name)"" and has length $largestFileLength."

As opposed to Get-ChildItem... | Sort-Object -Property Length -Descending | Select-Object -First 1, this has the advantage that only one [FileInfo] object is "in-flight" at a time and the complete set of [System.IO.FileInfo]s is enumerated only once. Now all we need to do is take the same approach but expand from 1 file/length "slot" to 32...

$basePath = 'C:\Temp\'
$lengthsToKeep = 32
$includeZeroLengthFiles = $false

$listType = 'System.Collections.Generic.List[System.IO.FileInfo]'
# A SortedDictionary[,] could be used instead to avoid having to fully enumerate the Keys
# property to find the new minimum length, but add/remove/retrieve performance is worse
$dictionaryType = "System.Collections.Generic.Dictionary[System.Int64, $listType]"

# Create a dictionary pre-sized to the maximum number of lengths to keep
$filesByLength = New-Object -TypeName $dictionaryType -ArgumentList $lengthsToKeep

# Cache the minimum length currently being kept
$minimumKeptLength = -1L

Get-ChildItem -File -Force -Recurse -Path $basePath |
    ForEach-Object -Process {
        if ($_.Length -gt 0 -or $includeZeroLengthFiles)
        {
            $list = $null
            if ($filesByLength.TryGetValue($_.Length, [ref] $list))
            {
                # The current file's length is already being kept
                # Add the current file to the existing list for this length
                $list.Add($_)
            }
            else
            {
                # The current file's length is not being kept

                if ($filesByLength.Count -lt $lengthsToKeep)
                {
                    # There are still available slots to keep more lengths

                    $list = New-Object -TypeName $listType

                    # The current file's length will occupy an empty slot of kept lengths
                }
                elseif ($_.Length -gt $minimumKeptLength)
                {
                    # There are no available slots to keep more lengths
                    # The current file's length is large enough to keep

                    # Get the list for the minimum length
                    $list = $filesByLength[$minimumKeptLength]

                    # Remove the minimum length to make room for the current length
                    $filesByLength.Remove($minimumKeptLength) |
                        Out-Null

                    # Reuse the list for the now-removed minimum length instead of allocating a new one
                    $list.Clear()

                    # The current file's length will occupy the newly-vacated slot of kept lengths
                }
                else
                {
                    # There are no available slots to keep more lengths
                    # The current file's length is too small to keep
                    return
                }
                $list.Add($_)

                $filesByLength.Add($_.Length, $list)
                $minimumKeptLength = ($filesByLength.Keys | Measure-Object -Minimum).Minimum
            }
        }
    }

# Unwrap the files in each by-length list
foreach ($list in $filesByLength.Values)
{
    foreach ($file in $list)
    {
        $file
    }
}

I went with the approach described above of retrieving the files with the 32 largest lengths. A [Dictionary[Int64, List[FileInfo]]] is used to track those 32 largest lengths and the corresponding files with each length. For each input file, we first check if its length is among the largest so far and, if so, add the file to the existing List[FileInfo] for that length. Otherwise, if there's still room in the dictionary we can unconditionally add the input file and its length, or, if the input file is at least bigger than the smallest tracked length, we can remove that smallest length and add the input file and its length in its place. Once there are no more input files we output all of the [FileInfo] objects from all of the [List[FileInfo]]s in the [Dictionary[Int64, List[FileInfo]]].
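To arrive at the figure the question asks for, the emitted [FileInfo] objects can then be measured just like in the original command; a usage sketch, assuming the code above has been saved as a script (the name Get-Largest32.ps1 is made up for illustration):

# Hypothetical script name for the code above
$big32 = .\Get-Largest32.ps1 | Measure-Object -Property Length -Sum
$big32.Sum / 1GB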

I ran this simple benchmarking template...

1..5 |
    ForEach-Object -Process {
        [GC]::Collect()

        return Measure-Command -Expression {
            # Code to test
        }
    } | Measure-Object -Property 'TotalSeconds' -Minimum -Maximum -Average

...on PowerShell 7.2 against my $Env:WinDir directory (325,000 files), with these results:

Code to test                                                                       | Minimum    | Maximum    | Average     | Memory usage*
Get-ChildItem -File -Force -Recurse -Path $Env:WinDir                              | 69.7240896 | 79.727841  | 72.81731518 | +260 MB
Get $Env:WinDir files with 32 largest lengths using -AsHashtable, Sort-Object      | 82.7488729 | 83.5245153 | 83.04068032 | +1 GB
Get $Env:WinDir files with 32 largest lengths using dictionary of by-length lists  | 81.6003697 | 82.7035483 | 82.15654538 | +235 MB

* As observed in the Task Manager's Details tab → Memory (active private working set) column

I'm a little disappointed that my solution is only about 1% faster than the code using the Keys of a [Hashtable], but perhaps grouping the files with a compiled cmdlet vs. not grouping or sorting them at all but running more (interpreted) PowerShell code is a wash. Still, the difference in memory usage is significant, though I can't explain why the Get-ChildItem call that simply enumerates all files ended up using a bit more.
