Get-ChildItem massive memory consumption and too slow - how to speed this up?
I want to get the total number of bytes of the 32 largest files in a folder, but it's working very slowly. We have about 10 TB of data.
Command:
$big32 = Get-ChildItem c:\temp -Recurse | Sort-Object Length -Descending | Select-Object -First 32 | Measure-Object -Property Length -Sum
$big32.sum /1gb
I can think of some improvements, especially to memory usage, but the following should be considerably faster than Get-ChildItem:
[System.IO.Directory]::EnumerateFiles('c:\temp', '*.*', [System.IO.SearchOption]::AllDirectories) |
    ForEach-Object {
        [PSCustomObject]@{
            filename = $_
            length   = [System.IO.FileInfo]::new($_).Length
        }
    } |
    Sort-Object length -Descending |
    Select-Object -First 32
Edit
I would look at trying to implement an implicit heap to reduce memory usage without hurting performance (it may possibly even improve it... to be tested).
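A minimal sketch of that idea, assuming PowerShell 7.2+ where .NET 6's PriorityQueue is available: the queue acts as a min-heap keyed on length, so at most 32 values are ever retained in memory.

```powershell
# Sketch only: sum of the 32 largest lengths via a min-heap
# (assumes PowerShell 7.2+ / .NET 6 for PriorityQueue)
$top  = 32
$heap = [System.Collections.Generic.PriorityQueue[long, long]]::new()

foreach ($path in [System.IO.Directory]::EnumerateFiles('c:\temp', '*.*', [System.IO.SearchOption]::AllDirectories)) {
    $len = [System.IO.FileInfo]::new($path).Length
    if ($heap.Count -lt $top) {
        $heap.Enqueue($len, $len)      # heap not yet full; always keep
    } elseif ($len -gt $heap.Peek()) {
        [void]$heap.Dequeue()          # evict the smallest kept length
        $heap.Enqueue($len, $len)
    }
}

$sum = 0L
while ($heap.Count -gt 0) { $sum += $heap.Dequeue() }
'{0:N2} GB' -f ($sum / 1gb)
```

This discards filenames entirely; if they are needed, the queue element type could be a `[PSCustomObject]` while the priority stays the length.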
Edit 2
If the filenames are not required, the easiest memory gain is to not include them in the results.
[System.IO.Directory]::EnumerateFiles('c:\temp', '*.*', [System.IO.SearchOption]::AllDirectories) |
    ForEach-Object {
        [System.IO.FileInfo]::new($_).Length
    } |
    Sort-Object -Descending |
    Select-Object -First 32
The following implements improvements using only PowerShell cmdlets. Using System.IO.Directory.EnumerateFiles() as a basis, as suggested by this answer, might give another performance improvement, but you should do your own measurements to compare.
(Get-ChildItem c:\temp -Recurse -File).ForEach('Length') |
    Sort-Object -Descending -Top 32 |
    Measure-Object -Sum
This should reduce memory consumption considerably as it only sorts an array of numbers instead of an array of FileInfo objects. Maybe it's also somewhat faster due to better caching (an array of numbers is stored in a contiguous, cache-friendly block of memory, whereas an array of objects only stores the references contiguously, while the objects themselves can be scattered all around in memory).
Note the use of .ForEach('Length') instead of just .Length because of member enumeration ambiguity.
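To illustrate the ambiguity (using an arbitrary example directory): an array itself has a Length property, so member enumeration never kicks in for that name and .Length returns the element count instead of per-file sizes.

```powershell
# Member-enumeration pitfall with a property name the array also has
$files = Get-ChildItem C:\Windows -File
$files.Length             # the array's own Length, i.e. the number of files
$files.ForEach('Length')  # one Length value per file, as intended
```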
By using the Sort-Object parameter -Top we can get rid of the Select-Object cmdlet, further reducing pipeline overhead.
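A quick way to see -Top in action (the parameter was added in PowerShell 6, so it is not available in Windows PowerShell 5.1):

```powershell
# -Top n returns only the n extreme values from the sort
1..100 | Sort-Object -Descending -Top 3    # 100, 99, 98
```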
Firstly, if you're going to use Get-ChildItem then you should pass the -File switch parameter so that [System.IO.DirectoryInfo] instances never enter the pipeline.
Secondly, you're not passing the -Force switch parameter to Get-ChildItem, so any hidden files in that directory structure won't be retrieved.
Thirdly, note that your code is retrieving the 32 largest files, not the files with the 32 largest lengths. That is, if files 31, 32, and 33 are all the same length, then file 33 will be arbitrarily excluded from the final count. If that distinction is important to you, you could rewrite your code like this...
$filesByLength = Get-ChildItem -File -Force -Recurse -Path 'C:\Temp\' |
    Group-Object -AsHashTable -Property Length
$big32 = $filesByLength.Keys |
    Sort-Object -Descending |
    Select-Object -First 32 |
    ForEach-Object -Process { $filesByLength[$_] } |
    Measure-Object -Property Length -Sum
$filesByLength is a [Hashtable] that maps from a length to the file(s) with that length. The Keys property contains all of the unique lengths of all of the retrieved files, so we get the 32 largest keys/lengths and use each one to send all the files of that length down the pipeline.
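As a small illustration of that mapping (1024 is an arbitrary example length, and note the keys are Int64, so an unadorned Int32 literal would not match a key on lookup):

```powershell
$filesByLength = Get-ChildItem -File -Force -Recurse -Path 'C:\Temp\' |
    Group-Object -AsHashTable -Property Length
$filesByLength.Keys.Count         # number of distinct file lengths
$filesByLength[[long]1024]        # every file whose Length is exactly 1024 bytes, if any
```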
Most importantly, sorting the retrieved files to find the largest ones is problematic for several reasons:

- All of the [System.IO.FileInfo] instances will be present in memory.
- I don't know how Sort-Object buffers the incoming pipeline data, but I imagine it would be some kind of list that doubles in size every time it needs more capacity, leading to further garbage in memory to be cleaned up.
- Each of the [System.IO.FileInfo] instances will be accessed a second time to get its Length property, all the while whatever sorting manipulations (depending on what algorithm Sort-Object uses) are occurring, too.

Since we only care about a mere 32 largest files/lengths out of 1.4 million files, what if we only kept track of those 32 instead of all 1.4 million? Consider if we only wanted to find the single largest file...
$largestFileLength = 0
$largestFile = $null

foreach ($file in Get-ChildItem -File -Force -Recurse -Path 'C:\Temp\')
{
    # Track the largest length in a separate variable to avoid two comparisons...
    # if ($largestFile -eq $null -or $file.Length -gt $largestFile.Length)
    if ($file.Length -gt $largestFileLength)
    {
        $largestFileLength = $file.Length
        $largestFile = $file
    }
}

Write-Host "The largest file is named ""$($largestFile.Name)"" and has length $largestFileLength."
As opposed to

Get-ChildItem... | Sort-Object -Property Length -Descending | Select-Object -First 1

this has the advantage of only one [FileInfo] object being "in-flight" at a time and the complete set of [System.IO.FileInfo]s being enumerated only once. Now all we need to do is take the same approach but expand from 1 file/length "slot" to 32...
$basePath = 'C:\Temp\'
$lengthsToKeep = 32
$includeZeroLengthFiles = $false
$listType = 'System.Collections.Generic.List[System.IO.FileInfo]'
# A SortedDictionary[,] could be used instead to avoid having to fully enumerate the Keys
# property to find the new minimum length, but add/remove/retrieve performance is worse
$dictionaryType = "System.Collections.Generic.Dictionary[System.Int64, $listType]"

# Create a dictionary pre-sized to the maximum number of lengths to keep
$filesByLength = New-Object -TypeName $dictionaryType -ArgumentList $lengthsToKeep
# Cache the minimum length currently being kept
$minimumKeptLength = -1L

Get-ChildItem -File -Force -Recurse -Path $basePath |
    ForEach-Object -Process {
        if ($_.Length -gt 0 -or $includeZeroLengthFiles)
        {
            $list = $null
            if ($filesByLength.TryGetValue($_.Length, [ref] $list))
            {
                # The current file's length is already being kept
                # Add the current file to the existing list for this length
                $list.Add($_)
            }
            else
            {
                # The current file's length is not being kept
                if ($filesByLength.Count -lt $lengthsToKeep)
                {
                    # There are still available slots to keep more lengths
                    $list = New-Object -TypeName $listType
                    # The current file's length will occupy an empty slot of kept lengths
                }
                elseif ($_.Length -gt $minimumKeptLength)
                {
                    # There are no available slots to keep more lengths
                    # The current file's length is large enough to keep
                    # Get the list for the minimum length
                    $list = $filesByLength[$minimumKeptLength]
                    # Remove the minimum length to make room for the current length
                    $filesByLength.Remove($minimumKeptLength) | Out-Null
                    # Reuse the list for the now-removed minimum length instead of allocating a new one
                    $list.Clear()
                    # The current file's length will occupy the newly-vacated slot of kept lengths
                }
                else
                {
                    # There are no available slots to keep more lengths
                    # The current file's length is too small to keep
                    return
                }

                $list.Add($_)
                $filesByLength.Add($_.Length, $list)
                $minimumKeptLength = ($filesByLength.Keys | Measure-Object -Minimum).Minimum
            }
        }
    }

# Unwrap the files in each by-length list
foreach ($list in $filesByLength.Values)
{
    foreach ($file in $list)
    {
        $file
    }
}
I went with the approach, described above, of retrieving the files with the 32 largest lengths. A [Dictionary[Int64, List[FileInfo]]] is used to track those 32 largest lengths and the corresponding files with each length. For each input file, we first check if its length is among the largest so far and, if so, add the file to the existing List[FileInfo] for that length. Otherwise, if there's still room in the dictionary we can unconditionally add the input file and its length, or if the input file is at least bigger than the smallest tracked length we can remove that smallest length and add in its place the input file and its length. Once there are no more input files we output all of the [FileInfo] objects from all of the List[FileInfo]s in the dictionary.
I ran this simple benchmarking template...
1..5 |
    ForEach-Object -Process {
        [GC]::Collect()
        return Measure-Command -Expression {
            # Code to test
        }
    } |
    Measure-Object -Property 'TotalSeconds' -Minimum -Maximum -Average
...on PowerShell 7.2 against my $Env:WinDir directory (325,000 files) with these results:
| Code to test | Minimum (s) | Maximum (s) | Average (s) | Memory usage* |
|---|---|---|---|---|
| Get-ChildItem -File -Force -Recurse -Path $Env:WinDir | 69.7240896 | 79.727841 | 72.81731518 | +260 MB |
| Get $Env:WinDir files with 32 largest lengths using -AsHashtable, Sort-Object | 82.7488729 | 83.5245153 | 83.04068032 | +1 GB |
| Get $Env:WinDir files with 32 largest lengths using dictionary of by-length lists | 81.6003697 | 82.7035483 | 82.15654538 | +235 MB |

\* As observed in the Task Manager → Details tab → Memory (active private working set) column
I'm a little disappointed that my solution is only about 1% faster than the code using the Keys of a [Hashtable], but perhaps grouping the files using a compiled cmdlet vs. not grouping or sorting them but with more (interpreted) PowerShell code is a wash. Still, the difference in memory usage is significant, though I can't explain why the Get-ChildItem call that simply enumerates all files ended up using a bit more.
Statement: The technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.