How can I make PowerShell parse XML faster, or otherwise optimize my script?

Question

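I have a set of 7 million XML files, ranging in size from a few KB to several MB. In total that is about 180 GB of XML files. The job I need to do is analyze each XML file and determine whether it contains the string <ref>; if it does not, the file should be moved out of the Chunk folder it is currently in and into the Referenceless folder.

The script I created works, but it is far too slow for my purposes. At roughly 3 files per second, it is on course to finish analyzing all 7 million files in about 24 days. Is there anything I can change in my script to get better performance?

To complicate matters, I do not have the permissions needed to run .PS1 files on my server, so the script has to be runnable from PowerShell as a single command. I would set the permissions myself if I had the authorization to do so.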

# This script will iterate through the Chunk folders, removing pages that contain no
# references and putting them into the Referenceless folder.

# Change this variable to start the program on a different chunk. This is the first
# command to be run in Windows PowerShell.
$chunknumber = 1
#This while loop is the second command to be run in Windows PowerShell. It will stop after completing Chunk 113.
while($chunknumber -le 113){
    #Jumps the terminal to the correct folder.
    cd C:\Wiki_Pages
    #Creates an index for the chunk being worked on.
    $items = Get-ChildItem -Path "Chunk_$chunknumber"
    echo "Chunk $chunknumber Indexed"
    #Jumps to chunk folder.
    cd C:\Wiki_Pages\Chunk_$chunknumber
    #Loops through the index. Each entry is one of the pages.
    foreach ($page in $items){
        #Creates a variable holding the page's content.
        $content = Get-Content $page
        #If the page has a reference, then it's echoed.
        if($content | Select-String "<ref>" -quiet){echo "Referenced!"}
        #if the page doesn't have a reference, it's copied to Referenceless then deleted.
        else{
            Copy-Item $page C:\Wiki_Pages\Referenceless -force
            Remove-Item $page -force
            echo "Moved to Referenceless!"
        }
    }
    #The chunk number is increased by one and the cycle continues.
    $chunknumber = $chunknumber + 1
}

I know very little about PowerShell; yesterday was the first time I had ever opened the program.

Answer 1

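You will want to add the -ReadCount 0 argument to your Get-Content commands to speed them up (it helps tremendously). I learned this tip from this great article, which shows that running a foreach over a whole file's contents is faster than trying to parse it through a pipeline.

Also, you can use Set-ExecutionPolicy Bypass -Scope Process to run scripts in your current PowerShell session without needing extra permissions!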
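A minimal sketch of the -ReadCount 0 suggestion, assuming the only goal is to decide whether a file contains the literal text <ref>. The function name Test-HasRef and its parameter are hypothetical, introduced here for illustration. With -ReadCount 0, Get-Content returns all lines in one array, and the foreach statement then walks that array without a pipeline, stopping at the first hit.

```powershell
# Hypothetical helper: returns $true if any line of the file contains "<ref>".
function Test-HasRef {
    param([string]$Path)

    # -ReadCount 0 sends every line down in a single array instead of one by one.
    $lines = Get-Content -LiteralPath $Path -ReadCount 0

    foreach ($line in $lines) {
        # -like is case-insensitive, like Select-String in the original script.
        if ($line -like '*<ref>*') {
            return $true   # stop at the first match
        }
    }
    return $false
}
```

Because this is a function definition rather than a .PS1 file, it can be pasted straight into the console and then called from the existing loop, for example as Test-HasRef $page.FullName.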

Answer 2

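The PowerShell pipeline can be markedly slower than native system calls.

PowerShell: pipeline performance

In this article, a performance test is run between two equivalent commands, one executed in PowerShell and one in the classic Windows command prompt.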

PS> grep [0-9] numbers.txt | wc -l > $null
CMD> cmd /c "grep [0-9] numbers.txt | wc -l > nul"

Here is a sample of its output.

PS C:\temp> 1..5 | % { .\perf.ps1 ([Math]::Pow(10, $_)) }

10 iterations

   30 ms  (   0 lines / ms)  grep in PS
   15 ms  (   1 lines / ms)  grep in cmd.exe

100 iterations

   28 ms  (   4 lines / ms)  grep in PS
   12 ms  (   8 lines / ms)  grep in cmd.exe

1000 iterations

  147 ms  (   7 lines / ms)  grep in PS
   11 ms  (  89 lines / ms)  grep in cmd.exe

10000 iterations

 1347 ms  (   7 lines / ms)  grep in PS
   13 ms  ( 786 lines / ms)  grep in cmd.exe

100000 iterations

13410 ms  (   7 lines / ms)  grep in PS
   22 ms  (4580 lines / ms)  grep in cmd.exe

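Edit: The original answer to this question mentioned pipeline performance along with some other suggestions. To keep this post succinct, I have removed the other suggestions that did not actually have anything to do with pipeline performance.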
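A rough way to check the pipeline overhead on your own machine is sketched below. It does not reproduce the article's grep test; it only compares a pipeline with a plain foreach statement inside PowerShell itself. The function names Get-MatchCountPipeline and Get-MatchCountLoop are hypothetical, and the timings will vary with the PowerShell version and hardware, so no expected numbers are given.

```powershell
# Counts matching items by streaming them through the pipeline.
function Get-MatchCountPipeline {
    param([object[]]$Items, [string]$Pattern)
    return @($Items | Where-Object { $_ -match $Pattern }).Count
}

# Counts matching items with a plain foreach statement (no pipeline).
function Get-MatchCountLoop {
    param([object[]]$Items, [string]$Pattern)
    $count = 0
    foreach ($item in $Items) {
        if ($item -match $Pattern) { $count++ }
    }
    return $count
}

$numbers = 1..100000

# Measure-Command discards the output and returns the elapsed time as a TimeSpan.
$viaPipeline = Measure-Command { Get-MatchCountPipeline -Items $numbers -Pattern '[0-9]' }
$viaLoop     = Measure-Command { Get-MatchCountLoop -Items $numbers -Pattern '[0-9]' }

"pipeline: {0:N0} ms" -f $viaPipeline.TotalMilliseconds
"foreach:  {0:N0} ms" -f $viaLoop.TotalMilliseconds
```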

Answer 3

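Before you start optimizing, you need to determine exactly where the optimization is needed. Are you I/O bound (how long does it take to read each file)? Memory bound (probably not)? CPU bound (the time spent searching the content)?

You say these are XML files; have you tested reading each file into an XML object (instead of plain text) and locating the <ref> node via XPath? You would then have:

$content = [xml](Get-Content $page)
#If the page has a reference, then it's echoed.
#SelectSingleNode returns $null when nothing matches, so no -quiet switch is needed.
if($content.SelectSingleNode("//ref")){echo "Referenced!"}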

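If you have CPU, memory and I/O resources to spare, you may see some improvement by searching multiple files in parallel. See this discussion of running several jobs in parallel. Obviously you cannot run a large number at once, but with some testing you can find the sweet spot (probably somewhere around 3-5). Everything inside foreach ($page in $items){ would become the script block for the job.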
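One way the job idea could look is sketched below, assuming background jobs started with Start-Job and a fixed number of batches. The function name Invoke-ParallelRefScan and its parameters are hypothetical. Each Start-Job launches a separate PowerShell process, which is costly, so this sketch hands every job a large slice of files rather than starting one job per file.

```powershell
# Hypothetical helper: splits $Paths into $JobCount slices, scans each slice in a
# background job, and outputs the paths of the files that were moved.
function Invoke-ParallelRefScan {
    param(
        [string[]]$Paths,
        [string]$Destination,
        [int]$JobCount = 4
    )

    # Body of each job: the work that used to sit inside foreach ($page in $items).
    $scanBatch = {
        param([string[]]$BatchPaths, [string]$Target)
        foreach ($path in $BatchPaths) {
            $hasRef = $false
            foreach ($line in (Get-Content -LiteralPath $path -ReadCount 0)) {
                if ($line -like '*<ref>*') { $hasRef = $true; break }
            }
            if (-not $hasRef) {
                Move-Item -LiteralPath $path -Destination $Target -Force
                $path   # report the moved file as job output
            }
        }
    }

    # Slice size, rounded up so that every file lands in some slice.
    $size = [int][Math]::Ceiling($Paths.Count / $JobCount)

    $jobs = @(
        for ($start = 0; $start -lt $Paths.Count; $start += $size) {
            $end = [Math]::Min($start + $size, $Paths.Count) - 1
            $batch = @($Paths[$start..$end])
            # "$batch, $Destination" passes the whole slice as the first argument.
            Start-Job -ScriptBlock $scanBatch -ArgumentList $batch, $Destination
        }
    )

    if ($jobs.Count -eq 0) { return }

    # Block until all jobs finish, collect their output, then clean up.
    $jobs | Wait-Job | Receive-Job
    $jobs | Remove-Job
}
```

Full paths should be passed in, since a job does not necessarily start in the caller's current folder. Whether this helps at all depends on the bottleneck: if the disk is already saturated, more jobs may only add contention.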

Answer 4

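I would try parsing 5 files at a time using the Start-Job cmdlet. There are many excellent articles on PowerShell jobs. If for some reason that does not help, and you are hitting an I/O or actual resource bottleneck, you could even use Start-Job and WinRM to spin up workers on other machines.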

Answer 5

Loading the XML into a variable with the Load method is also much faster than reading it with Get-Content.

Measure-Command {
    $xml = [xml]''
    $xml.Load($xmlFilePath)
}

Measure-Command {
    [xml]$xml = Get-Content $xmlFilePath -ReadCount 0
}

In my measurements it was at least 4 times faster.
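Combining Load with the XPath check from Answer 3 might look like the sketch below. The function name Test-XmlHasRefNode is hypothetical. It assumes that <ref> appears as a real element in no XML namespace; if the files carry it as escaped text inside another element, or use a default namespace, //ref will not match and a plain text search is the safer test.

```powershell
# Hypothetical helper: returns $true if the XML file contains a <ref> element.
function Test-XmlHasRefNode {
    param([string]$Path)

    # XmlDocument.Load resolves relative paths against the process working
    # directory, not the current PowerShell location, so pass it a full path.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    $xml = New-Object System.Xml.XmlDocument
    $xml.Load($fullPath)

    # SelectSingleNode returns $null when no node matches the XPath expression.
    return ($null -ne $xml.SelectSingleNode('//ref'))
}
```

Load throws on a file that is not well-formed XML, so a real run over millions of files would likely need a try/catch around the call.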
