[英]What is the good way to read data from CSV and converting them to JSON?
我正在嘗試使用 PowerShell 從具有 2200000 條記錄的 CSV 文件中讀取數據,並將每條記錄存儲在 JSON 文件中,但這需要將近 12 個小時。
樣本 CSV 數據:
我們只關心第一列的值。
代碼:
function Read-IPData
{
$dbFilePath = Get-ChildItem -Path $rootDir -Filter "IP2*.CSV" | ForEach-Object{ $_.FullName }
Write-Host "file path - $dbFilePath"
Write-Host "Reading..."
$data = Get-Content -Path $dbFilePath | Select-Object -Skip 1
Write-Host "Reading data finished"
$count = $data.Count
Write-host "Total $count records found"
return $data
}
function Convert-NumbetToIP
{
param(
[Parameter(Mandatory=$true)][string]$number
)
try
{
$w = [int64]($number/16777216)%256
$x = [int64]($number/65536)%256
$y = [int64]($number/256)%256
$z = [int64]$number%256
$ipAddress = "$w.$x.$y.$z"
Write-Host "IP Address - $ipAddress"
return $ipAddress
}
catch
{
Write-Host "$_"
continue
}
}
Write-Host "Getting IP Addresses from $dbFileName"
$data = Read-IPData
Write-Host "Checking whether output.json file exist, if not create"
$outputFile = Join-Path -Path $rootDir -ChildPath "output.json"
if(!(Test-Path $outputFile))
{
Write-Host "$outputFile doestnot exist, creating..."
New-Item -Path $outputFile -type "file"
}
foreach($item in $data)
{
$row = $item -split ","
$ipNumber = $row[0].trim('"')
Write-Host "Converting $ipNumber to ipaddress"
$toIpAddress = Convert-NumbetToIP -number $ipNumber
Write-Host "Preparing document JSON"
$object = [PSCustomObject]@{
"ip-address" = $toIpAddress
"is-vpn" = "true"
"@timestamp" = (Get-Date).ToString("o")
}
$document = $object | ConvertTo-Json -Compress -Depth 100
Write-Host "Adding document - $document"
Add-Content -Path $outputFile $document
}
能否請您幫助優化代碼或有更好的方法來做到這一點。 或者有沒有像多線程這樣的方法。
這是一個可能的優化:
function Get-IPDataPath
{
$dbFilePath = Get-ChildItem -Path $rootDir -Filter "IP2*.CSV" | ForEach-Object FullName | Select-Object -First 1
Write-Host "file path - $dbFilePath"
$dbFilePath # implicit output
}
function Convert-NumberToIP
{
param(
[Parameter(Mandatory=$true)][string]$number
)
[Int64] $numberInt = 0
if( [Int64]::TryParse( $number, [ref] $numberInt ) ) {
if( ($numberInt -ge 0) -and ($numberInt -le 0xFFFFFFFFl) ) {
# Convert to IP address like '192.168.23.42'
([IPAddress] $numberInt).ToString()
}
}
# In case TryParse() returns $false or the number is out of range for an IPv4 address,
# the output of this function will be empty, which converts to $false in a boolean context.
}
$dbFilePath = Get-IPDataPath
$outputFile = Join-Path -Path $rootDir -ChildPath "output.json"
Write-Host "Converting CSV file $dbFilePath to $outputFile"
$object = [PSCustomObject]@{
'ip-address' = ''
'is-vpn' = 'true'
'@timestamp' = ''
}
# Enclose foreach loop in a script block to be able to pipe its output to Set-Content
& {
foreach( $item in [Linq.Enumerable]::Skip( [IO.File]::ReadLines( $dbFilePath ), 1 ) )
{
$row = $item -split ','
$ipNumber = $row[0].trim('"')
if( $ip = Convert-NumberToIP -number $ipNumber )
{
$object.'ip-address' = $ip
$object.'@timestamp' = (Get-Date).ToString('o')
# Implicit output
$object | ConvertTo-Json -Compress -Depth 100
}
}
} | Set-Content -Path $outputFile
提高性能的備注:
Get-Content
,尤其是對於逐行處理,它往往很慢。 一個更快的替代方法是File.ReadLines
方法。 要跳過 header 行,請使用Linq.Enumerable.Skip()
方法。foreach
循環中使用ReadLines
進行惰性枚舉,即每次循環迭代只讀取一行。 這是有效的,因為它返回一個枚舉器而不是行的集合。try
和catch
,因為“異常”代碼路徑非常慢。 而是使用Int64.TryParse()
返回boolean
表示轉換成功。IPAddress
地址 class,它有一個采用 integer 數字的構造函數。 使用其方法.GetAddressBytes()
獲取按網絡(大端)順序排列的字節數組。 最后使用 PowerShell -join
運算符創建預期格式的字符串。[pscustomobject]
,這會產生一些開銷。 在循環之前創建一次,在循環內部只分配值。Write-Host
(或任何 output 到控制台)。與性能無關:
New-Item
調用,這不是必需的,因為如果文件不存在, Set-Content
會自動創建該文件。[ ]
中並在每行,
插入一個逗號。修改處理循環以寫入常規 JSON 文件而不是 NDJSON 文件:
& {
'[' # begin array
$first = $true
foreach( $item in [Linq.Enumerable]::Skip( [IO.File]::ReadLines( $dbFilePath ), 1 ) )
{
$row = $item -split ','
$ipNumber = $row[0].trim('"')
if( $ip = Convert-NumberToIP -number $ipNumber )
{
$object.'ip-address' = $ip
$object.'@timestamp' = (Get-Date).ToString('o')
$row = $object | ConvertTo-Json -Compress -Depth 100
# write array element delimiter if necessary
if( $first ) { $row; $first = $false } else { ",$row" }
}
}
']' # end array
} | Set-Content -Path $outputFile
您可以像下面這樣優化 function Convert-NumberToIP:
function Convert-NumberToIP {
param(
[Parameter(Mandatory=$true)][uint32]$number
)
# either do the math yourself like this:
# $w = ($number -shr 24) -band 255
# $x = ($number -shr 16) -band 255
# $y = ($number -shr 8) -band 255
# $z = $number -band 255
# '{0}.{1}.{2}.{3}' -f $w, $x, $y, $z # output the dotted IP string
# or use .Net:
$n = ([IPAddress]$number).GetAddressBytes()
[array]::Reverse($n)
([IPAddress]$n).IPAddressToString
}
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.