Changing the Delimiter in a Large CSV File Using PowerShell

I need a way to change the delimiter in a CSV file from a comma to a pipe. Because of the size of the CSV files (~750 MB to several GB), using Import-Csv and/or Get-Content is not an option. What I'm using (and what works, albeit slowly) is the following code:

# TextFieldParser lives in the Microsoft.VisualBasic assembly, which must be loaded first
Add-Type -AssemblyName Microsoft.VisualBasic

$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$reader.SetDelimiters(",")

While(!$reader.EndOfData)
{   
    $line = $reader.ReadFields()
    $details = [ordered]@{
                            "Plugin ID" = $line[0]
                            CVE = $line[1]
                            CVSS = $line[2]
                            Risk = $line[3]     
                         }                        
    $export = New-Object PSObject -Property $details
    # Appending one row at a time reopens the output file on every iteration - this is the bottleneck
    $export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"
}

This little loop took nearly 2 minutes to process a 20 MB file. Scaling up at that speed would mean over an hour for the smallest CSV file I'm currently working with.

I've tried this as well:

# $export is assumed to be initialized beforehand as an ArrayList:
# $export = New-Object System.Collections.ArrayList

While(!$reader.EndOfData)
{   
    $line = $reader.ReadFields()  

    $details = [ordered]@{
                             # Same data as before
                         }

    $export.Add($details) | Out-Null        
}

$export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"

This is MUCH FASTER but doesn't provide the right information in the new CSV. Instead I get rows and rows of this:

"Count"|"IsReadOnly"|"Keys"|"Values"|"IsFixedSize"|"SyncRoot"|"IsSynchronized"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"

So, two questions:

1) Can the first block of code be made faster?
2) How can I unwrap the ArrayList in the second example to get to the actual data?
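For reference, a minimal sketch of one way to address both points (assuming the same four columns and the $reader set up above): casting each ordered hashtable to [pscustomobject] gives Export-Csv real column properties instead of the dictionary's own properties, and streaming the rows through the pipeline lets Export-Csv open the output file once rather than once per row.

& {
    While(!$reader.EndOfData)
    {
        $line = $reader.ReadFields()
        # [pscustomobject] exposes the CSV columns as properties, which is what Export-Csv serializes
        [pscustomobject]@{
            "Plugin ID" = $line[0]
            CVE         = $line[1]
            CVSS        = $line[2]
            Risk        = $line[3]
        }
    }
} | Export-Csv -Delimiter "|" -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"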

EDIT: Sample data found here - http://pastebin.com/6L98jGNg

This is simple text processing, so the bottleneck should be disk read speed: 1 second per 100 MB, or 10 seconds per 1 GB, for the OP's sample (repeated to the mentioned size) as measured here on an i7. The results would be worse for files with many/all small quoted fields.

The algorithm is simple:

  1. Read the file in big string chunks, e.g. 1 MB.
    This is much faster than reading millions of CR/LF-separated lines because:
    • fewer checks are performed, as we mostly look only for doublequotes;
    • fewer iterations of our code are executed by the interpreter, which is slow.
  2. Find the next doublequote.
  3. Depending on the current $inQuotedField flag, decide whether the found doublequote starts a quoted field (it should be preceded by the delimiter plus, optionally, some spaces) or ends the current quoted field (it should be followed by any even number of doublequotes, then optionally spaces, then the delimiter).
  4. Replace the delimiters in the preceding span, or to the end of the 1 MB chunk if no quotes were found.

The code makes some reasonable assumptions, but it may fail to detect an escaped field if its doublequote is followed or preceded by more than 3 spaces before/after the field delimiter. The checks wouldn't be too hard to add, and I might've missed some other edge case, but I'm not that interested.

$sourcePath = 'c:\path\file.csv'
$targetPath = 'd:\path\file2.csv'
$targetEncoding = [Text.UTF8Encoding]::new($false) # no BOM

$delim = [char]','
$newDelim = [char]'|'

$buf = [char[]]::new(1MB)
$sourceBase = [IO.FileStream]::new(
    $sourcePath,
    [IO.FileMode]::Open,
    [IO.FileAccess]::Read,
    [IO.FileShare]::Read,
    $buf.length,  # let OS prefetch the next chunk in background
    [IO.FileOptions]::SequentialScan)
$source = [IO.StreamReader]::new($sourceBase, $true) # autodetect encoding
$target = [IO.StreamWriter]::new($targetPath, $false, $targetEncoding, $buf.length)

$bufStart = 0
$bufPadding = 4
$inQuotedField = $false
$fieldBreak = [char[]]@($delim, "`r", "`n")
$out = [Text.StringBuilder]::new($buf.length)

while ($nRead = $source.Read($buf, $bufStart, $buf.length-$bufStart)) {
    $s = [string]::new($buf, 0, $nRead+$bufStart)
    $len = $s.length
    $pos = 0
    $out.Clear() >$null

    do {
        $iQuote = $s.IndexOf([char]'"', $pos)
        if ($inQuotedField) {
            $iDelim = if ($iQuote -ge 0) { $s.IndexOf($delim, $iQuote+1) }
            if ($iDelim -eq -1 -or $iQuote -le 0 -or $iQuote -ge $len - $bufPadding) {
                # no closing quote in buffer safezone
                $out.Append($s.Substring($pos, $len-$bufPadding-$pos)) >$null
                break
            }
            if ($s.Substring($iQuote, $iDelim-$iQuote+1) -match "^(""+)\s*$delim`$") {
                # even number of quotes are just quoted quotes
                $inQuotedField = $matches[1].length % 2 -eq 0
            }
            $out.Append($s.Substring($pos, $iDelim-$pos+1)) >$null
            $pos = $iDelim + 1
            continue
        }
        if ($iQuote -ge 0) {
            $iDelim = $s.LastIndexOfAny($fieldBreak, $iQuote)
            if (!$s.Substring($iDelim+1, $iQuote-$iDelim-1).Trim()) {
                $inQuotedField = $true
            }
            $replaced = $s.Substring($pos, $iQuote-$pos+1).Replace($delim, $newDelim)
        } elseif ($pos -gt 0) {
            $replaced = $s.Substring($pos).Replace($delim, $newDelim)
        } else {
            $replaced = $s.Replace($delim, $newDelim)
        }
        $out.Append($replaced) >$null
        $pos = $iQuote + 1
    } while ($iQuote -ge 0)

    $target.Write($out)

    $bufStart = 0
    for ($i = $out.length; $i -lt $s.length; $i++) {
        $buf[$bufStart++] = $buf[$i]
    }
}
if ($bufStart) { $target.Write($buf, 0, $bufStart) }
$source.Close()
$target.Close()
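After the conversion, a quick sanity check (a hypothetical spot check, not part of the original answer) is to compare the first few rows of both files and confirm that only the delimiter changed:

Get-Content $sourcePath -TotalCount 3
Get-Content $targetPath -TotalCount 3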

Still not what I would call fast, but this is considerably faster than what you have listed, by using the -join operator:

# TextFieldParser again requires the Microsoft.VisualBasic assembly
Add-Type -AssemblyName Microsoft.VisualBasic

$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$reader.SetDelimiters(",")

While(!$reader.EndOfData){
    $line = $reader.ReadFields()
    $line -join '|' | Add-Content C:\Temp\TestOutput.csv
}

That took a hair under 32 seconds to process a 20 MB file. At that rate, your 750 MB file would be done in under 20 minutes, and bigger files should take about 26 minutes per gigabyte.
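For reference, a minimal sketch of how such a timing could be reproduced with the standard Measure-Command cmdlet (the paths are placeholders):

Measure-Command {
    Add-Type -AssemblyName Microsoft.VisualBasic
    $reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser "C:\Temp\TestInput.csv"
    $reader.SetDelimiters(",")
    While(!$reader.EndOfData){
        $reader.ReadFields() -join '|' | Add-Content C:\Temp\TestOutput.csv
    }
} | Select-Object TotalSeconds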
