Parsing a String with quoted Fields like a CSV-line in PowerShell

I have to parse a variable input string into a string array. The input is a CSV-style, comma-separated field list in which each field is a quoted string. Because I don't want to write my own full-blown CSV parser, the only working solution I have come up with so far is this one:

# Note: $input is an automatic variable in PowerShell, so a different name is used here.
$inputString = '"Miller, Steve", "Zappa, Frank", "Johnson, Earvin ""Magic"""'

Add-Type -AssemblyName Microsoft.VisualBasic
$enc = [System.Text.Encoding]::UTF8
$bytes = $enc.GetBytes($inputString)
$stream = [System.IO.MemoryStream]::new($bytes)
$parser = [Microsoft.VisualBasic.FileIO.TextFieldParser]::new($stream)
$parser.Delimiters = ','
$parser.HasFieldsEnclosedInQuotes = $true
$list = $parser.ReadFields()

$list

The output looks like this:

Miller, Steve
Zappa, Frank
Johnson, Earvin "Magic"

Is there a better solution available for PowerShell via another .NET library? Ideally I could avoid the extra byte array and stream. I am also not sure whether this VisualBasic assembly will remain available in the long term.

Any ideas here?
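As a minimal sketch of one way to drop the byte array and stream while staying with the same parser: TextFieldParser also has a constructor that accepts a TextReader, so a StringReader can wrap the string directly. The variable name $line is only illustrative, and this does not address the question of the assembly's long-term availability.

```powershell
# Sketch: feed the string to TextFieldParser through a StringReader
# instead of a byte array plus MemoryStream.
Add-Type -AssemblyName Microsoft.VisualBasic

# $line is a hypothetical name chosen for this example.
$line = '"Miller, Steve", "Zappa, Frank", "Johnson, Earvin ""Magic"""'

$reader = [System.IO.StringReader]::new($line)
$parser = [Microsoft.VisualBasic.FileIO.TextFieldParser]::new($reader)
try {
    $parser.SetDelimiters(',')
    $parser.HasFieldsEnclosedInQuotes = $true
    $list = $parser.ReadFields()
}
finally {
    # Release the parser once the single line has been read.
    $parser.Dispose()
}

$list
```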

With some extra precautions for security and to prevent inadvertent string expansion, you can combine Invoke-Expression with Write-Output, though note that Invoke-Expression should generally be avoided:

$fieldList = '"Miller, Steve", "Zappa, Frank", "Johnson, Earvin ""Magic""", "Honey, I''m $HOME"'

# Parse into array.
$fields = (
  Invoke-Expression ("Write-Output " + ($fieldList -replace '\$', "`0"))
) -replace "`0", '$$'

Note:

  • -replace '\$', "`0" temporarily replaces literal $ characters in the input with NUL characters, to prevent accidental (or malicious) string expansion (interpolation); the second -replace operation restores the original $ characters.
    See this answer for more information about the regex-based -replace operator.

  • If and only if the input string is guaranteed to never contain embedded $ characters, the solution can be simplified to:

     $fields = Invoke-Expression "Write-Output $fieldList"

Outputting $fields yields the following:

Miller, Steve
Zappa, Frank
Johnson, Earvin "Magic"
Honey, I'm $HOME
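To illustrate why the $ placeholder step matters, here is a minimal sketch that compares the simplified call with the guarded one. The variable $price and the sample field list are made up for this example.

```powershell
# Hypothetical sample: a double-quoted field that contains a variable reference.
$price = 42
$fieldList = '"cost: $price", "Zappa, Frank"'

# Simplified call: the field is expanded when the command string is evaluated.
$naive = Invoke-Expression "Write-Output $fieldList"

# Guarded call: hide each $ behind a NUL placeholder, then restore it afterwards.
$guarded = (
  Invoke-Expression ("Write-Output " + ($fieldList -replace '\$', "`0"))
) -replace "`0", '$$'

$naive[0]    # the variable's value has been substituted into the field
$guarded[0]  # the field is returned verbatim, including the $ character
```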

Explanation and list of constraints:

The solution relies on making the input string part of a string whose content is a syntactically valid Write-Output call, with the input string serving as the latter's arguments. Invoke-Expression then evaluates this string as if its content had been submitted directly as a command, and therefore executes the Write-Output command. Based on how PowerShell parses command arguments, this implies the following constraints:

  • Supported field separators:

    • Either: ,-separated (with per-field (unquoted) leading and/or trailing whitespace getting removed, as shown above).

    • Or: whitespace-separated, using one or more whitespace characters between the fields.

  • Non-/quoting of embedded fields:

    • Fields can be quoted:

      • If single-quoted ( '...' ), field-internal ' characters must be escaped as '' .

      • If double-quoted, field-internal " characters must be escaped as either "" or `" .

    • Fields can also be unquoted:

      • However, such fields mustn't contain any PowerShell argument-mode metacharacters (of these, < > @ # are only metacharacters at the start of a token):

         <space> ' " ` , ; ( ) { } | & < > @ #
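The constraints above can be sketched with a whitespace-separated list that mixes a single-quoted, a double-quoted and an unquoted field. The sample data is invented for this example, and a single-quoted here-string is used only to avoid extra escaping in the assignment.

```powershell
# Hypothetical input: three fields separated by runs of spaces.
#  - single-quoted, with an embedded ' escaped as ''
#  - double-quoted
#  - unquoted (contains no argument-mode metacharacters)
$fieldList = @'
'O''Brien, Pat'   "Smith, Jo"   plain
'@

# Same guarded technique as above; the input contains no $ here,
# but the placeholder step is kept for consistency.
$fields = (
  Invoke-Expression ("Write-Output " + ($fieldList -replace '\$', "`0"))
) -replace "`0", '$$'

$fields
```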

Alternative, via ConvertFrom-Csv:

iRon's helpful answer shows a solution based on ConvertFrom-Csv, given that the field list embedded in the input string is comma-separated ( , ):

  • On the one hand, it is more limited in that it only supports "..."-quoting of fields and ""-escaping of field-internal " , and doesn't support fields separated only by varying amounts of whitespace.

  • On the other hand, it is more flexible, in that it supports any single-character separator between the fields (irrespective of incidental leading/trailing per-field whitespace), which can be specified via the -Delimiter parameter.

What makes the solution awkward is the need to anticipate the maximum number of embedded fields and to provide dummy headers (column names) for them ( -Header (0..99) ) in order to make ConvertFrom-Csv work, which is both fragile and potentially wasteful.

However, a simple trick can bypass this problem: submit the input string twice, in which case ConvertFrom-Csv treats the fields in the input string as both the column names and the column values of the one and only output row (object), whose values can then be queried:

$fieldList = '"Miller, Steve", "Zappa, Frank", "Johnson, Earvin ""Magic""", "Honey, I''m $HOME"'

# Creates the same array as the solution at the top.
$fields = ($fieldList, $fieldList | ConvertFrom-Csv).psobject.Properties.Value
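As a rough sketch, the same trick can be wrapped in a small helper that also passes a separator through to -Delimiter. The function name Split-QuotedFieldList and its parameter names are hypothetical, chosen only for this example. Because the fields double as column names, a list that contains the same field twice is likely to fail, since column names have to be unique.

```powershell
# Hypothetical helper built on the "submit the line twice" trick.
function Split-QuotedFieldList {
    param(
        [Parameter(Mandatory)] [string] $Line,
        [char] $Delimiter = ','
    )
    # The first copy of the line becomes the header row, the second the single data row.
    $row = $Line, $Line | ConvertFrom-Csv -Delimiter $Delimiter
    # Return the values of that one row, in column order.
    $row.psobject.Properties.Value
}

# Usage with a semicolon-separated list (sample data invented for this example).
$fields = Split-QuotedFieldList '"Miller, Steve"; "Zappa, Frank"; "Johnson, Earvin ""Magic"""' -Delimiter ';'
$fields
```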

If the number of fields is limited, you might use the parser of the ConvertFrom-Csv cmdlet, like this:

$List = '"Miller, Steve", "Zappa, Frank", "Johnson, Earvin ""Magic""", "Honey, I''m $HOME"'
($List | ConvertFrom-Csv -Header (0..99)).PSObject.Properties.Value.Where{ $Null -ne $_ }
Miller, Steve
Zappa, Frank
Johnson, Earvin "Magic"
Honey, I'm $HOME
