
How can I substring a specific number of bytes from a string in PowerShell?

I have a scenario where I need to obtain an installer that is base64-encoded and embedded within a JSON REST response. Because the JSON string is rather large (180 MB), decoding the REST response with standard PowerShell tooling is a problem: it often throws OutOfMemoryException in limited-memory scenarios (such as hitting WinRM memory quotas).

It's not desirable to raise the memory quota in our environment for a single installation, and we don't have standard tooling to prepare a package whose payload does not exist at a simple HTTP endpoint (I don't have direct permissions to publish packages outside of our build system). My solution in this case is to decode the base64 string in chunks. However, while I have this working, I am stuck on one last bit of optimization for this process.


Currently I am using a MemoryStream to read from the string, but I need to provide a byte[]:

# $Base64String is a [ref] type
$memStream = [IO.MemoryStream]::new([Text.Encoding]::UTF8.GetBytes($Base64String.Value))

This unsurprisingly results in copying the byte[] representation of the entire base64-encoded string, and is even less memory-efficient than built-in tooling in its current form. The code you don't see here reads from $memStream in chunks of 1024 bytes at a time, decoding the base64 string and writing the bytes to disk using BinaryWriter. This all works well, if slowly, since I'm forcing garbage collection fairly often. However, I want to extend this byte-counting to the initial MemoryStream and only read n bytes from the string at a time. My understanding is that base64 strings must be decoded in chunks of bytes divisible by 4.
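To make the multiple-of-4 constraint concrete, here is a minimal sketch (not the code described above) that decodes a small base64 string chunk by chunk. It assumes the payload uses the standard base64 alphabet with no embedded whitespace or line breaks. Under that assumption every base64 character is ASCII and occupies exactly one byte in UTF-8, so the chunk size here is expressed in characters. The names $payload, $chunkChars and $outStream are hypothetical, and a MemoryStream stands in for the file written to disk.

```powershell
# Hypothetical stand-in for the large payload: 10 bytes encoded as base64.
$payload = [byte[]](0..9)
$base64  = [Convert]::ToBase64String($payload)

# Chunk size in characters; it must be a multiple of 4 so that every chunk
# is a valid base64 string on its own. A real run would use a larger value.
$chunkChars = 8

# Stand-in for the output file stream.
$outStream = [IO.MemoryStream]::new()

for ($offset = 0; $offset -lt $base64.Length; $offset += $chunkChars) {
  # The last chunk may be shorter, but a padded base64 string always has a
  # length that is a multiple of 4, so the remainder is still decodable.
  $count = [Math]::Min($chunkChars, $base64.Length - $offset)
  $bytes = [Convert]::FromBase64String($base64.Substring($offset, $count))
  $outStream.Write($bytes, 0, $bytes.Length)
}

$decoded = $outStream.ToArray()
```

Each Substring call allocates only one chunk-sized string, so the full payload is never copied into a second buffer. Whether that is enough to stay under a given memory quota is not something this sketch can confirm.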

The problem is that [string].Substring([int], [int]) works on string length (a character count), not on the number of bytes per character. The JSON response can be assumed to be UTF-8 encoded, but even with this assumption UTF-8 characters vary between 1 and 4 bytes in length. How can I (directly or indirectly) substring a specific number of bytes in PowerShell, so that I can create the MemoryStream from this substring instead of the full $Base64String?
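The mismatch between character counts and byte counts can be seen with a short sketch. The sample string is an assumption chosen for illustration; it is built from code points so the result does not depend on how the script file itself is encoded.

```powershell
# 'a' (U+0061), 'é' (U+00E9) and '€' (U+20AC) take 1, 2 and 3 bytes in UTF-8.
$text = -join ([char[]](0x61, 0xE9, 0x20AC))
$utf8 = [Text.Encoding]::UTF8

$charCount   = $text.Length                              # 3 characters
$byteCount   = $utf8.GetByteCount($text)                 # 6 bytes
$prefixBytes = $utf8.GetByteCount($text.Substring(0, 2)) # 3 bytes for 2 characters
```

Substring(0, 2) returns two characters, but those two characters encode to three bytes, which is why a character-based length cannot be used directly as a byte budget.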

I will note that I have explored the [Text.Encoding].GetBytes([string], [int], [int]) overload. However, I face the same issue: the method expects a character count, not a byte count, for the length of the span to convert starting from the given index.

To answer the base question "How can I substring a specific number of bytes from a string in PowerShell", I was able to write the following function:

function Get-SubstringByByteCount {
  [CmdletBinding()]
  Param(
    [Parameter(Mandatory)]
    [ValidateScript({ $null -ne $_ -and $_.Value -is [string] })]
    [ref]$InputString,
    [int]$FromIndex = 0,
    [Parameter(Mandatory)]
    [int]$ByteCount,
    [ValidateScript({ [Text.Encoding]::$_ })]
    [string]$Encoding = 'UTF8'
  )
  
  [long]$byteCounter = 0
  [int]$i = $FromIndex   # current character position within the input string
  [System.Text.StringBuilder]$sb = New-Object System.Text.StringBuilder $ByteCount

  try {
    while ( $byteCounter -lt $ByteCount -and $i -lt $InputString.Value.Length ) {
      [char]$char = $InputString.Value[$i++]
      [void]$sb.Append($char)
      $byteCounter += [Text.Encoding]::$Encoding.GetByteCount($char)
    }

    $sb.ToString()
  } finally {
    if( $sb ) {
      $sb = $null
      [System.GC]::Collect()
    }
  }
}

Invocation works like so:

Get-SubstringByByteCount -InputString ( [ref]$someString ) -ByteCount 8

Some notes on this implementation:

  • Takes the string as a [ref] type, since the original goal was to avoid copying the full string in a limited-memory scenario. This function could be re-implemented using the [string] type instead.
  • This function essentially adds each character to a StringBuilder until the specified number of bytes has been written. Because the byte count is checked only after a character is appended, a multi-byte final character can push the result slightly past -ByteCount.
  • The number of bytes of each character is determined by using one of the [Text.Encoding]::GetByteCount overloads. The encoding can be specified via a parameter, but the value must match the name of one of the static encoding properties available from [Text.Encoding]. Defaults to UTF8 as written.
  • $sb = $null and [System.GC]::Collect() are intended to forcibly clean up the StringBuilder in a memory-constrained environment, but could be omitted if this is not a concern.
  • -FromIndex is the start position within -InputString at which the substring operation begins. Defaults to 0, the start of -InputString.
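As the first note mentions, the function could be re-implemented with a plain [string] parameter. The following is a minimal sketch of what that might look like, not a drop-in replacement: the name Get-SubstringByByteCountSketch is hypothetical, the encoding is passed as a [Text.Encoding] object rather than a property name, and unlike the function above it stops before a character that would exceed -ByteCount. Like the function above, it counts each UTF-16 code unit separately, so surrogate pairs are not treated specially.

```powershell
function Get-SubstringByByteCountSketch {
  param(
    [Parameter(Mandatory)][string]$InputString,
    [int]$FromIndex = 0,
    [Parameter(Mandatory)][int]$ByteCount,
    [Text.Encoding]$Encoding = [Text.Encoding]::UTF8
  )

  $sb = [Text.StringBuilder]::new()
  $byteCounter = 0

  for ($i = $FromIndex; $i -lt $InputString.Length; $i++) {
    # Encoded size of the current character in the chosen encoding.
    $charBytes = $Encoding.GetByteCount($InputString.Substring($i, 1))

    # Stop before a character that would push the result past -ByteCount.
    if ($byteCounter + $charBytes -gt $ByteCount) { break }

    [void]$sb.Append($InputString[$i])
    $byteCounter += $charBytes
  }

  $sb.ToString()
}

# 'a', 'é', '€', 'b' take 1 + 2 + 3 + 1 bytes in UTF-8.
$sample = -join ([char[]](0x61, 0xE9, 0x20AC, 0x62))

$fromStart = Get-SubstringByByteCountSketch -InputString $sample -ByteCount 3
$fromIndex = Get-SubstringByByteCountSketch -InputString $sample -FromIndex 1 -ByteCount 5
```

With this sample, $fromStart holds the first two characters (1 + 2 = 3 bytes) and $fromIndex holds the second and third characters (2 + 3 = 5 bytes). Passing the string by value does not copy its contents, since .NET strings are reference types, but that alone says nothing about whether it fits a particular memory quota.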
