C#: perform string operations on a UTF-16 byte array

I'm reading a file into byte[] buffer. The file contains a lot of UTF-16 strings (millions) in the following format:

  • The first byte contains the string length in chars (range 0 .. 255).
  • The following bytes contain the string's characters in UTF-16 encoding (each char is represented by 2 bytes, so byteCount = charCount * 2).
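
To make the layout concrete, here is a minimal sketch that walks such a buffer record by record and reports where each string's character data starts and how many chars it has. The class and method names (RecordWalker, Enumerate) are made up for illustration, and it assumes the buffer holds nothing but back-to-back records.

```csharp
using System;
using System.Collections.Generic;

public static class RecordWalker
{
    // Hypothetical helper: yields (byte offset of the char data, char count)
    // for each length-prefixed record in the buffer.
    public static IEnumerable<(int ByteOffset, int CharCount)> Enumerate(byte[] buffer)
    {
        int i = 0;
        while (i < buffer.Length)
        {
            int charCount = buffer[i++];     // 1-byte length prefix, 0..255 chars
            int byteCount = charCount * 2;   // UTF-16: 2 bytes per char
            if (i + byteCount > buffer.Length)
                throw new InvalidOperationException("Truncated record at byte " + (i - 1));

            yield return (i, charCount);
            i += byteCount;                  // skip to the next length prefix
        }
    }
}
```

Nothing is decoded here; the offsets can be handed to whatever does the actual string work.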

I need to perform standard string operations on all strings in the file, for example IndexOf, EndsWith and StartsWith, with StringComparison.OrdinalIgnoreCase and StringComparison.Ordinal.

For now, my code first converts each string from the byte array to a System.String. I found the following code to be the most efficient way to do so:

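// Position/length validation removed to keep the code short.
// Requires an unsafe context (compile with unsafe code enabled).

string result;
byte charLength = _buffer[_bufferI++];   // 1-byte length prefix, in chars
int byteLength = charLength * 2;         // UTF-16: 2 bytes per char

// Pin the buffer so the GC cannot move it while we hold a raw pointer.
fixed (byte* pBuffer = &_buffer[_bufferI])
{
    // Copies charLength chars out of the buffer into a new string.
    result = new string((char*)pBuffer, 0, charLength);
}

_bufferI += byteLength;
return result;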

Still, new string(char*, int, int) is very slow because it performs unnecessary copying for each string.

The profiler says System.String.wstrcpy(char*,char*,int32) is the slow part.

I need a way to perform string operations without copying bytes for each string.

Is there a way to perform string operations on the byte array directly?

Is there a way to create a new string without copying its bytes?

No, you can't create a string without copying the character data.

The String object stores the metadata for the string (Length, etc.) in the same memory area as the character data, so you can't keep the character data in the byte array and pretend that it's a String object.

You could try other ways of constructing the string from the byte data and see if any of them has less overhead, like Encoding.Unicode.GetString (Encoding.Unicode is .NET's UTF-16 little-endian encoding; there is no Encoding.UTF16).
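
As a rough sketch of that alternative, something like the following decodes one record with Encoding.Unicode.GetString. The names Utf16Decode and ReadString are invented for the example, and it assumes the file is little-endian UTF-16 and that offset points at a length byte. It still copies the characters, so whether it beats the pointer-based constructor is something you would have to measure.

```csharp
using System;
using System.Text;

public static class Utf16Decode
{
    // Hypothetical helper: decodes the record whose length byte is at 'offset'
    // and advances 'offset' to the next record.
    public static string ReadString(byte[] buffer, ref int offset)
    {
        int charCount = buffer[offset++];   // 1-byte length prefix
        int byteCount = charCount * 2;      // UTF-16: 2 bytes per char

        // Encoding.Unicode is UTF-16 little-endian; this allocates a new string.
        string s = Encoding.Unicode.GetString(buffer, offset, byteCount);
        offset += byteCount;
        return s;
    }
}
```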

If you are using a pointer, you could try to get multiple strings at a time, so that you don't have to pin the buffer (with fixed) for each string.
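
A sketch of what pinning once could look like, assuming the buffer holds only back-to-back records and the project allows unsafe code. PinnedReader and ReadAll are illustrative names, not anything from the question. Note that each string is still copied; only the per-string fixed statement goes away.

```csharp
using System;
using System.Collections.Generic;

public static class PinnedReader
{
    // Hypothetical helper; needs unsafe code enabled (AllowUnsafeBlocks).
    public static unsafe List<string> ReadAll(byte[] buffer)
    {
        var result = new List<string>();
        if (buffer.Length == 0) return result;

        fixed (byte* p = buffer)   // pin once for the whole pass
        {
            int i = 0;
            while (i < buffer.Length)
            {
                int charCount = p[i++];          // 1-byte length prefix
                int byteCount = charCount * 2;   // UTF-16: 2 bytes per char
                if (i + byteCount > buffer.Length)
                    throw new InvalidOperationException("Truncated record");

                // Same constructor as in the question; it copies the chars.
                result.Add(new string((char*)(p + i), 0, charCount));
                i += byteCount;
            }
        }
        return result;
    }
}
```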

You could read the file with a StreamReader using Encoding.Unicode (.NET's name for UTF-16; there is no Encoding.UTF16), so you do not have the "byte overhead" in between:

using System.IO;
using System.Text;

using (StreamReader sr = new StreamReader(filename, Encoding.Unicode))
{
    string line;

    // ReadLine splits the input on line terminators.
    while ((line = sr.ReadLine()) != null)
    {
        //Your Code
    }
}

You could create extension methods on byte arrays to handle most of those string operations directly on the byte array and avoid the cost of converting. I'm not sure which string operations you perform, so I'm not sure whether all of them could be done this way.
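
One way such extension methods might look on a runtime that has Span support (.NET Core 2.1 or later) is to reinterpret the bytes as a ReadOnlySpan<char> and reuse the built-in span comparisons. This is only a sketch: the names AsChars, StartsWithAt, EndsWithAt and IndexOfAt are invented here, and it assumes a little-endian machine that tolerates misaligned reads, since the 1-byte length prefix means the char data may start at an odd address.

```csharp
using System;
using System.Runtime.InteropServices;

public static class Utf16ByteArrayExtensions
{
    // Views 'charCount' UTF-16 chars starting at 'byteOffset' without copying.
    public static ReadOnlySpan<char> AsChars(this byte[] buffer, int byteOffset, int charCount)
    {
        var bytes = new ReadOnlySpan<byte>(buffer, byteOffset, charCount * 2);
        return MemoryMarshal.Cast<byte, char>(bytes);
    }

    public static bool StartsWithAt(this byte[] buffer, int byteOffset, int charCount,
                                    string value, StringComparison comparison)
        => buffer.AsChars(byteOffset, charCount).StartsWith(value.AsSpan(), comparison);

    public static bool EndsWithAt(this byte[] buffer, int byteOffset, int charCount,
                                  string value, StringComparison comparison)
        => buffer.AsChars(byteOffset, charCount).EndsWith(value.AsSpan(), comparison);

    public static int IndexOfAt(this byte[] buffer, int byteOffset, int charCount,
                                string value, StringComparison comparison)
        => buffer.AsChars(byteOffset, charCount).IndexOf(value.AsSpan(), comparison);
}
```

Here byteOffset is the position right after the length byte and charCount is the value of that byte. No string is allocated per record; both StringComparison.Ordinal and StringComparison.OrdinalIgnoreCase can be passed through.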
