
In Java, how to copy data from String to char[]/byte[] efficiently?

Original post: 2020-08-25 09:23:12 | Tags: java, performance

I need to copy the contents of many large, different Strings into a static char array and use that array frequently in a performance-critical job, so it is important to avoid allocating much new space.

For that reason, str.toCharArray() is ruled out, since it allocates a new array for every String.

As we all know, charAt(i) is slower and more complex than indexing with square brackets [i]. So I want to use byte[] or char[].

The good news is that there is a str.getBytes(srcBegin, srcEnd, dst, dstBegin). The bad news is that it was (or is to be?) deprecated.

So how can we accomplish this demanding job?

4 Answers

I believe you want getChars(int, int, char[], int). That will copy the characters into the specified array, and I'd expect it to do so "as efficiently as reasonably possible".

You should avoid converting between text and binary representations unless you really need to. Aside from anything else, that conversion itself is likely to be time-consuming.
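A minimal sketch of how getChars could feed a reusable static buffer, as the question describes. The class name ReusableCharBuffer, the initial capacity, and the grow-on-demand policy are illustrative assumptions, not part of the original answer; the shared static array is also not thread-safe.

```java
final class ReusableCharBuffer {
    // Hypothetical starting capacity; size it for the largest string you expect.
    private static char[] buffer = new char[1 << 16];

    /** Copies s into the shared buffer and returns the number of valid chars. */
    static int load(String s) {
        int len = s.length();
        if (len > buffer.length) {
            // Grow only when needed, so steady-state calls allocate nothing.
            buffer = new char[Math.max(len, buffer.length * 2)];
        }
        s.getChars(0, len, buffer, 0); // bulk copy, no intermediate array
        return len;
    }

    /** Returns the current shared array; only the first load(...) chars are valid. */
    static char[] buffer() {
        return buffer;
    }

    public static void main(String[] args) {
        int n = load("hello");
        char[] buf = buffer();
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (buf[i] == 'l') {
                count++;
            }
        }
        System.out.println(n + " chars loaded, " + count + " of them are 'l'");
    }
}
```

Callers have to carry the returned length around, because the array is usually longer than the string that was last loaded into it.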

A small stocktaking:

String holds Unicode text; it can be normalized (java.text.Normalizer).
int[] code points are Unicode symbols.
char[] holds UTF-16 code units (2 bytes per char); sometimes a code point needs 2 chars: a surrogate pair.
byte[] is for binary data. Holding Unicode text as UTF-8 is relatively compact when much of it is ASCII or Latin-1.
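To make the differences between these representations concrete, here is a small sketch that measures one string three ways. The class name TextSizes and the sample string are assumptions chosen for illustration; the sample contains one ASCII letter followed by U+1F600, which lies outside the Basic Multilingual Plane and therefore needs a surrogate pair.

```java
import java.nio.charset.StandardCharsets;

final class TextSizes {
    /** Number of UTF-16 code units (what length() and char[] see). */
    static int chars(String s) {
        return s.length();
    }

    /** Number of Unicode code points (what an int[] of code points would hold). */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes when encoded as UTF-8 (this call allocates a new array). */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00"; // 'a' followed by U+1F600 as a surrogate pair
        int[] cps = s.codePoints().toArray();
        System.out.println("chars=" + chars(s)
                + " codePoints=" + codePoints(s)
                + " utf8Bytes=" + utf8Bytes(s)
                + " second code point=U+" + Integer.toHexString(cps[1]).toUpperCase());
    }
}
```

For this sample the char count is 3, the code point count is 2, and the UTF-8 length is 5, which is why index arithmetic on char[] and on code points cannot be mixed freely.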

Processing might be done on a ByteBuffer, CharBuffer, or IntBuffer.

When dealing with Asian scripts, int code points are probably the most feasible. Otherwise bytes seem best.

Code points (or chars) also make sense when the Character class is used to classify Unicode blocks and scripts, digits in several scripts, emoji, and so on.

For performance, bytes are probably the best choice, since they are often the most compact representation. UTF-8, probably.

Memory allocation cannot be avoided efficiently. getBytes should be used with a Charset, and some kind of conversion almost always happens. Because newer Java versions (compact strings, since Java 9) can store a String internally as a byte array rather than a char array when its content fits Latin-1 (ISO-8859-1), even access to an internal char array would not help. And new arrays are created.

What one can do is use fast ByteBuffers.
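One way this might look, as a sketch only: a CharsetEncoder writes UTF-8 into a single ByteBuffer that is reused across calls, so no byte[] is allocated per string. The class name ReusableUtf8Encoder and the fixed capacity are assumptions for illustration. CharBuffer.wrap(s) still creates a small wrapper object per call, but it does not copy the characters.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class ReusableUtf8Encoder {
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    private final ByteBuffer out;

    /** capacityBytes is a hypothetical upper bound on the encoded size. */
    ReusableUtf8Encoder(int capacityBytes) {
        this.out = ByteBuffer.allocate(capacityBytes);
    }

    /**
     * Encodes s into the reused buffer and returns that same buffer,
     * flipped so that remaining() is the number of encoded bytes.
     * The contents are overwritten by the next call.
     */
    ByteBuffer encode(String s) {
        out.clear();
        encoder.reset();
        CharBuffer in = CharBuffer.wrap(s); // a view over s, not a copy
        CoderResult result = encoder.encode(in, out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (!result.isUnderflow()) {
            // Overflow (buffer too small) or malformed input (unpaired surrogate).
            throw new IllegalArgumentException("cannot encode: " + result);
        }
        out.flip();
        return out;
    }

    public static void main(String[] args) {
        ReusableUtf8Encoder enc = new ReusableUtf8Encoder(1024);
        ByteBuffer bytes = enc.encode("h\u00e9llo");
        System.out.println("encoded bytes: " + bytes.remaining());
    }
}
```

The encoder and buffer are stateful, so one instance per thread is the safer arrangement.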

Alternatively, for linguistic analysis one can use databases, maybe graph databases. At least something that can exploit parallelism.

You are pretty much restricted to the APIs offered by the String class, and evidently that deprecated method is supposed to be replaced with getBytes() (or an overload that lets you specify a charset).

In other words: the problem you describe, "having many large strings that need to go into arrays", can't be solved easily.

Thus a distinct non-answer: look into your design. If performance is really critical, then do not create that many large strings up front!

In other words: if your measurements convince you that you do have a real performance issue, then adapt your design as needed. Maybe, at the place where your strings are "coming in", you can avoid String objects altogether and use something that works better for you later on, performance-wise.

But of course, that will lead to a complex, error-prone solution in which you do a lot of "memory management" yourself. Thus, as said: measure first. Ensure that you have a real problem, and that it actually sits where you think it sits.

str.getBytes(srcBegin, srcEnd, dst, dstBegin) is indeed deprecated. The relevant documentation recommends getBytes() instead. If you needed str.getBytes(srcBegin, srcEnd, dst, dstBegin) because you sometimes don't have to convert the entire string, I suppose you could call substring() first, but I'm not sure how badly that would affect your code's efficiency, if at all. Or, if storing the data in a char[] is all the same to you, you can use getChars(int,int,char[],int), which is not deprecated.
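For the partial-copy case, getChars already takes a source range and a destination offset, so the substring() step can be skipped. A short sketch, where RangeCopy and copyRange are hypothetical names used only for this illustration:

```java
final class RangeCopy {
    /**
     * Copies s[srcBegin, srcEnd) into dst starting at dstBegin and
     * returns the number of chars copied. No substring is created.
     */
    static int copyRange(String s, int srcBegin, int srcEnd, char[] dst, int dstBegin) {
        s.getChars(srcBegin, srcEnd, dst, dstBegin);
        return srcEnd - srcBegin;
    }

    public static void main(String[] args) {
        char[] dst = new char[16];
        int n = copyRange("performance", 3, 7, dst, 0);
        // Building a String here is only for display; real code would read dst directly.
        System.out.println(new String(dst, 0, n));
    }
}
```

getChars throws IndexOutOfBoundsException if the range does not fit the source or the destination, so the caller is responsible for sizing dst.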