
In Java, how to copy data from String to char[]/byte[] efficiently?

I need to copy the contents of many large, distinct Strings into a static char array and use that array frequently in a performance-critical job, so it's important to avoid allocating a lot of new memory.

For that reason, str.toCharArray() is ruled out, since it allocates a new array for every String.

As we all know, charAt(i) is slower and more cumbersome than indexing with square brackets [i]. So I want to use a byte[] or char[].

The good news is that there is str.getBytes(srcBegin, srcEnd, dst, dstBegin). The bad news is that it is deprecated (or about to be?).

So how can we get this demanding job done?

I believe you want getChars(int, int, char[], int) . That will copy the characters into the specified array, and I'd expect it to do it "as efficiently as reasonably possible".
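A minimal sketch of that approach, assuming a single shared buffer that grows only when a longer string arrives. The class name ReusableCharBuffer, the method names, and the initial size of 1024 are hypothetical; only String.getChars comes from the JDK.

```java
final class ReusableCharBuffer {
    // Hypothetical shared buffer, reused across calls (not thread-safe).
    private static char[] buffer = new char[1024];

    /** Copies str into the shared buffer and returns the number of valid chars. */
    static int load(String str) {
        int len = str.length();
        if (len > buffer.length) {
            // Grow geometrically so a run of large strings does not reallocate each time.
            buffer = new char[Math.max(len, buffer.length * 2)];
        }
        // Copies chars [0, len) of str into buffer starting at index 0.
        str.getChars(0, len, buffer, 0);
        return len;
    }

    static char[] buffer() {
        return buffer;
    }

    public static void main(String[] args) {
        int n = load("hello");
        System.out.println(new String(buffer(), 0, n)); // prints "hello"
    }
}
```

Only the first n chars returned by load are valid; anything beyond that is left over from earlier strings.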

You should avoid converting between text and binary representations unless you really need to. Aside from anything else, that conversion itself is likely to be time-consuming.

A small stocktaking:

  • String holds Unicode text; it can be normalized ( java.text.Normalizer ).
  • int[] code points are Unicode symbols.
  • char[] holds UTF-16 code units (2 bytes per char); some code points need 2 chars: a surrogate pair.
  • byte[] is for binary data. Holding Unicode text in UTF-8 is relatively compact when much of it is ASCII or Latin-1.

Processing might be done on a ByteBuffer, CharBuffer, IntBuffer.

When dealing with Asian scripts, int code points are probably the most practical choice. Otherwise bytes seem best.

Code points (or chars) also make sense when the Character class is utilized for classification of Unicode blocks and scripts, digits in several scripts, emoji, whatever.
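As a rough illustration of classifying by code point, the sketch below walks a String with codePointAt and Character.charCount, so a surrogate pair is classified once as a single symbol and no intermediate array is allocated. The class CodePointScan and the method countLetters are hypothetical names.

```java
final class CodePointScan {
    /** Counts letters by code point; a surrogate pair counts as one symbol. */
    static int countLetters(String s) {
        int letters = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // reads 1 or 2 chars
            if (Character.isLetter(cp)) {
                letters++;
            }
            i += Character.charCount(cp);    // advance by 1, or 2 for a surrogate pair
        }
        return letters;
    }

    public static void main(String[] args) {
        // 'a', U+1D400 (written as a surrogate pair), and the digit '1'
        System.out.println(countLetters("a\uD835\uDC001")); // prints 2
    }
}
```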

For performance, bytes are probably the best choice, as they are often the most compact representation; UTF-8, most likely.

One cannot deal with the memory allocation efficiently. getBytes should be used with a Charset, and almost always some kind of conversion happens. Newer Java versions (compact strings, since Java 9) can keep a byte array internally instead of a char array when the text fits an encoding like Latin-1 (ISO-8859-1), so even getting at an internal char array would not help. And new arrays are created.

What one can do is use fast ByteBuffers.

Alternatively, for linguistic analysis one can use databases, maybe graph databases; at least something that can exploit parallelism.
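One possible reading of this, sketched under assumptions: keep a CharsetEncoder and one ByteBuffer around and encode each String into that buffer. The class ReusableUtf8Encoder and the initial capacity of 4096 are hypothetical; CharBuffer.wrap creates a small wrapper object but does not copy the characters.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class ReusableUtf8Encoder {
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    private ByteBuffer out = ByteBuffer.allocate(4096); // reused across calls

    /** Encodes str as UTF-8 into the reused buffer; the result is valid until the next call. */
    ByteBuffer encode(String str) {
        // Worst-case output size, so the encoder cannot overflow the buffer.
        int needed = (int) Math.ceil(str.length() * (double) encoder.maxBytesPerChar());
        if (needed > out.capacity()) {
            out = ByteBuffer.allocate(needed);
        }
        out.clear();
        encoder.reset();
        CoderResult result = encoder.encode(CharBuffer.wrap(str), out, true);
        if (!result.isUnderflow()) {
            // Malformed input such as an unpaired surrogate ends up here.
            throw new IllegalArgumentException("cannot encode: " + result);
        }
        encoder.flush(out);
        out.flip(); // position 0, limit = number of bytes written
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer utf8 = new ReusableUtf8Encoder().encode("h\u00e9llo");
        System.out.println(utf8.remaining()); // prints 6
    }
}
```

The returned buffer is the encoder's own, so callers should finish reading it before encoding the next String. Whether this beats getBytes(Charset) in practice would need measuring.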

You are pretty much restricted to the APIs offered by the String class, and obviously, that deprecated method is supposed to be replaced with getBytes() (or an overload that lets you specify a charset).

In other words: the problem you describe, "having many large strings that need to go into arrays", can't be solved easily.

Thus a distinct non-answer: look into your design. If performance is really critical, then do not create that many large strings upfront!

In other words: if your measurements convince you that you do have a real performance issue, then adapt your design as needed. Maybe there is a chance that, at the place where your strings are "coming in", you can avoid String objects altogether and use something that works better for you later on, performance-wise.

But of course: that will lead to a complex, error-prone solution, where you do a lot of "memory management" yourself. Thus, as said: measure first. Ensure that you have a real problem, and that it actually sits in the place you think it sits.

str.getBytes(srcBegin, srcEnd, dst, dstBegin) is indeed deprecated. The relevant documentation recommends getBytes() instead. If you needed str.getBytes(srcBegin, srcEnd, dst, dstBegin) because you sometimes don't have to convert the entire string, I suppose you could call substring() first, but I'm not sure how badly that would impact your code's efficiency, if at all. Or, if it's all the same to you whether you store the data in a char[], you can use getChars(int,int,char[],int), which is not deprecated.
