
string.replaceAll cutting characters 50% of the time

I'm trying to use a series of string.replaceAlls to swap all the UTF-8 special characters in a text file with ASCII & HTML encoding. Along the way I've hit a particularly stubborn one: \uAC8B, the UTF-8 middot.

Here's the line that cuts out the character, half the time:

  string_out = string_out.replaceAll("¬ï", "·");

("¬ï" is how a UTF-8 · appears as extended ASCII. Before stumbling on this line, I'd tried "\uAC8B" and many other encodings without success.)

The line cuts out the UTF-8 middot rather than replacing it, and it does that only about half the time. The other half of the time it misses the character and leaves it unchanged. If I make multiple copies of the line or move other lines around it, it doesn't even do that.

This feels like a multithreading issue, but I'm not aware of any multithreading going on. It's just a block of replaceAlls in an included .jsp file being run from another .jsp.

What could cause this race-condition-like behavior?

U+AC8B is not a dot; it's a Hangul (Korean) syllable. Did you mean U+00B7?

Java strings are always sequences of UTF-16 code units. UTF-8 is a way of representing Unicode characters as bytes in a file; it is not how Java strings are represented in memory.
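To make the distinction concrete, here is a minimal sketch that compares the in-memory length of the middot with its UTF-8 byte form. The class name MiddotEncoding and the helper utf8Hex are made up for illustration; only the standard library is used.

```java
import java.nio.charset.StandardCharsets;

public class MiddotEncoding {
    // Hypothetical helper: the UTF-8 bytes of s as uppercase hex pairs, space-separated.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String middot = "\u00B7";                 // MIDDLE DOT, U+00B7
        System.out.println(middot.length());      // prints 1 (one UTF-16 code unit)
        System.out.println(utf8Hex(middot));      // prints C2 B7 (two bytes on disk)
    }
}
```

The two bytes exist only in the encoded file; inside the string there is a single char to match against.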

Pay attention to the encoding used to read the input file and write the output file: both should be UTF-8. Once the file contents have been read into a Java string, though, they are no longer UTF-8 but UTF-16.
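As a rough illustration of why the decoding charset matters, the sketch below decodes the same two bytes twice. Whether a wrong charset is what actually happened in the asker's JSP is an assumption; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Decode raw bytes with the correct charset for a UTF-8 file.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decode the same bytes as ISO-8859-1, simulating a wrong charset choice.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xC2, (byte) 0xB7};   // UTF-8 encoding of U+00B7

        String good = decodeUtf8(raw);
        String bad = decodeLatin1(raw);

        System.out.println(good.length());          // prints 1
        System.out.println(bad.length());           // prints 2 (two unrelated chars)
        System.out.println(good.equals("\u00B7"));  // prints true
    }
}
```

With the wrong charset the middot turns into two separate characters, so a replacement that looks for a single middot can never match.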

I think your best bet is to use the correct Unicode escape (\u00B7) rather than trying to represent raw UTF-8 bytes as ASCII.
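A minimal sketch of that suggestion, assuming the goal is to emit the HTML entity &middot; (the question does not say exactly which replacement text is wanted). The class and method names are hypothetical. It uses String.replace, which matches literally, instead of replaceAll, which treats its first argument as a regular expression.

```java
public class MiddotReplace {
    // Replace every MIDDLE DOT (U+00B7) with an HTML entity.
    // The entity "&middot;" is an assumed target, not taken from the question.
    static String escapeMiddot(String in) {
        return in.replace("\u00B7", "&middot;");
    }

    public static void main(String[] args) {
        String sample = "a\u00B7b\u00B7c";
        System.out.println(escapeMiddot(sample));   // prints a&middot;b&middot;c
    }
}
```

Because the escape is resolved by the compiler, this does not depend on the encoding of the .java or .jsp source file itself.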
