string.replaceAll cuts out the character 50% of the time

Question

I'm trying to use a series of string.replaceAll calls to swap all the UTF-8 special characters in a text file for their ASCII and HTML encodings. Along the way I've hit a particularly stubborn one: \uAC8B, the UTF-8 middot.

Here's the line that cuts out the character, half the time:

  string_out = string_out.replaceAll("¬ï", "&amp;middot;");

("¬ï" is how a UTF-8 · appears as extended ASCII. Before stumbling on this line, I'd tried "\uAC8B" and many other encodings without success.)

The line cuts out the UTF-8 middot rather than replacing it, and it does that only half the time. The other half of the time it misses the character and leaves it unchanged. If I make multiple copies of the line or move other lines around it, it doesn't even do that.

This feels like a multithreading issue, but I'm not aware of any multithreading going on. It's just a block of replaceAll calls in an included .jsp file that is run from another .jsp.

What could cause this race-condition-like behavior?

Answer 1

U+AC8B is not a dot; it's a Hangul (Korean) syllable. Did you mean U+00B7?

Java strings are always UTF-16 Unicode. UTF-8 is a way of representing Unicode characters in a file; it is not the way Java strings are stored in memory.
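To make the distinction concrete, here is a small sketch (the class and method names are made up for illustration). It shows that the middot is a single char in a Java string, while its UTF-8 encoding is the two bytes C2 B7:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
class MiddotBytes {
    // Returns the UTF-8 encoding of a string as space-separated hex bytes.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String middot = "\u00B7";              // U+00B7 MIDDLE DOT
        System.out.println(middot.length());   // one UTF-16 code unit in memory
        System.out.println(utf8Hex(middot));   // two bytes once encoded as UTF-8
    }
}
```

So inside the JVM there is nothing "UTF-8" about the string; the byte pair only exists in the file or on the wire.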

Pay attention to the encoding used to read the input file and write the output file; both should be UTF-8. Once the file contents have been read into a Java string, though, they are no longer UTF-8 but UTF-16 (16-bit code units).
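As a rough illustration of why the read encoding matters (names here are hypothetical), the same two bytes decode to one middot when read as UTF-8, but to two unrelated characters when read as a single-byte charset such as ISO-8859-1. This kind of mis-decoding is one plausible source of the two-character sequence in the question, though I can't tell from the post which charset was actually in play:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing one byte sequence under two decodings.
class DecodeDemo {
    // The UTF-8 encoding of U+00B7 MIDDLE DOT.
    static final byte[] MIDDOT_UTF8 = {(byte) 0xC2, (byte) 0xB7};

    // Correct: the two bytes become a single char, U+00B7.
    static String decodeAsUtf8() {
        return new String(MIDDOT_UTF8, StandardCharsets.UTF_8);
    }

    // Wrong charset: each byte becomes its own char (U+00C2, then U+00B7).
    static String decodeAsLatin1() {
        return new String(MIDDOT_UTF8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(decodeAsUtf8().length());    // 1
        System.out.println(decodeAsLatin1().length());  // 2
    }
}
```

When reading from a stream, passing the charset explicitly, for example `new InputStreamReader(in, StandardCharsets.UTF_8)`, avoids depending on the platform default.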

I think your best chance is to use the correct Unicode escape, "\u00B7", rather than trying to represent raw UTF-8 bytes as ASCII.
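A minimal sketch of that suggestion, assuming the text has already been decoded correctly into a Java string. The class and method names are hypothetical, and I've used "&middot;" as the replacement text; substitute whatever entity string your output actually needs:

```java
// Hypothetical helper, assuming the input was decoded as UTF-8 on read.
class MiddotReplace {
    // Replaces every U+00B7 MIDDLE DOT with the HTML entity.
    // String.replace does a literal replacement, so no regex is involved.
    static String escapeMiddot(String text) {
        return text.replace("\u00B7", "&middot;");
    }

    public static void main(String[] args) {
        System.out.println(escapeMiddot("a\u00B7b"));  // a&middot;b
    }
}
```

Since the pattern is a fixed character, replace is enough here; replaceAll treats its first argument as a regular expression, which isn't needed for this case.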

Answer 1
Score 4, accepted, 2012-01-04 19:37:11