
string.replaceAll cutting characters 50% of the time

I'm trying to use a series of string.replaceAll calls to swap all the UTF-8 special characters in a text file with ASCII and HTML encoding. Along the way I've hit a particularly stubborn one: \uAC8B, the UTF-8 middot.

Here's the line that cuts out the character, half the time:

  string_out = string_out.replaceAll("•", "·");

("¬ï" is how a UTF-8 · appears as extended ASCII. Before stumbling on this line, I'd tried "\uAC8B" and many other encodings without success.)

The line cuts out the UTF-8 middot rather than replacing it, and it does that only half the time. The other half of the time it misses the character and leaves it unchanged. If I make multiple copies of the line or move other lines around it, it doesn't even do that.

This feels like a multithreading issue, but I'm not aware of any multithreading going on. It's just a block of replaceAlls in an included .jsp file being run from another .jsp.

What could cause this race-condition-like behavior?

AC8B is not a dot; it's a Hangul (Korean) syllable. Did you mean 00B7?

Java strings are always UTF-16 Unicode. UTF-8 is a way of representing Unicode characters in a file; it is not the way Java strings are stored in memory.
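To make the distinction concrete, here is a minimal sketch (the class and helper names are made up for illustration) that prints the same middot, U+00B7, once as the UTF-16 code units held in a Java string and once as the bytes it would occupy in a UTF-8 file:

```java
import java.nio.charset.StandardCharsets;

class MiddotEncodingDemo {
    // Hex dump of the UTF-16 code units that make up the in-memory string.
    static String utf16Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    // Hex dump of the bytes produced when the string is encoded as UTF-8.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String middot = "\u00B7";              // MIDDLE DOT, written as an escape
        System.out.println(middot.length());   // 1
        System.out.println(utf16Hex(middot));  // 00B7
        System.out.println(utf8Hex(middot));   // C2 B7
    }
}
```

The two-byte sequence C2 B7 exists only in the encoded file. Inside the string there is a single char, so a pattern built from the two bytes' "extended ASCII" appearance has nothing to match once the file has been decoded correctly.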

Pay attention to the encoding used to read the input and write the output files. They should be UTF-8, but once the file contents have been read into a Java string, they are no longer UTF-8 but 16-bit Unicode (UTF-16).
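As a rough illustration of why the read and write encodings matter, the sketch below writes a string to a temporary file with one charset and reads it back with another. The class name, the `roundTrip` helper, and the temp-file approach are assumptions for the demo, not the asker's actual code:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetIo {
    // Writes text to a temp file using writeCs, then reads it back using readCs.
    static String roundTrip(String text, Charset writeCs, Charset readCs) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(writeCs));
                return new String(Files.readAllBytes(tmp), readCs);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "a\u00B7b";
        // Same charset on both sides: the middot survives as one char.
        String good = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        // UTF-8 bytes decoded as ISO-8859-1: the middot becomes two chars.
        String bad = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(good.length()); // 3
        System.out.println(bad.length());  // 4
    }
}
```

Passing the charset explicitly on both sides avoids depending on the platform default, which can differ between machines and servers.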

I think your best chance is using the correct Unicode escape, not trying to represent raw UTF-8 bytes as ASCII.
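A minimal sketch of that suggestion, assuming the target character really is the middle dot U+00B7 and that the desired HTML output is the `&middot;` entity (both are assumptions; the question does not confirm them). It uses `String.replace`, which matches literally, instead of the regex-based `replaceAll`. The `escapeMiddot` helper is hypothetical:

```java
class MiddotReplace {
    // Replaces every MIDDLE DOT (U+00B7) with the HTML entity &middot;.
    // String.replace treats both arguments as literal text, not as a regex.
    static String escapeMiddot(String input) {
        return input.replace("\u00B7", "&middot;");
    }

    public static void main(String[] args) {
        System.out.println(escapeMiddot("a\u00B7b")); // a&middot;b
    }
}
```

Writing the character as an escape keeps the source file pure ASCII, so the result does not depend on the encoding the JSP or Java source is compiled with.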
