
Weird character conversion, need help clarifying

I am writing a program that takes data extracted from a web page into an Excel sheet and then prints it to a text note. However, I have run into a strange problem: between the Excel sheet and the text note, one character changes. The - turns into a ?. My workaround was to iterate through the word and, when I reach the ?, change it to a -. I tried Unicode values that I found online and did a

.replace("(question mark unicode) ", " - ") 

to no avail. Does anyone have any idea why this is happening? Can you also confirm the Unicode values for ? and -? For example, the word "Leo‑III 1.3" now comes out as "Leo?III 1.3". Thank you for any help.

replace in Java takes a character (or a CharSequence such as a String) as its first argument and replaces all occurrences with the second argument.

Alternatively, you can use this:

String newStr = str.replaceAll("\\?", "-");

replaceAll treats its first argument as a regex and replaces all matches with the second argument.

Note: the \\ escapes the ?, which would otherwise be interpreted as a regex quantifier.

Also, be sure to store the result in a new String variable, because strings are immutable.

According to the Java docs, the String class's replace method takes either two primitive char values or two objects implementing CharSequence (such as String) as its parameters.

If you want to convert Leo?III 1.3 to Leo-III 1.3, use:

.replace("?", "-")
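As a minimal sketch of the two suggestions above, the following compares the literal and the regex variant side by side. The class and method names (ReplaceDemo, withReplace, withReplaceAll) are made up for illustration. Note that both only help if the string really contains an ASCII question mark.

```java
// Hypothetical demo class; the names are illustrative, not from the original posts.
class ReplaceDemo {
    // Literal replacement: "?" is treated as a plain character sequence.
    static String withReplace(String s) {
        return s.replace("?", "-");
    }

    // Regex replacement: "?" is a quantifier in a regex, so it has to be escaped.
    static String withReplaceAll(String s) {
        return s.replaceAll("\\?", "-");
    }

    public static void main(String[] args) {
        String original = "Leo?III 1.3";
        String fixed = withReplace(original);

        // Strings are immutable: the original is untouched, the result is a new String.
        System.out.println(original);                 // Leo?III 1.3
        System.out.println(fixed);                    // Leo-III 1.3
        System.out.println(withReplaceAll(original)); // Leo-III 1.3
    }
}
```

For a fixed character like this, the literal replace is usually the simpler choice, since there is nothing to escape.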

The ? is the result of a character-encoding problem, which can be introduced at many points in the data pipeline.

It may even be introduced only when the string is printed, while the string itself is still valid.
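To illustrate one way such a ? can appear, here is a small sketch that encodes a string containing a non-ASCII hyphen as US-ASCII and decodes it back. It assumes, purely for illustration, that the original character is U+2011 (NON-BREAKING HYPHEN) and that some output step uses a charset that cannot represent it. The class and method names are hypothetical, and this is not necessarily what happens in the asker's program.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it mimics an output step with a too-narrow charset.
class EncodingDemo {
    // Encode as US-ASCII and decode again. String.getBytes(Charset) replaces
    // characters the charset cannot represent with the charset's replacement
    // byte, which is '?' for US-ASCII.
    static String roundTripAscii(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String original = "Leo\u2011III 1.3"; // assumed: U+2011 between "Leo" and "III"
        System.out.println(roundTripAscii(original)); // Leo?III 1.3
        // The in-memory string still holds the real character:
        System.out.println((int) original.charAt(3) == 0x2011); // true
    }
}
```

If the damage only happens at this stage, writing the file with an explicit charset such as UTF-8 may avoid the ? without any replacing at all.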

To find out what the actual character value is, try running this code to print the Unicode escape for all non-ASCII characters found in the string:

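import java.util.TreeSet;

// Prints the Unicode escape and the glyph of each distinct character left after stripping ASCII.
public static void printNonAscii(String s) {
    TreeSet<Character> nonAscii = new TreeSet<>();
    // Strip CR, LF, and printable ASCII (0x20-0x7E); collect whatever remains.
    for (char ch : s.replaceAll("[\r\n\\x20-\\x7E]", "").toCharArray())
        nonAscii.add(ch);
    // A TreeSet iterates in ascending order, so the output is sorted by code unit.
    for (char ch : nonAscii)
        System.out.printf("\\u%04X  %s%n", (int) ch, ch);
}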

Test (source in UTF-8)

printNonAscii("Foo ? \uFFFD ç ñ © ¼");

Output

\u00A9  ©
\u00BC  ¼
\u00E7  ç
\u00F1  ñ
\uFFFD  �
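Once the diagnostic shows which character is really in the string, it can be replaced directly. The sketch below assumes the culprit is one of a few common dash-like characters (U+2010 HYPHEN, U+2011 NON-BREAKING HYPHEN, U+2013 EN DASH). That list is an assumption, so adjust it to whatever the diagnostic actually prints. The class and method names are hypothetical.

```java
// Hypothetical helper; the set of characters handled here is an assumption.
class HyphenNormalizer {
    // Maps a few Unicode dash-like characters to the ASCII hyphen-minus (U+002D).
    static String normalizeHyphens(String s) {
        return s.replace('\u2010', '-')   // HYPHEN
                .replace('\u2011', '-')   // NON-BREAKING HYPHEN
                .replace('\u2013', '-');  // EN DASH
    }

    public static void main(String[] args) {
        String fromSheet = "Leo\u2011III 1.3";
        System.out.println(normalizeHyphens(fromSheet)); // Leo-III 1.3
    }
}
```

This has to run before the lossy output step. If the string already contains a literal ? (U+003F) or U+FFFD, the original character is gone and can only be guessed.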
