
How to fix this (presumably) encoding-related error (Java, Gradle)?

I have the following method, which truncates a string to a certain size in bytes:

public class Utils {
    public static String trimStringToBytesSize(String s, int length) {
        if (s == null || length < 0) return null;
        int trimLength = Math.min(length, s.length());
        String trimmedString = s;
        while (trimmedString.getBytes().length > length && trimLength >= 0) {
            trimmedString = s.substring(0, trimLength);
            trimLength--;
        }
        return trimmedString;
    }
}

I wrote some tests for it:

@Test
public void trimStringToBytesSize() {
[...]
    trimStringToBytesSizeTestLogic("Шалом",
            6,
            "Шал"
    );
[...]
}

private void trimStringToBytesSizeTestLogic(final String input, final int
        stringLength, final String expectedResult) {
    final String actRes = Utils.trimStringToBytesSize(input, stringLength);
    Assert.assertEquals(expectedResult, actRes);
}

This test passes when I run it inside IntelliJ IDEA. However, it fails when I run it with Gradle. The error is this:

org.junit.ComparisonFailure: expected:<Шал[]> but was:<Шал[ом]>

Obviously, it has something to do with the byte sizes.

I tried to reproduce the problem in a minimal project, which contains the method and the test. The code is the same, but the problem that appears in the original project does not appear in this minimal project.

I tried to find the difference between them and compared the file encodings in the minimal and the original project. They are the same according to Notepad++ (UTF-8).

What else could cause this test failure? How can I fix it?

Notes: I'm using Java 1.8 and Gradle 2.14 (I can't upgrade to a more recent version due to the requirements of the customer).

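You are right: the byte size of a string depends heavily on the encoding used to generate the bytes. Because you call String.getBytes() without a parameter, the platform default encoding is used. On Java 8 this comes from the environment of the running JVM (the file.encoding system property), so it is typically UTF-8 on *nix systems and a single-byte encoding such as windows-1252 or ISO-8859-1 on Windows. The same test can therefore behave differently in IntelliJ IDEA and in Gradle if the two start the JVM with different settings.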
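To see which default a given JVM actually uses, one option is to print it from inside the environment that fails. This is only a minimal sketch; the class name DefaultCharsetCheck and its helper method are made up for illustration.

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // Describes the charset that String.getBytes() without arguments uses in this JVM.
    static String describeDefaultCharset() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describeDefaultCharset());
    }
}
```

Calling describeDefaultCharset() from a test and comparing the output of the IntelliJ run with the Gradle run should show whether the two JVMs disagree.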

Your string Шалом in UTF-8 bytes is [-48, -88, -48, -80, -48, -69, -48, -66, -48, -68].
Your string Шалом in ISO-8859-1 bytes is [63, 63, 63, 63, 63], which is effectively ?????, because your characters cannot be encoded in ISO-8859-1.
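These two byte arrays can be reproduced with explicit charsets, independent of the platform default. The sketch below writes Шалом with Unicode escapes so that the result does not depend on the encoding of the source file either; the class name EncodingDemo and its members are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingDemo {
    // The five Cyrillic letters of the test string, written as Unicode escapes.
    static final String SHALOM = "\u0428\u0430\u043b\u043e\u043c";

    // Each Cyrillic letter takes two bytes in UTF-8, so this yields 10 bytes.
    static String utf8Bytes() {
        return Arrays.toString(SHALOM.getBytes(StandardCharsets.UTF_8));
    }

    // ISO-8859-1 cannot represent Cyrillic; each letter is replaced by '?' (byte 63).
    static String latin1Bytes() {
        return Arrays.toString(SHALOM.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        System.out.println(utf8Bytes());   // [-48, -88, -48, -80, -48, -69, -48, -66, -48, -68]
        System.out.println(latin1Bytes()); // [63, 63, 63, 63, 63]
    }
}
```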

So when your test succeeds, the default encoding is UTF-8. When it fails, the default is ISO-8859-1 (or a similar single-byte encoding): the string is then only 5 bytes long, which is already within the limit of 6, so it is returned untouched.

You should almost never use methods like String.getBytes() or new String(byte[]) without specifying an explicit encoding; otherwise the behavior will differ between operating systems and between execution contexts.
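Applied to the method from the question, this could look like the sketch below. It assumes the size limit is meant in UTF-8 bytes (which is what the test expects) and keeps the original loop, changing only the getBytes() call; the class name Utf8Trim and the parameter name maxBytes are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8Trim {
    // Truncates s so that its UTF-8 encoding occupies at most maxBytes bytes.
    // Assumption: the limit refers to UTF-8, not to the platform default charset.
    static String trimStringToBytesSize(String s, int maxBytes) {
        if (s == null || maxBytes < 0) return null;
        int trimLength = Math.min(maxBytes, s.length());
        String trimmed = s;
        // Drop one char at a time until the encoded form fits.
        while (trimmed.getBytes(StandardCharsets.UTF_8).length > maxBytes && trimLength >= 0) {
            trimmed = s.substring(0, trimLength);
            trimLength--;
        }
        return trimmed;
    }
}
```

With the charset pinned, the call with the five-letter test string and a limit of 6 returns the first three letters on any platform. Note that, like the original, this version cuts at char boundaries, so it may split a surrogate pair for characters outside the Basic Multilingual Plane.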
