Encoding issue: .jar does not work with Cyrillic characters in UTF-8 files

Question

I have this regex as a String literal in my code:

private static final String FILE_PATTERN = "((\\s*\".*НЕКОТОРЫЕ СИМВОЛЫ .*\"\\R)([^\"].* (?!-)\\d+\\s*)+)+";

I also have input test files in UTF-8 encoding.

The problem is that when I test my program in the IDE (IntelliJ IDEA in my case), everything is OK. In particular, the regex works with Cyrillic characters in the test files.

But when I built my program (Maven) and tested the .jar file with the same test files, it turned out that the regex most likely does not work with Cyrillic characters.

Then I tested it again with a file in Windows-1251 encoding and it worked.

So my question is: how can I make my .jar work with UTF-8 files, just like in the IDE?

Thanks in advance.

[UPDATE1]

Two test files, one in UTF-8 and another in Windows-1251

I've tried to replace the Cyrillic characters with \\u codes like this:

private static final String FILE_PATTERN = "((\\s*\".*\\u041E\\u0442\\u0434\\u0435\\u043B .*\"\\R)([^\"].* (?!-)\\d+\\s*)+)+";

This doesn't work :(

[UPDATE2]

File processing starts like this:

static void processFile(String inputFile) {
    try {
        String fileStr = FileHandler.readFile(inputFile).toString();
        if (!FileParser.validateFile(fileStr)) {
            System.out.println("Sorry, input file format is invalid");
            ...

File validation looks like this:

public class FileParser {
private static final String FILE_PATTERN = "((\\s*\".*Отдел .*\"\\R)([^\"].* (?!-)\\d+\\s*)+)+";

public static boolean validateFile(String fileStr) {
    return Pattern.compile(FILE_PATTERN).matcher(fileStr).matches();
}
...

File reading is very common, I think:

public class FileHandler {
public static StringBuilder readFile(String fileName) {
    StringBuilder res = new StringBuilder();
    String temp;
    try (BufferedReader r = new BufferedReader(new FileReader((fileName)))) {
        while ((temp = r.readLine()) != null) {
            res.append(temp).append("\n");
        }
    } catch (FileNotFoundException e) { 
        System.out.println("Input file not found!");
    } catch (IOException e) {
        // log exception
    }
    return res;
}
...

Answer 1

I'll throw some possibilities at the problem.

The classes FileReader and FileWriter use the default platform encoding; before Java 11 they have no overload that takes a specified encoding. I am not sure whether relying on the default is intended here, but this is one of the alternatives:

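// Imports needed at the top of the file:
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Paths;

public static StringBuilder readFile(String fileName) {
    StringBuilder res = new StringBuilder();
    String temp;
    Charset charset = StandardCharsets.UTF_8;
    //Charset charset = Charset.forName("Windows-1251");
    // Files.newBufferedReader takes a Path and decodes with the given charset
    try (BufferedReader r = Files.newBufferedReader(Paths.get(fileName), charset)) {
        while ((temp = r.readLine()) != null) {
            res.append(temp).append("\n");
        }
    } catch (NoSuchFileException e) {
        // java.nio.file reports a missing file with NoSuchFileException
        System.out.println("Input file not found!");
    } catch (IOException e) {
        // log exception
    }
    return res;
}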

Or:

String readFile(String fileName) throws IOException {
    byte[] content = Files.readAllBytes(Paths.get(fileName));
    return new String(content, StandardCharsets.UTF_8);
}
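
To make the effect of the decoding charset concrete, here is a minimal sketch (the class DecodingDemo and the helper containsWord are hypothetical names, not part of the question's code). It encodes a short Cyrillic line as UTF-8 and then decodes the same bytes once as UTF-8 and once as windows-1251, which is presumably what the platform default was when the .jar was run. It assumes the windows-1251 charset is available in the JVM, which is normally the case.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class DecodingDemo {
    // The word "Otdel" in Cyrillic, written with Unicode escapes so that this
    // example does not depend on the encoding of the source file itself.
    static final String WORD = "\u041E\u0442\u0434\u0435\u043B";

    // Decodes the raw bytes with the given charset and reports whether
    // the Cyrillic word can still be found in the resulting text.
    static boolean containsWord(byte[] fileBytes, Charset charset) {
        String text = new String(fileBytes, charset);
        return Pattern.compile(WORD).matcher(text).find();
    }

    public static void main(String[] args) {
        // Simulated content of a UTF-8 encoded input file.
        byte[] utf8Bytes = ("\"" + WORD + " 1\"\n").getBytes(StandardCharsets.UTF_8);

        // FileReader without an explicit charset decodes with this default.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("decoded as UTF-8: "
                + containsWord(utf8Bytes, StandardCharsets.UTF_8));
        System.out.println("decoded as windows-1251: "
                + containsWord(utf8Bytes, Charset.forName("windows-1251")));
    }
}
```

If the first line printed differs between the IDE run and the .jar run, that would be consistent with the behaviour described in the question: the pattern is fine, but the text it is matched against was decoded with the wrong charset.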

Then the editor encoding of the Java sources must be the same as the encoding used by the javac compiler. One can check this by using the \uXXXX ASCII representation of such special chars: if it then suddenly works, ...

You used two backslashes, but \u0063 (the letter c) works at Java source level, and in fact instead of public class you can write publi\u0063 \u0063lass.

private static final String FILE_PATTERN =
    "((\\s*\".*\u041E\u0442\u0434\u0435\u043B .*\"\\R)([^\"].* (?!-)\\d+\\s*)+)+";

Then there is the regular expression itself, which has two Unicode flags: (?u) for Unicode-aware case folding and (?U) for Unicode character classes, which among other things decides what counts as a letter. That should not be a problem here.
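
The last two points, escapes at source level versus regex level and the two Unicode flags, can be checked with a small sketch like the following. The class UnicodeRegexDemo and the helper fullMatch are hypothetical names used only for illustration; the strings are written with escapes so the file stays pure ASCII.

```java
import java.util.regex.Pattern;

public class UnicodeRegexDemo {
    // Cyrillic capital O (U+041E) as a source-level escape: javac replaces it
    // with the character itself before the regex engine ever sees the string.
    static final String SOURCE_LEVEL = "\u041E";
    // Same letter, but the escape is left to the regex engine: the string holds
    // six characters (backslash, u, 0, 4, 1, E) and Pattern resolves them.
    static final String REGEX_LEVEL = "\\u041E";

    // True if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String capitalO = "\u041E";
        String lower = "\u043E\u0442\u0434\u0435\u043B"; // "otdel" in lower case
        String upper = "\u041E\u0422\u0414\u0415\u041B"; // "OTDEL" in upper case

        // Both escape styles describe the same character.
        System.out.println("source-level escape: " + fullMatch(SOURCE_LEVEL, capitalO));
        System.out.println("regex-level escape:  " + fullMatch(REGEX_LEVEL, capitalO));

        // Without (?U) the class \w only covers ASCII word characters.
        System.out.println("\\w+ on Cyrillic:      " + fullMatch("\\w+", lower));
        System.out.println("(?U)\\w+ on Cyrillic:  " + fullMatch("(?U)\\w+", lower));

        // Without (?u) case-insensitive matching only folds ASCII letters.
        System.out.println("(?i) on Cyrillic:     " + fullMatch("(?i)" + lower, upper));
        System.out.println("(?iu) on Cyrillic:    " + fullMatch("(?iu)" + lower, upper));
    }
}
```

Since both escape styles match the same letter, a pattern that still fails after switching to escapes suggests that the source encoding is not the culprit and that the input text was decoded incorrectly. The pattern in the question uses neither \w nor case-insensitive matching, which is why the flags should not matter there.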
