Reading Unicode character codes in Java

Question

Hi, I am reading a file that contains the following lines (please use the link to view the file):

U+0000
U+0001
U+0002
U+0003
U+0004
U+0005

using this code:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

public class fgenerator {
    public static void main(String[] args) {
        try (BufferedReader br = new BufferedReader(new FileReader(new File("C:\\UNCDUNCD.txt")))) {
            String line;
            String[] splited;
            while ((line = br.readLine()) != null) {
                splited = line.split(" ");
                System.out.println(splited[0]);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

but the output is

U+D01C
U+D01D
U+D01E
U+D01F
U+D020
U+D021

Why does this happen?
How can I get the character for each code?

Answer 1

Change the data type of line to char; if that does not work, try String.getBytes().

Answer 2

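I assume you want to take the Unicode representation on each line of the file and output the actual Unicode character that the code represents.

If we start with the loop that reads each line of the file...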

while ((line = br.readLine()) != null){             
    System.out.println( line );
}

...then all we need to do is convert the input line to a character and print it...

while ((line = br.readLine()) != null){             
    System.out.println( convert(line) ); // I just put in a method call to "convert()"
}

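So how does convert(line) turn the line into a character before printing?
As my earlier comment suggested, you take the string of hexadecimal digits after U+ and convert it to its actual numeric value. That value is the character you want to print.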
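A minimal sketch of that conversion step, assuming the input is already a clean U+nnnn string. The class and method names (CodePointSketch, toText) are made up for illustration. It uses Character.toChars instead of a cast to char, so it would also cope with code points above U+FFFF, which goes beyond what the programs in this answer handle.

```java
final class CodePointSketch {
    // Hypothetical helper: turns a string such as "U+0041" into "A".
    static String toText(String code) {
        // Skip the "U+" prefix and parse the remaining digits as base 16.
        int codePoint = Integer.parseInt(code.substring(2), 16);
        // toChars yields one char for the BMP and a surrogate pair above U+FFFF.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(toText("U+0041"));
        System.out.println(toText("U+00E9"));
    }
}
```

Whether the printed characters display correctly still depends on the encoding of your console.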

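The following is a complete program, essentially the same as yours, except that I take the filename as an argument rather than hard-coding it. I also added skipping of blank lines and rejection of invalid strings (a blank is printed for those).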

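If the line does not match the Unicode representation U+nnnn, reject it. The line is matched against "(?i)U\\+[0-9A-F]{4}", which means:
(?i) - ignore case
U\\+ - match U+, where the + must be escaped to be a literal plus sign
[0-9A-F] - match any character 0-9 or A-F (case is ignored)
{4} - exactly 4 times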
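A small sketch of how this pattern behaves on a few sample strings. The class name CodeLineCheck and the sample inputs are invented for illustration. In Java source the backslash is written twice, so the string "U\\+" reaches the regex engine as U\+.

```java
import java.util.regex.Pattern;

final class CodeLineCheck {
    // The pattern described above, compiled once and reused.
    static final Pattern CODE = Pattern.compile("(?i)U\\+[0-9A-F]{4}");

    static boolean isCode(String line) {
        // matches() must consume the whole line, so trailing text is rejected.
        return CODE.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"U+0041", "u+00e9", "U+41", "bad line", "U+0041 # comment"};
        for (String s : samples) {
            System.out.println(s + " -> " + isCode(s));
        }
    }
}
```

Only the first two samples are accepted: "U+41" has too few digits, and the last one carries extra text after the code.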

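As an update for the linked sample file (which contains # comments), I have modified the original program (below) so that it strips the comments and then converts the remaining representation.

Here is a complete program that can be run as: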
javac Reader2.java
java Reader2 inputfile.txt

I tested it with a subset of the file, inputfile.txt, starting with U+0000 on line 1 and ending with U+0138 on line 312.

import java.io.*;

public class Reader2
{
    public static void main(String... args)
    {
        final String filename = args[0];
        try (BufferedReader br = new BufferedReader(
                                    new FileReader(new File( filename ))
                                 )
            )
        {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.trim().length() > 0) { // skip blank lines
                  //System.out.println( convert(line) );
                  final Character c = convert(line);
                  if (Character.isValidCodePoint(c)) {
                        System.out.print  ( c );
                  }
                }
            }
            System.out.println();
        }
        catch(Exception e) {
            e.printStackTrace();
        }
    }

    private static char convert(final String input)
    {
        //System.out.println("Working on line: " + input);
        // The trailing "# comment" group is optional, so bare U+nnnn lines are accepted too
        if (! input.matches("(?i)U\\+[0-9A-F]{4}(\\s+#.*)?")) {
            System.err.println("Rejecting line: " + input);
            return ' ';
        }
        else {
            //System.out.println("Accepting line: " + input);
        }
        // else
        final String stripped = input.replaceFirst("\\s+#.*$", "");
        final Integer cval = Integer.parseInt(stripped.substring(2), 16);
        //System.out.println("cval = " + cval);
        return (char) cval.intValue();
    }
}
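As an alternative to replaceFirst followed by substring, a capturing group could pull the hex digits out in one step. This is only a sketch: the class name CommentedLineSketch, the decode method, and the sample line "U+0041 # LATIN CAPITAL LETTER A" are assumptions about what a commented line might look like, not taken from the linked file.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentedLineSketch {
    // Group 1 holds the four hex digits; the trailing "# comment" part is optional.
    static final Pattern LINE = Pattern.compile("(?i)U\\+([0-9A-F]{4})(\\s+#.*)?");

    // Returns the numeric code, or -1 when the line does not match.
    static int decode(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return -1;
        }
        return Integer.parseInt(m.group(1), 16);
    }

    public static void main(String[] args) {
        int code = decode("U+0041 # LATIN CAPITAL LETTER A"); // assumed line format
        System.out.println(code < 0 ? "rejected" : String.valueOf((char) code));
    }
}
```

Returning -1 for a rejected line, instead of a space character, lets the caller tell a rejected line apart from a real U+0020.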

The original program, which assumes lines containing only U+nnnn, is here.

You can run it like this:
javac Reader.java
java Reader input.txt

import java.io.*;

public class Reader
{
    public static void main(String... args)
    {
        final String filename = args[0];
        try (BufferedReader br = new BufferedReader(
                                    new FileReader(new File( filename ))
                                 )
            )
        {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.trim().length() > 0) { // skip blank lines
                  //System.out.println( line );
                    // Write all chars on one line rather than one char per line
                    System.out.print  ( convert(line) );
                }
            }
            System.out.println(); // Print a newline after all chars are printed
        }
        catch(Exception e) {      // don't catch plain `Exception` IRL
            e.printStackTrace();  // don't just print a stack trace IRL
        }
    }

    private static char convert(final String input)
    {
        // Reject any line that doesn't match U+nnnn
        if (! input.matches("(?i)U\\+[0-9A-F]{4}")) {
            System.err.println("Rejecting line: " + input);
            return ' ';
        }
        // else convert the line to the character
        final Integer cval = Integer.parseInt(input.substring(2), 16);
        //System.out.println("cval = " + cval);
        return (char) cval.intValue();
    }
}

Try it with the following as the input file:

U+0041
bad line
U+2718
U+00E9
u+0073

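When running it, either redirect standard error (java Reader input.txt 2> /dev/null) or comment out the System.err.println... line.
You should get the following output: A ✘és (the space is what convert() returns for the rejected line)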

Answer 1
0 2018-06-08 21:36:02

Answer 2
0 2018-06-11 21:25:06
