Output and preg to Unicode Strings in Java

I have a normal String property inside an object, containing accented characters. If I debug the software (with Netbeans), the variables panel shows that string correctly:

(Screenshot: variables window in debug mode)

But when I print the variable with System.out.println, I see strange things:

(Screenshot: output window)

As you can see, every "à" becomes "a'" and so on, and this leads to a wrong character count, even in a Matcher on the string.

How can I fix this? I need the accented characters, so that I get the right character count and can use the matcher on the string. I tried many ways but none of them worked; I am surely missing something.

Thanks in advance.

EDIT

(Screenshot: full output window view)

EDIT AGAIN

This is the code:

public class TextLine {
    public List<TextPosition> textPositions = null;
    public String text = "";
}

public class myStripper extends PDFTextStripper {

    public ArrayList<TextLine> lines = null;

    boolean startOfLine = true;

    public myStripper() throws IOException
    {
    }

    private void newLine() {
        startOfLine = true;
    }

    @Override
    protected void startPage(PDPage page) throws IOException
    {
        newLine();
        super.startPage(page);
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        newLine();
        super.writeLineSeparator();
    }

    @Override
    public String getText(PDDocument doc) throws IOException
    {
        lines = new ArrayList<TextLine>();
        return super.getText(doc);
    }

    @Override
    protected void writeWordSeparator() throws IOException
    {
        TextLine tmpline = lines.get(lines.size() - 1);
        tmpline.text += getWordSeparator();
        tmpline.textPositions.add(null);

        super.writeWordSeparator();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        TextLine tmpline = null;

        if (startOfLine) {
            tmpline = new TextLine();
            tmpline.text = text;
            tmpline.textPositions = textPositions;
            lines.add(tmpline);
        } else {
            tmpline = lines.get(lines.size() - 1);
            tmpline.text += text;
            tmpline.textPositions.addAll(textPositions);
        }

        if (startOfLine) {
            startOfLine = false;
        }

        super.writeString(text, textPositions);
    }
}

This comes down to how certain Unicode characters are represented.

What is a character? That question is hard to answer. Is à one character, or two (the a and the ` on top of each other)? It depends on what you consider to be a character.

The grave accents ( ` ) you are seeing are actually combining diacritical marks. Combining diacritical marks are separate Unicode characters, but many text processors combine them with the preceding character. For instance, java.text.Normalizer.normalize(str, Normalizer.Form.NFC) does this for you.
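To make the normalization step concrete, here is a minimal sketch. It assumes the accent in the extracted text is U+0300 (COMBINING GRAVE ACCENT), which the question does not confirm; the class name NfcDemo and the helper compose are made up for illustration.

```java
import java.text.Normalizer;

final class NfcDemo {
    // Composes "base letter + combining mark" sequences into a single
    // precomposed character wherever Unicode defines one (NFC form).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Assumption: 'e' followed by U+0300 COMBINING GRAVE ACCENT
        String decomposed = "e\u0300";
        String composed = compose(decomposed);

        System.out.println(decomposed.length());        // 2
        System.out.println(composed.length());          // 1
        System.out.println(composed.equals("\u00e8"));  // true (precomposed è)
    }
}
```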

The library you are using (Apache PDFBox) possibly normalizes the text, so that each diacritic is grouped with the preceding character. As a result, some TextPosition instances in your text contain two code points (more precisely, e` and a` ). That is why the list of TextPosition instances has length 65.

However, your String , which implements CharSequence , holds 67 chars, because each of the two diacritics takes up one char of its own.
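The 65 versus 67 difference can be reproduced with the line from the question. This is only a sketch: the string below is retyped with explicit U+0300 marks, which is an assumption about what the extraction actually produced, and LengthDemo and countIgnoringCombiningMarks are illustrative names.

```java
import java.text.Normalizer;

final class LengthDemo {
    // Sample line from the question, written with explicit combining
    // grave accents (assumed to be U+0300) after 'e' and 'a'.
    static final String DECOMPOSED =
        "dere che Geova e\u0300 il Creatore e Colui che da\u0300 la vita. Probabilmen-";

    // Counts chars that are not non-spacing combining marks.
    static int countIgnoringCombiningMarks(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.getType(s.charAt(i)) != Character.NON_SPACING_MARK) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // length() counts chars, so each combining mark adds one
        System.out.println(DECOMPOSED.length());                               // 67
        // code points do not help either: the mark is its own code point
        System.out.println(DECOMPOSED.codePointCount(0, DECOMPOSED.length())); // 67
        System.out.println(countIgnoringCombiningMarks(DECOMPOSED));           // 65
        System.out.println(
            Normalizer.normalize(DECOMPOSED, Normalizer.Form.NFC).length());   // 65
    }
}
```

Note that codePointCount gives the same answer as length() here, since every character involved is in the Basic Multilingual Plane; only skipping the marks or normalizing yields 65.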

System.out.println() just prints each char of the string in turn, and the output window renders that as "dere che Geova e` il Creatore e Colui che da` la vita. Probabilmen-"
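Since both the console and the debugger decide for themselves how to draw combining marks, it can help to print the code points instead of the glyphs. The helper below is a hypothetical sketch (CodePointDump and describe are not part of any library) that lists each code point in U+XXXX form.

```java
import java.util.stream.Collectors;

final class CodePointDump {
    // Renders every code point of the string as U+XXXX, separated by spaces,
    // so the content can be inspected independently of font rendering.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(describe("e\u0300")); // U+0065 U+0300 (letter plus combining mark)
        System.out.println(describe("\u00e8"));  // U+00E8 (single precomposed character)
    }
}
```

Calling describe on the text field of a TextLine would show whether the accents are stored as separate marks or as precomposed letters.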


Then why does the Netbeans debugger show "dere che Geova è il Creatore e Colui che dà la vita. Probabilmen-" as the value of the string?

That is simply because the Netbeans debugger displays the normalized text for you.
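For the character count and the Matcher mentioned in the question, one possible approach is to normalize the text to NFC before counting or matching. The sketch below assumes, again, that the accents are U+0300 combining marks; MatchDemo, ACCENTED and countMatches are illustrative names, and the pattern only looks for precomposed à and è.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchDemo {
    // Matches the precomposed letters à (U+00E0) and è (U+00E8).
    static final Pattern ACCENTED = Pattern.compile("[\u00e0\u00e8]");

    // Counts how many times the pattern is found in the given string.
    static int countMatches(String s) {
        Matcher m = ACCENTED.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Assumed form of the extracted text: letters followed by U+0300
        String raw = "Geova e\u0300 il Creatore e Colui che da\u0300 la vita";
        String normalized = Normalizer.normalize(raw, Normalizer.Form.NFC);

        System.out.println(countMatches(raw));        // 0: no precomposed letters yet
        System.out.println(countMatches(normalized)); // 2: è and à are now single chars
    }
}
```

In the stripper from the question, the same Normalizer call could be applied to the text of each TextLine once the line is complete; whether the resulting indices line up with the textPositions list depends on how PDFBox grouped the characters.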
