Printing and pre-reading Unicode strings in Java

Question

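I have an ordinary String property in an object, and it contains accented characters. When I debug the software (using NetBeans), the variables panel shows the string correctly:

But when I print the variable with System.out.println, I see something odd:

As you can see, every "à" becomes "a`" and so on, which gives a wrong character count, even in a Matcher on the string.

How can I fix this? I need the accented characters, the correct character count, and a Matcher that works on them. I have tried many approaches without success; I must be missing something.

Thanks in advance.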

EDIT

EDIT AGAIN

Here is the code:

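import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// PDFBox imports; these package names assume PDFBox 2.x
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class TextLine {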
    public List<TextPosition> textPositions = null;
    public String text = "";
}

public class myStripper extends PDFTextStripper {

    public ArrayList<TextLine> lines = null;

    boolean startOfLine = true;

    public myStripper() throws IOException
    {
    }

    private void newLine() {
        startOfLine = true;
    }

    @Override
    protected void startPage(PDPage page) throws IOException
    {
        newLine();
        super.startPage(page);
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        newLine();
        super.writeLineSeparator();
    }

    @Override
    public String getText(PDDocument doc) throws IOException
    {
        lines = new ArrayList<TextLine>();
        return super.getText(doc);
    }

    @Override
    protected void writeWordSeparator() throws IOException
    {
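        // Append the separator to the current line; the null entry stands in
        // for the separator, which has no TextPosition of its own.
        TextLine tmpline = lines.get(lines.size() - 1);
        tmpline.text += getWordSeparator();
        tmpline.textPositions.add(null);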

        super.writeWordSeparator();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        TextLine tmpline = null;

        if (startOfLine) {
            tmpline = new TextLine();
            tmpline.text = text;
            tmpline.textPositions = textPositions;
            lines.add(tmpline);
        } else {
            tmpline = lines.get(lines.size() - 1);
            tmpline.text += text;
            tmpline.textPositions.addAll(textPositions);
        }

        if (startOfLine) {
            startOfLine = false;
        }

        super.writeString(text, textPositions);
    }
}

Answer 1

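This is about how some Unicode characters are represented.

What is a character? That question is hard to answer. Is à one character, or two (an a and a ` stacked on top of each other)? It depends on what you consider a character.

The grave accent (`) you are seeing is actually a combining diacritical mark. Combining diacritical marks are separate Unicode characters, but many text processors combine them with the preceding character. For example, java.text.Normalizer.normalize(str, Normalizer.Form.NFC) does this for you.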
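A minimal sketch of what that normalization does, assuming the mark in question is the combining grave accent U+0300. A plain spacing grave accent (U+0060) is not a combining mark, so NFC leaves it alone. The class name NormalizeSketch and the helper compose are made up for illustration.

```java
import java.text.Normalizer;

final class NormalizeSketch {
    // Hypothetical helper: compose base letters with the combining marks that follow them (NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "a\u0300";   // 'a' followed by COMBINING GRAVE ACCENT
        String composed = compose(decomposed);

        System.out.println(decomposed.length());        // 2: base letter plus mark
        System.out.println(composed.length());          // 1: single precomposed letter
        System.out.println(composed.equals("\u00E0"));  // true: LATIN SMALL LETTER A WITH GRAVE
        System.out.println(compose("a`").length());     // 2: spacing grave accent is not composed
    }
}
```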

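The library you are using (Apache PDFBox) may normalize the text so that diacritical marks are combined with the preceding character. As a result, some TextPosition instances in your text contain two code points (specifically, e` and a`), and the list of TextPosition instances has length 65.

Your String, however, is really a CharSequence, and it contains 67 chars, because each diacritical mark takes up one char of its own.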
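The difference between the two counts can be reproduced without PDFBox. The sketch below assumes the line contains the combining grave accent U+0300 after "e" and "a", and a space after "vita." as in the debugger rendering; CharCountSketch and perceivedLength are hypothetical names. It uses java.text.BreakIterator to count user-perceived characters, so a base letter and its combining mark count once.

```java
import java.text.BreakIterator;

final class CharCountSketch {
    // Count user-perceived characters: a base letter plus its combining marks counts once.
    static int perceivedLength(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed content of the extracted line, with U+0300 as the accent.
        String line = "dere che Geova e\u0300 il Creatore e Colui che da\u0300 la vita. Probabilmen-";

        System.out.println(line.length());          // 67 chars, marks counted separately
        System.out.println(perceivedLength(line));  // 65 perceived characters
    }
}
```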

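System.out.println() simply prints every char of the string, which is rendered as "dere che Geova e` il Creatore e Colui che da` la vita.Probabilmen-".

So why does the NetBeans debugger show "dere che Geova è il Creatore e Colui che dà la vita. Probabilmen-" as the value of the string?

Simply because the NetBeans debugger displays the normalized text for you.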
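One way to get the accented letters back for printing and matching is to normalize the string to NFC yourself before using it. This is only a sketch under the assumption that the marks are combining characters such as U+0300; NormalizedMatchSketch, toNfc and countMatches are illustrative names, not part of PDFBox.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NormalizedMatchSketch {
    // Compose base letters and combining marks into precomposed characters.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Count how many times the regex is found in the text.
    static int countMatches(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String raw = "Colui che da\u0300 la vita";  // 'a' followed by a combining grave accent
        String nfc = toNfc(raw);

        // The precomposed letter is only found after normalization.
        System.out.println(countMatches(raw, "\u00E0"));  // 0
        System.out.println(countMatches(nfc, "\u00E0"));  // 1
        System.out.println(raw.length() - nfc.length());  // 1: one mark was merged
    }
}
```

Note that a normalized string is shorter than the original, so indexes into it no longer line up one-to-one with the chars of the unnormalized text.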

Answer 1 (3 votes, accepted), 2017-09-13 11:56:19
