
Processing UTF-8 characters in a text file in Java

I have a text file that contains the following sample UTF-8 text:

ኣእምሮኣዊ/ADJ ጥዕና/N ።/PUN

ቅድሚ/PRE ብዙሕ/ADJ ዓመታት/N “/PUN ኣእምሮኣዊ/ADJ ስንክልና/N ብጋኔን/N ወይ/CON እከይ/ADJ መናፍስቲ/N ኢዩ/V_AUX ዝመጽእ/V_REL “/PUN ዝብል/V_REL ግጉይ/ADJ ኣመለኻኽታ/N ነይሩ/V_GER ።/PUN

ከም/CON ውጺኢቱ/N ድማ/CON ኣእምሮኣዊ/ADJ ስንክልና/N ዘጋጠሞም/ADJ ኣባላት/N ናይ/PRE ሓደ/NUM ሕብረተ-ሰብ/N ብኣሰቃቕን/ADJ ኢሰብኣውን/ADJ ኣገባብ/N ይተሓዙ/V_IMF ነይሮም/V_AUX ።/PUN

I am using the LingPipe implementation of the HMM POS tagger for the Brown Corpus.

The BrownPosCorpus class reads the zipped POS corpus as follows:

public class BrownPosCorpus implements PosCorpus {

    private final File mBrownZipFile;

    public BrownPosCorpus(File brownZipFile) {
        mBrownZipFile = brownZipFile;
    }

    public Parser<ObjectHandler<Tagging<String>>> parser() {
        return new BrownPosParser();
    }

    public Iterator<InputSource> sourceIterator() throws IOException {
        return new BrownSourceIterator(mBrownZipFile);
    }

    static class BrownSourceIterator extends Iterators.Buffered<InputSource> {
        private ZipInputStream mZipIn = null;

        public BrownSourceIterator(File brownZipFile) throws IOException {
            FileInputStream fileIn = new FileInputStream(brownZipFile);
            mZipIn = new ZipInputStream(fileIn);
        }

        public InputSource bufferNext() {
            ZipEntry entry = null;
            try {
                while ((entry = mZipIn.getNextEntry()) != null) {
                    if (entry.isDirectory()) continue;
                    String name = entry.getName();
                    if (name.equals("brown/CONTENTS")
                        || name.equals("brown/README")) continue;
                    return new InputSource(mZipIn);
                }
            } catch (IOException e) {
                // ignore and close and return null
            }
            Streams.closeQuietly(mZipIn);
            return null;
        }
    }
}

The BrownPosParser class parses the zipped Brown POS corpus as follows:

public class BrownPosParser
     extends StringParser<ObjectHandler<Tagging<String>>> {

    @Override
    public void parseString(char[] cs, int start, int end) {
        String in = new String(cs,start,end-start);
        String[] sentences = in.split("\n");
        for (int i = 0; i < sentences.length; ++i)
            if (!Strings.allWhitespace(sentences[i]))
                processSentence(sentences[i]);
    }

    public String normalizeTag(String rawTag) {
        String tag = rawTag;
        String startTag = tag;
        // remove plus, default to first
        int splitIndex = tag.indexOf('+');
        if (splitIndex >= 0)
            tag = tag.substring(0,splitIndex);

        int lastHyphen = tag.lastIndexOf('-');
        if (lastHyphen >= 0) {
            String first = tag.substring(0,lastHyphen);
            String suffix = tag.substring(lastHyphen+1);
            if (suffix.equalsIgnoreCase("HL")
                || suffix.equalsIgnoreCase("TL")
                || suffix.equalsIgnoreCase("NC")) {
                tag = first;
            }
        }

        int firstHyphen = tag.indexOf('-');
        if (firstHyphen > 0) {
            String prefix = tag.substring(0,firstHyphen);
            String rest = tag.substring(firstHyphen+1);
            if (prefix.equalsIgnoreCase("FW")
                || prefix.equalsIgnoreCase("NC")
                || prefix.equalsIgnoreCase("NP"))
                tag = rest;
        }

        // neg last, and only if not whole thing
        int negIndex = tag.indexOf('*');
        if (negIndex > 0) {
            if (negIndex == tag.length()-1)
                tag = tag.substring(0,negIndex);
            else
                tag = tag.substring(0,negIndex)
                    + tag.substring(negIndex+1);
        }
        // multiple runs to normalize
        return tag.equals(startTag) ? tag : normalizeTag(tag);
    }

    private void processSentence(String sentence) {
        String[] tagTokenPairs = sentence.split(" ");
        List<String> tokenList = new ArrayList<String>(tagTokenPairs.length);
        List<String> tagList = new ArrayList<String>(tagTokenPairs.length);

        for (String pair : tagTokenPairs) {
            int j = pair.lastIndexOf('/');
            String token = pair.substring(0,j);
            String tag = normalizeTag(pair.substring(j+1));
            tokenList.add(token);
            tagList.add(tag);
        }
        Tagging<String> tagging
            = new Tagging<String>(tokenList,tagList);
        getHandler().handle(tagging);
    }
}

The problem is that the following exception occurred while parsing the UTF-8 corpus. The key problem is in BrownPosParser.java:

java.lang.StringIndexOutOfBoundsException: String index out of range: -1

[java]     at java.lang.String.substring(String.java:1967)

[java]     at BrownPosParser.processSentence(BrownPosParser.java:72)

The full stack trace is given below:

 C:\Lingpipe-Ver-4.1.2\Experiments\NER\posTags>ant eval-brown
 Buildfile: C:\Lingpipe-Ver-4.1.2\Experiments\NER\posTags\build.xml

 compile:
 [javac] Compiling 11 source files to C:\Lingpipe-Ver-4.1.2\Experiments\NER\posTags\build\classes

 eval-brown:
 [java] COMMAND PARAMETERS:
 [java]   Sent eval rate=5
 [java]   Toks before eval=1000000
 [java]   Max n-best eval=32
 [java]   Max n-gram=8
 [java]   Num chars=128
 [java]   Lambda factor=8.0
 [java] Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
 [java]     at java.lang.String.substring(String.java:1967)
 [java]     at BrownPosParser.processSentence(BrownPosParser.java:72)
 [java]     at BrownPosParser.parseString(BrownPosParser.java:20)
 [java]     at com.aliasi.corpus.StringParser.parse(StringParser.java:71)
 [java]     at EvaluatePos.parseCorpus(EvaluatePos.java:123)
 [java]     at EvaluatePos.run(EvaluatePos.java:75)
 [java]     at EvaluatePos.main(EvaluatePos.java:183)
 [java] Java Result: 1

Which part of the code should I modify to parse the UTF-8 POS corpus properly?

Any help is much appreciated.

Not sure if it solves your issue, but to set the charset, change this line:

mZipIn = new ZipInputStream(fileIn);

to

mZipIn = new ZipInputStream(new BufferedInputStream(fileIn), Charset.forName("UTF-8"));
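One caveat about the change above: in the JDK, the Charset passed to the ZipInputStream constructor is used to decode entry names and comments, not the bytes inside each entry. The entry contents still have to be decoded explicitly. Below is a minimal JDK-only sketch of that idea. The class name Utf8ZipDemo, its helper methods, and the entry name are made up for illustration, and the sample token is an Ethiopic-script string written with Unicode escapes so that the source compiles under any default encoding. In the LingPipe code from the question, the analogous step would probably be calling setEncoding("UTF-8") on the InputSource before returning it, assuming it is org.xml.sax.InputSource and that the parser honors the declared encoding; I have not verified that.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class Utf8ZipDemo {

    // Builds a one-entry zip archive in memory so the example is self-contained.
    static byte[] zipOf(String entryName, String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zipOut = new ZipOutputStream(bytes, StandardCharsets.UTF_8)) {
            zipOut.putNextEntry(new ZipEntry(entryName));
            zipOut.write(text.getBytes(StandardCharsets.UTF_8));
            zipOut.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Returns the text of the first non-directory entry, decoded as UTF-8,
    // or null if the archive has no such entry.
    static String readFirstEntry(byte[] zipBytes) throws IOException {
        try (ZipInputStream zipIn = new ZipInputStream(
                new ByteArrayInputStream(zipBytes), StandardCharsets.UTF_8)) {
            ZipEntry entry;
            while ((entry = zipIn.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                // The reader, not the ZipInputStream, decides how content bytes become chars.
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zipIn, StandardCharsets.UTF_8));
                StringBuilder text = new StringBuilder();
                int c;
                while ((c = reader.read()) != -1) {
                    text.append((char) c);
                }
                return text.toString();
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Ethiopic-script token/tag sample, written with Unicode escapes.
        String sample = "\u12A3\u12A5\u121D\u122E\u12A3\u12CA/ADJ \u1362/PUN";
        String roundTrip = readFirstEntry(zipOf("corpus/sample.txt", sample));
        System.out.println("Round trip preserved text: " + sample.equals(roundTrip));
    }
}
```

If the text survives the round trip unchanged, the decoding step is working, and any remaining parse error is more likely caused by the corpus layout than by the charset.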

Locate and eliminate consecutive spaces and any space at the beginning or end of a line, and check that every token in the corpus has the / separator.

It works.
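For context on why whitespace triggers the exception: String.split(" ") returns an empty string for every extra space (for example, "a  b".split(" ") gives "a", "", "b"), lastIndexOf('/') then returns -1 for that empty string, and substring(0, -1) throws StringIndexOutOfBoundsException. As an alternative to cleaning the corpus by hand, the splitting step could be made tolerant. The following is only a sketch: TokenTagSplitter and its split method are hypothetical names, not part of LingPipe, and failing fast on an entry without a slash is one possible policy among several.

```java
import java.util.ArrayList;
import java.util.List;

class TokenTagSplitter {

    // Splits one corpus line into {token, tag} pairs.
    // Runs of whitespace and leading/trailing whitespace (including a stray '\r'
    // left over from Windows line endings) are tolerated. An entry with no '/',
    // an empty token, or an empty tag is rejected with a message naming it.
    static List<String[]> split(String sentence) {
        List<String[]> pairs = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return pairs;
        }
        for (String pair : trimmed.split("\\s+")) {
            int slash = pair.lastIndexOf('/');
            if (slash <= 0 || slash == pair.length() - 1) {
                throw new IllegalArgumentException(
                        "Malformed token/tag pair: '" + pair + "'");
            }
            pairs.add(new String[] {
                pair.substring(0, slash), pair.substring(slash + 1)
            });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Leading, trailing, and doubled spaces are all accepted here.
        for (String[] pair : split("  The/AT   dog/NN barks/VBZ ")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

In processSentence, the loop over sentence.split(" ") could be replaced by a loop over pairs like these, with normalizeTag applied to the second element. Throwing on malformed entries rather than silently skipping them keeps corpus errors visible, which matters when the tagger is being evaluated on that data.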

