Reading strings from a RandomAccessFile when the file uses a different encoding

Question

I have a large file encoded in Windows-1250. Each line is just a single Polish word, one after another:

zając
dzieło
kiepsko
etc

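I need to pick 10 unique random lines from this file, and do it very quickly. I got that working, but when I print the words they come out with the wrong encoding [zaj?c, dzie?o, kiepsko ...], and I need UTF-8. So I changed my code to read bytes from the file instead of just reading lines, and ended up with this code: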

public List<String> getRandomWordsFromDictionary(int number) {
    List<String> randomWords = new ArrayList<String>();
    File file = new File("file.txt");
    try {
        RandomAccessFile raf = new RandomAccessFile(file, "r");

        for(int i = 0; i < number; i++) {
            Random random = new Random();
            int startPosition;
            String word;
            do {
                startPosition = random.nextInt((int)raf.length());
                raf.seek(startPosition);
                raf.readLine();
                word = grabWordFromDictionary(raf);
            } while(checkProbability(word));
            System.out.println("Word: " + word);
            randomWords.add(word);
        }
    } catch (IOException ioe) {
        logger.error(ioe.getMessage(), ioe);
    }
    return randomWords;
}

private String grabWordFromDictionary(RandomAccessFile raf) throws IOException {
    byte[] wordInBytes = new byte[15];
    int counter = 0;
    byte wordByte;
    char wordChar;
    String convertedWord;
    boolean stop = true;
    do {
        wordByte = raf.readByte();
        wordChar = (char)wordByte;
        if(wordChar == '\n' || wordChar == '\r' || wordChar == -1) {
            stop = false;
        } else {
            wordInBytes[counter] = wordByte;
            counter++;
        }           
    } while(stop);
    if(wordInBytes.length > 0) {
        convertedWord = new String(wordInBytes, "UTF8");
        return convertedWord;
    } else {
        return null;
    }
}

private boolean checkProbability(String word) {
    if(word.length() > MAX_LENGTH_LINE) {
        return true;
    } else {
        double randomDouble = new Random().nextDouble();
        double probability = (double) MIN_LENGTH_LINE / word.length();
        return probability <= randomDouble;         
    }
}

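But something is still wrong. Could you take a look at this code and help me? Maybe you can see an obvious mistake that is not obvious to me. I would appreciate any help.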

Answer 1

Your file is encoded in Windows-1250, so you need to decode it as Windows-1250, not as UTF-8. Once it is decoded, you can save it as UTF-8.

// needs: import java.nio.charset.Charset;
Charset w1250 = Charset.forName("Windows-1250");
convertedWord = new String(wordInBytes, w1250);
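
As a fuller sketch of the same idea (not part of the original answer): the class below reads raw bytes up to the next line break and decodes them as Windows-1250, which yields a correct Java String that can then be printed or saved as UTF-8. The class name Cp1250Words and the helpers readWord and fixReadLine are hypothetical names chosen for illustration. It assumes the runtime ships the windows-1250 charset, which standard JDK builds do. Unlike the code in the question, it decodes only the bytes actually read rather than a fixed 15-byte buffer, and it uses read(), which reports end of file as -1 instead of throwing.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.PrintStream;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper class, for illustration only.
class Cp1250Words {
    // The file's actual encoding, as stated in the question.
    static final Charset WINDOWS_1250 = Charset.forName("windows-1250");

    /** Reads bytes up to the next line break or end of file and decodes them as Windows-1250. */
    static String readWord(RandomAccessFile raf) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int b;
        // read() returns -1 at end of file; the terminating '\n' or '\r' is consumed.
        // With "\r\n" line endings the '\n' is left unread, so the next call returns "".
        while ((b = raf.read()) != -1 && b != '\n' && b != '\r') {
            buffer.write(b);
        }
        return new String(buffer.toByteArray(), WINDOWS_1250);
    }

    /** Alternative: repairs a string that was returned by RandomAccessFile.readLine(). */
    static String fixReadLine(String rawLine) {
        // readLine() maps every byte to the char with the same value (0-255),
        // so encoding as ISO-8859-1 recovers the original bytes exactly.
        return new String(rawLine.getBytes(StandardCharsets.ISO_8859_1), WINDOWS_1250);
    }

    public static void main(String[] args) throws IOException {
        // Build a small Windows-1250 sample file (\u0105 is a-ogonek, \u0142 is l-stroke).
        File file = File.createTempFile("words", ".txt");
        file.deleteOnExit();
        Files.write(file.toPath(), "zaj\u0105c\ndzie\u0142o\nkiepsko\n".getBytes(WINDOWS_1250));

        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            String first = readWord(raf);                // decoded byte by byte
            String second = fixReadLine(raf.readLine()); // readLine() output, repaired
            // Encode the output as UTF-8 regardless of the platform default.
            PrintStream out = new PrintStream(System.out, true, "UTF-8");
            out.println(first + " " + second);
        }
    }
}
```

fixReadLine relies on the documented behavior of RandomAccessFile.readLine(), which builds each char from a single byte and so cannot decode Windows-1250 by itself. Whether the printed text looks right also depends on the console or log viewer interpreting the output as UTF-8.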

Answer 1: score 4, accepted, 2012-12-13 22:04:39
