Read String with RandomAccessFile from file with different encoding
I have a big file encoded in code page 1250. Each line is a single Polish word, one after another:
zając
dzieło
kiepsko
etc
I need to pick 10 random unique lines from this file fairly quickly. I did this, but when I print the words they have the wrong encoding [zaj?c, dzie?o, kiepsko...]; I need UTF-8. So I changed my code to read bytes from the file instead of reading whole lines, and ended up with this code:
public List<String> getRandomWordsFromDictionary(int number) {
    List<String> randomWords = new ArrayList<String>();
    File file = new File("file.txt");
    try {
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        for(int i = 0; i < number; i++) {
            Random random = new Random();
            int startPosition;
            String word;
            do {
                startPosition = random.nextInt((int)raf.length());
                raf.seek(startPosition);
                raf.readLine();
                word = grabWordFromDictionary(raf);
            } while(checkProbability(word));
            System.out.println("Word: " + word);
            randomWords.add(word);
        }
    } catch (IOException ioe) {
        logger.error(ioe.getMessage(), ioe);
    }
    return randomWords;
}
private String grabWordFromDictionary(RandomAccessFile raf) throws IOException {
    byte[] wordInBytes = new byte[15];
    int counter = 0;
    byte wordByte;
    char wordChar;
    String convertedWord;
    boolean stop = true;
    do {
        wordByte = raf.readByte();
        wordChar = (char)wordByte;
        if(wordChar == '\n' || wordChar == '\r' || wordChar == -1) {
            stop = false;
        } else {
            wordInBytes[counter] = wordByte;
            counter++;
        }
    } while(stop);
    if(wordInBytes.length > 0) {
        convertedWord = new String(wordInBytes, "UTF8");
        return convertedWord;
    } else {
        return null;
    }
}
private boolean checkProbability(String word) {
    if(word.length() > MAX_LENGTH_LINE) {
        return true;
    } else {
        double randomDouble = new Random().nextDouble();
        double probability = (double) MIN_LENGTH_LINE / word.length();
        return probability <= randomDouble;
    }
}
But something is wrong. Could you look at this code and help me? Maybe you can see some errors that are obvious to you but not to me. I would appreciate any help.
Your file is in 1250, so you need to decode it as 1250, not UTF-8. You can still save it as UTF-8 after decoding.
Charset w1250 = Charset.forName("Windows-1250");
convertedWord = new String(wordInBytes, w1250);
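To put the two lines above in context, here is a minimal sketch of a byte-level line reader that decodes as Windows-1250. The class name `Cp1250LineReader` and the method `readLine1250` are made up for illustration and are not part of the question's code. The sketch also assumes it is acceptable to collect the bytes in a growable buffer rather than the fixed 15-byte array, so that only the bytes actually read get decoded (the original passes the whole array, including unused trailing zero bytes, to `new String`). It uses `raf.read()`, which returns -1 at end of file, where `readByte()` would throw an `EOFException`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Cp1250LineReader {

    // The encoding the file is actually stored in.
    private static final Charset WINDOWS_1250 = Charset.forName("windows-1250");

    /**
     * Reads bytes from the current position up to the next '\n' (or end of
     * file) and decodes them as Windows-1250.
     * Returns null if the position is already at end of file.
     */
    public static String readLine1250(RandomAccessFile raf) throws IOException {
        int b = raf.read();              // -1 signals end of file
        if (b == -1) {
            return null;
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        while (b != -1 && b != '\n') {
            if (b != '\r') {             // drop the CR of a CRLF line ending
                buffer.write(b);
            }
            b = raf.read();
        }
        // Decode only the bytes that were read, using the file's encoding.
        return new String(buffer.toByteArray(), WINDOWS_1250);
    }

    public static void main(String[] args) throws IOException {
        // Build a small Windows-1250 sample file ("zając", "dzieło", "kiepsko").
        Path path = Files.createTempFile("words", ".txt");
        try {
            Files.write(path,
                    "zaj\u0105c\r\ndzie\u0142o\r\nkiepsko\r\n".getBytes(WINDOWS_1250));
            try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r")) {
                raf.seek(2);             // land somewhere inside the first word
                readLine1250(raf);       // discard the partial line
                String word = readLine1250(raf);
                // A Java String has no encoding of its own; UTF-8 only matters
                // when the text is turned back into bytes.
                byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
                System.out.println(word + " (" + utf8.length + " bytes as UTF-8)");
            }
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```

Whether the printed word shows up correctly still depends on the console's own encoding, so a `?` in the output does not necessarily mean the decoding step failed.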