Java: Get random line from a big file
I've seen how to get a random line from a text file, but the method stated there (the accepted answer) runs horrendously slowly. It is very slow on my 598KB text file, and still slow on a shortened version of that file that keeps only one out of every 20 lines, at 20KB. I never get past the "a" section (it's a wordlist).
The original file has 64141 lines; the shortened one has 2138 lines. To generate these files, I took the Linux Mint 11 /usr/share/dict/american-english wordlist and used grep to remove anything with uppercase or an apostrophe (grep -v [[:upper:]] | grep -v \').
The code I'm using is:
String result = null;
final Random rand = new Random();
int n = 0;
for (final Scanner sc = new Scanner(wordList); sc.hasNext();) {
    n++;
    if (rand.nextInt(n) == 0) {
        final String line = sc.nextLine();
        boolean isOK = true;
        for (final char c : line.toCharArray()) {
            if (!(constraints.isAllowed(c))) {
                isOK = false;
                break;
            }
        }
        if (isOK) {
            result = line;
        }
        System.out.println(result);
    }
}
return result;
which is slightly adapted from Itay's answer.
The object constraints is a KeyboardConstraints, which basically has the one method isAllowed(char):
public boolean isAllowed(final char key) {
    if (allAllowed) {
        return true;
    } else {
        return allowedKeys.contains(key);
    }
}
where allowedKeys and allAllowed are provided in the constructor. The constraints variable used here has "aeouhtns".toCharArray() as its allowedKeys, with allAllowed off.
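For readers who want to reproduce the setup, here is a minimal sketch of what such a class might look like. Only isAllowed(char) appears in the question; the constructor signature, the Set<Character> field and the allows(String) helper are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical reconstruction: only isAllowed(char) is shown in the question.
// The constructor signature and the Set<Character> field are assumptions.
class KeyboardConstraints {
    private final boolean allAllowed;
    private final Set<Character> allowedKeys = new HashSet<>();

    KeyboardConstraints(boolean allAllowed, char[] allowedKeys) {
        this.allAllowed = allAllowed;
        for (char c : allowedKeys) {
            this.allowedKeys.add(c); // autoboxed to Character
        }
    }

    public boolean isAllowed(final char key) {
        return allAllowed || allowedKeys.contains(key);
    }

    // Assumed helper (not in the question): true if every character is allowed.
    public boolean allows(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (!isAllowed(word.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}

class KeyboardConstraintsDemo {
    public static void main(String[] args) {
        KeyboardConstraints c =
                new KeyboardConstraints(false, "aeouhtns".toCharArray());
        System.out.println(c.allows("house"));  // every letter is in the set
        System.out.println(c.allows("worker")); // 'w', 'r' and 'k' are not
    }
}
```

A hash set keeps each lookup cheap, which matters when every character of every line in the wordlist is checked.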
Essentially, what I want the method to do is to pick a random word that satisfies the constraints (e.g. for these constraints, "outvote" would work, but not "worker", because "w" is not in "aeouhtns".toCharArray()).
How can I do this?
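You have a bug in your implementation. You should read the line before you choose a random number. As written, sc.nextLine() is only called when rand.nextInt(n) == 0, which happens with probability 1/n, so the scanner advances less and less often while n keeps growing. That is why you never get past the "a" section. Change this: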
n++;
if (rand.nextInt(n) == 0) {
final String line = sc.nextLine();
To this (as in the original answer):
n++;
final String line = sc.nextLine();
if (rand.nextInt(n) == 0) {
You should also check the constraints before drawing a random number. If a line fails the constraints it should be ignored, something like this:
n++;
String line;
do {
    if (!sc.hasNext()) { return result; }
    line = sc.nextLine();
} while (!meetsConstraints(line));
if (rand.nextInt(n) == 0) {
    result = line;
}
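Putting the two fixes together, the whole selection loop could be sketched as below. This is one possible arrangement rather than the code from the linked answer: the class name ReservoirPick, the pick signature and the use of Predicate<String> for the constraint check are all assumptions. The important points are that every line is consumed and that only accepted lines increase n.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Illustrative sketch; the class and method names are hypothetical.
class ReservoirPick {
    // Returns one accepted line chosen uniformly at random,
    // or null if no line satisfies the predicate.
    static String pick(Iterator<String> lines, Predicate<String> accept, Random rand) {
        String result = null;
        int n = 0; // number of accepted lines seen so far
        while (lines.hasNext()) {
            String line = lines.next();   // always consume the line first
            if (!accept.test(line)) {
                continue;                 // rejected lines do not count toward n
            }
            n++;
            if (rand.nextInt(n) == 0) {   // keep the new line with probability 1/n
                result = line;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("house", "worker", "tense", "zebra", "noon");
        String allowed = "aeouhtns";
        Predicate<String> ok = w -> w.chars().allMatch(ch -> allowed.indexOf(ch) >= 0);
        // Prints one of "house", "tense" or "noon", each with equal probability.
        System.out.println(pick(words.iterator(), ok, new Random()));
    }
}
```

For a real file, an iterator over its lines can be obtained with, for example, new BufferedReader(new FileReader(path)).lines().iterator() on Java 8 or later.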
I would read in all the lines, save these somewhere and then select a random line from that. This takes a trivial amount of time because a single file of less than 1 MB is a trivial size these days.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

public class Main {
    public static void main(String... args) throws IOException {
        long start = System.nanoTime();
        RandomDict dict = RandomDict.load("/usr/share/dict/american-english");
        final int count = 1000000;
        for (int i = 0; i < count; i++)
            dict.nextWord();
        long time = System.nanoTime() - start;
        System.out.printf("Took %.3f seconds to load and find %,d random words.", time / 1e9, count);
    }
}

class RandomDict {
    public static final String[] NO_STRINGS = {};
    final Random random = new Random();
    final String[] words;

    public RandomDict(String[] words) {
        this.words = words;
    }

    // Reads the whole file once, skipping words with apostrophes and
    // de-duplicating after lower-casing.
    public static RandomDict load(String filename) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(filename));
        Set<String> words = new LinkedHashSet<String>();
        try {
            for (String line; (line = br.readLine()) != null; ) {
                if (line.indexOf('\'') >= 0) continue;
                words.add(line.toLowerCase());
            }
        } finally {
            br.close();
        }
        return new RandomDict(words.toArray(NO_STRINGS));
    }

    // Picking a word is a single array lookup.
    public String nextWord() {
        return words[random.nextInt(words.length)];
    }
}
prints
Took 0.091 seconds to load and find 1,000,000 random words.
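The same load-once idea can also honour the keyboard constraints from the question by filtering while loading, so that every later pick is just a list lookup. The following is a rough sketch under that assumption; FilteredDict and its methods are hypothetical names, and the constraint check is passed in as a Predicate<String>.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Illustrative sketch; FilteredDict is a hypothetical name.
class FilteredDict {
    private final Random random = new Random();
    private final List<String> words = new ArrayList<>();

    // Keeps only the words that pass the predicate; the filter runs once.
    FilteredDict(Iterable<String> source, Predicate<String> accept) {
        for (String w : source) {
            if (accept.test(w)) {
                words.add(w);
            }
        }
    }

    int size() {
        return words.size();
    }

    // Returns a uniformly random accepted word, or null if none was accepted.
    String nextWord() {
        return words.isEmpty() ? null : words.get(random.nextInt(words.size()));
    }
}

class FilteredDictDemo {
    public static void main(String[] args) {
        List<String> source = Arrays.asList("house", "worker", "tense", "zebra", "noon");
        String allowed = "aeouhtns";
        FilteredDict dict = new FilteredDict(source,
                w -> w.chars().allMatch(ch -> allowed.indexOf(ch) >= 0));
        System.out.println(dict.size() + " usable words, e.g. " + dict.nextWord());
    }
}
```

With a real wordlist, the source could be the result of Files.readAllLines(Paths.get(filename)), which returns a List<String>.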