Java: Get random line from a big file

I've seen how to get a random line from a text file, but the method stated there (the accepted answer) runs horrendously slowly. It is very slow on my 598KB text file, and still slow on a cut-down version of that file, which keeps only one out of every 20 lines and is 20KB. I never get past the "a" section (it's a wordlist).

The original file has 64141 lines; the shortened one has 2138 lines. To generate these files, I took the Linux Mint 11 /usr/share/dict/american-english wordlist and used grep to remove anything with uppercase or an apostrophe ( grep -v [[:upper:]] | grep -v \' ).

The code I'm using is:

String result = null;
final Random rand = new Random();
int n = 0;
for (final Scanner sc = new Scanner(wordList); sc.hasNext();) {
    n++;
    if (rand.nextInt(n) == 0) {
        final String line = sc.nextLine();
        boolean isOK = true;
        for (final char c : line.toCharArray()) {
            if (!(constraints.isAllowed(c))) {
                isOK = false;
                break;
            }
        }
        if (isOK) {
            result = line;
        }
        System.out.println(result);
    }
}
return result;

which is slightly adapted from Itay's answer.

The object constraints is a KeyboardConstraints, which basically has the one method isAllowed(char):

public boolean isAllowed(final char key) {
    if (allAllowed) {
        return true;
    } else {
        return allowedKeys.contains(key);
    }
}

where allowedKeys and allAllowed are provided in the constructor. The constraints variable used here has "aeouhtns".toCharArray() as its allowedKeys, with allAllowed off.

Essentially, what I want the method to do is pick a random word that satisfies the constraints (e.g. for these constraints, "houses" would work, but not "worker", because "w" is not in "aeouhtns".toCharArray() ).
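To make the constraint check concrete, here is a minimal sketch of what such a class might look like. The class name KeyboardConstraintsSketch, the Set<Character> field, and the allows(String) helper are assumptions for illustration; the question only shows isAllowed(char) and says that allowedKeys and allAllowed come from the constructor.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the KeyboardConstraints class described above.
class KeyboardConstraintsSketch {
    private final Set<Character> allowedKeys = new HashSet<>();
    private final boolean allAllowed;

    KeyboardConstraintsSketch(char[] keys, boolean allAllowed) {
        for (char c : keys) {
            this.allowedKeys.add(c);
        }
        this.allAllowed = allAllowed;
    }

    // Same logic as the isAllowed(char) method shown in the question.
    boolean isAllowed(final char key) {
        return allAllowed || allowedKeys.contains(key);
    }

    // Assumed helper: true if every character of the word is allowed.
    boolean allows(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (!isAllowed(word.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        KeyboardConstraintsSketch c =
                new KeyboardConstraintsSketch("aeouhtns".toCharArray(), false);
        System.out.println(c.allows("houses")); // true
        System.out.println(c.allows("worker")); // false, 'w' is not allowed
    }
}
```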

How can I do this?

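You have a bug in your implementation. You should read the line before you choose a random number. As written, the scanner only advances when the random draw succeeds, so n keeps growing while lines are consumed less and less often. Change this: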

n++;
if (rand.nextInt(n) == 0) {
    final String line = sc.nextLine();

To this (as in the original answer):

n++;
final String line = sc.nextLine();
if (rand.nextInt(n) == 0) {
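As a quick check of why this ordering gives a uniform pick (a sketch, assuming every one of the N lines is eligible): line k becomes the current result with probability 1/k, and it survives each later line j with probability 1 - 1/j, so

```latex
\Pr[\text{line } k \text{ is returned}]
  = \frac{1}{k} \prod_{j=k+1}^{N} \left(1 - \frac{1}{j}\right)
  = \frac{1}{k} \prod_{j=k+1}^{N} \frac{j-1}{j}
  = \frac{1}{k} \cdot \frac{k}{N}
  = \frac{1}{N}.
```

The product telescopes, so every line is returned with the same probability 1/N. This only holds if n is incremented exactly once for each line that is considered.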

You should also check the constraints before drawing a random number. If a line fails the constraints it should be ignored, something like this:

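n++;

// Skip ahead to the next line that satisfies the constraints.
String line;
do {
    if (!sc.hasNext()) { return result; }
    line = sc.nextLine();
} while (!meetsConstraints(line));

// Keep this line with probability 1/n.
if (rand.nextInt(n) == 0) {
    result = line;
}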
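Putting the two fixes together, a complete method might look roughly like the sketch below. The names ReservoirPick, pickRandom, and onlyUses are made up for illustration, and a Predicate<String> stands in for the constraints check; the lines are passed as an Iterable so the example runs without a file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Illustrative sketch; the names here are assumptions, not from the original post.
class ReservoirPick {
    // Returns one line chosen uniformly among those accepted, or null if none is accepted.
    static String pickRandom(Iterable<String> lines, Predicate<String> accept, Random rand) {
        String result = null;
        int n = 0; // number of accepted lines seen so far
        for (String line : lines) {
            if (!accept.test(line)) {
                continue; // rejected lines do not count towards n
            }
            n++;
            if (rand.nextInt(n) == 0) { // replace the current pick with probability 1/n
                result = line;
            }
        }
        return result;
    }

    // Assumed stand-in for the constraints check: every character must be in `allowed`.
    static boolean onlyUses(String allowed, String word) {
        for (int i = 0; i < word.length(); i++) {
            if (allowed.indexOf(word.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("houses", "worker", "tattoo", "zebra", "neon");
        // Prints one of "houses", "tattoo" or "neon".
        System.out.println(pickRandom(words, w -> onlyUses("aeouhtns", w), new Random()));
    }
}
```

Because a line is read on every iteration, this makes a single pass over the input no matter how the random draws turn out.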

I would read in all the lines, save them somewhere, and then select a random line from that. This takes a trivial amount of time, because a single file of less than 1 MB is a trivial size these days.

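import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

public class Main {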
    public static void main(String... args) throws IOException {
        long start = System.nanoTime();
        RandomDict dict = RandomDict.load("/usr/share/dict/american-english");
        final int count = 1000000;
        for (int i = 0; i < count; i++)
            dict.nextWord();
        long time = System.nanoTime() - start;
        System.out.printf("Took %.3f seconds to load and find %,d random words.", time / 1e9, count);
    }
}

class RandomDict {
    public static final String[] NO_STRINGS = {};
    final Random random = new Random();
    final String[] words;

    public RandomDict(String[] words) {
        this.words = words;
    }

    public static RandomDict load(String filename) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(filename));
        Set<String> words = new LinkedHashSet<String>();
        try {
            for (String line; (line = br.readLine()) != null; ) {
                if (line.indexOf('\'') >= 0) continue;
                words.add(line.toLowerCase());
            }
        } finally {
            br.close();
        }
        return new RandomDict(words.toArray(NO_STRINGS));
    }

    public String nextWord() {
        return words[random.nextInt(words.length)];
    }
}

prints

Took 0.091 seconds to load and find 1,000,000 random words.
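The in-memory approach above does not apply the keyboard constraints from the question. One way to combine the two, as a rough sketch, is to filter once while loading and then pick from the surviving words. The class FilteredDict and its usesOnly helper are hypothetical names, and the allowed keys are passed as a plain String for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative sketch; FilteredDict is an assumed name, not part of the answer above.
class FilteredDict {
    private final List<String> words = new ArrayList<>();
    private final Random random = new Random();

    // Keeps only the lines whose characters all appear in allowedKeys.
    FilteredDict(Iterable<String> lines, String allowedKeys) {
        for (String line : lines) {
            if (usesOnly(line, allowedKeys)) {
                words.add(line);
            }
        }
    }

    static boolean usesOnly(String word, String allowedKeys) {
        for (int i = 0; i < word.length(); i++) {
            if (allowedKeys.indexOf(word.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }

    int size() {
        return words.size();
    }

    // Picks uniformly among the words that passed the filter.
    String nextWord() {
        if (words.isEmpty()) {
            throw new IllegalStateException("no word satisfies the constraints");
        }
        return words.get(random.nextInt(words.size()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("houses", "worker", "tattoo", "zebra", "neon");
        FilteredDict dict = new FilteredDict(lines, "aeouhtns");
        System.out.println(dict.size() + " words kept, e.g. " + dict.nextWord());
    }
}
```

To use the real wordlist, the lines could come from java.nio.file.Files.readAllLines(Paths.get("/usr/share/dict/american-english")). The filtering cost is paid once at load time, and each later pick is a single array lookup.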
