
How to speed up parsing of a file into a HashMap

I wrote the following function that parses a big text file (around 2 GB) into a Map line by line, effectively counting the occurrences of each word. I am interested only in words (lowercased to avoid duplicate entries), with no punctuation or whitespace. However, executing the following code on the big file takes almost 3 minutes. I am wondering why, and whether there is a way to speed it up.

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;

public class Stream {

Map<String, Integer> map = new HashMap<>();

public void getLines() {

    try (BufferedReader fileReader = new BufferedReader(new FileReader("resources/hugeFile"))) {
        String line;
        while ((line = fileReader.readLine()) != null) {
            String[] words = line.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+");
            for (int i = 0; i < words.length; i++) {
                if (map.get(words[i]) == null) {
                    map.put(words[i], 1);
                }
                else {
                    int newValue = Integer.valueOf(String.valueOf(map.get(words[i])));
                    newValue++;
                    map.put(words[i], newValue);
                }
            }
        }

    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
}

First of all, if you're serious about optimization, you have to measure the performance, because many of the "improvements" that look like improvements turn out to change nothing or even make things worse. In many cases the compiler optimizes code better than a human would. So you have to run benchmarks; please see the following question on the topic:

How do I write a correct micro-benchmark in Java?

I'm posting two sketches of code below. They are really just sketches, meant to give a rough idea; I have neither tested nor benchmarked them.

One hint is that you access the map too often: you check it with map.get and then conditionally put a value with map.put. You can use putIfAbsent or computeIfAbsent instead. The way you increment the existing value can also be improved: the round trip through Integer.valueOf(String.valueOf(...)) is unnecessary, and I'd use a mutable AtomicInteger instead of an immutable Integer in this case. So I'd suggest the following:

    Map<String, AtomicInteger> map = new HashMap<>();

    Consumer<String> countWords = word -> map.computeIfAbsent(word, (w) -> new AtomicInteger(0)).incrementAndGet();

    try (BufferedReader fileReader = new BufferedReader(new FileReader("resources/hugeFile"))) {
        String line;
        while ((line = fileReader.readLine()) != null) {
            splitAndConsumeWords(line, countWords);
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
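If you'd rather keep plain Integer values instead of switching to AtomicInteger, Map.merge (available since Java 8) collapses the get-then-put dance into a single call. A minimal, runnable sketch (the class name and sample words are mine, just for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class MergeCount {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        for (String word : new String[] {"foo", "bar", "foo"}) {
            // inserts 1 if the key is absent, otherwise applies Integer::sum to the old value
            map.merge(word, 1, Integer::sum);
        }
        System.out.println(map.get("foo")); // prints 2
    }
}
```

Note that merge still allocates a new boxed Integer on every update, so the AtomicInteger variant may still come out ahead; only a benchmark can tell.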

Next, you used line.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+") to convert the string to lowercase, retain only letters and spaces, and split the string into words. I don't know for sure without a benchmark, but I suspect this is the most time-consuming operation in your code. And it is not a big deal to rewrite it without regular expressions: all you need is to iterate over the characters of the string, convert each to lowercase, and either append it to the current word or throw it away. Here's how I'd do it.

I'd create an array that maps every character to its replacement: the same character for a-z or space, the lowercase counterpart for A-Z. All other characters map to 0, meaning they should be thrown away:

private static final char[] ONLY_LETTERS_TO_LOWERCASE = new char[65536]; // Character.MAX_VALUE + 1, so every char is a valid index

static {
    ONLY_LETTERS_TO_LOWERCASE[' '] = ' ';
    for (char c = 'a'; c <= 'z'; c++) {
        ONLY_LETTERS_TO_LOWERCASE[c] = c;
    }
    for (char c = 'A'; c <= 'Z'; c++) {
        ONLY_LETTERS_TO_LOWERCASE[c] = Character.toLowerCase(c);
    }
}

Then you simply look up a replacement for each character and build words:

public static void splitAndConsumeWords(String line, Consumer<String> wordsConsumer) {

    char[] characters = line.toCharArray();
    StringBuilder sb = new StringBuilder(16);
    for (int index = 0; index < characters.length; index++) {
        char ch = characters[index];
        char replacementCh = ONLY_LETTERS_TO_LOWERCASE[ch];
        // If we encounter a space
        if (replacementCh == ' ') {
            // And there is a word in string builder
            if (sb.length() > 0) {
                // Send this word to the consumer
                wordsConsumer.accept(sb.toString());
                // Reset the string builder
                sb.setLength(0);
            }
        } else if (replacementCh != 0) {
            sb.append(replacementCh);
        }
    }
    // Send the last word to the consumer
    if (sb.length() > 0) {
        wordsConsumer.accept(sb.toString());
    }
}
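Putting the lookup table and the splitter together, here is a self-contained demo (the class name and sample input are mine; note the table is sized 65536 so that every possible char value is a valid index):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class WordSplitDemo {
    // One slot per possible char value (0 .. Character.MAX_VALUE)
    private static final char[] ONLY_LETTERS_TO_LOWERCASE = new char[65536];

    static {
        ONLY_LETTERS_TO_LOWERCASE[' '] = ' ';
        for (char c = 'a'; c <= 'z'; c++) {
            ONLY_LETTERS_TO_LOWERCASE[c] = c;
        }
        for (char c = 'A'; c <= 'Z'; c++) {
            ONLY_LETTERS_TO_LOWERCASE[c] = Character.toLowerCase(c);
        }
    }

    public static void splitAndConsumeWords(String line, Consumer<String> wordsConsumer) {
        StringBuilder sb = new StringBuilder(16);
        for (int index = 0; index < line.length(); index++) {
            char replacementCh = ONLY_LETTERS_TO_LOWERCASE[line.charAt(index)];
            if (replacementCh == ' ') {
                // A space ends the current word, if any
                if (sb.length() > 0) {
                    wordsConsumer.accept(sb.toString());
                    sb.setLength(0);
                }
            } else if (replacementCh != 0) {
                sb.append(replacementCh);
            }
        }
        // Flush the last word
        if (sb.length() > 0) {
            wordsConsumer.accept(sb.toString());
        }
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        splitAndConsumeWords("Hello, World! 42 times", words::add);
        System.out.println(words); // prints [hello, world, times]
    }
}
```

Punctuation and digits map to 0 and are silently dropped, which matches what the original replaceAll("[^a-z ]", "") did.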

An alternative to the ONLY_LETTERS_TO_LOWERCASE mapping table would be an if statement like:

        if (ch >= 'a' && ch <= 'z' || ch == ' ') {
            replacementCh = ch;
        } else if (ch >= 'A' && ch <= 'Z') {
            replacementCh = Character.toLowerCase(ch);
        } else {
            replacementCh = 0;
        }

I don't know for sure which will work better; I think the array lookup should be faster, but I'm not certain. This is why you'd ultimately need benchmarking.
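Before benchmarking either variant, one cheap sanity check is to verify that the hand-rolled splitter and the original regex pipeline agree on the same input. A sketch (the class, method names, and sample line are mine):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitEquivalenceCheck {
    // Branch-based replacement, equivalent to the lookup-table approach
    static char replacement(char ch) {
        if (ch >= 'a' && ch <= 'z' || ch == ' ') {
            return ch;
        } else if (ch >= 'A' && ch <= 'Z') {
            return Character.toLowerCase(ch);
        }
        return 0;
    }

    static List<String> manualSplit(String line) {
        List<String> words = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char r = replacement(line.charAt(i));
            if (r == ' ') {
                if (sb.length() > 0) {
                    words.add(sb.toString());
                    sb.setLength(0);
                }
            } else if (r != 0) {
                sb.append(r);
            }
        }
        if (sb.length() > 0) {
            words.add(sb.toString());
        }
        return words;
    }

    static List<String> regexSplit(String line) {
        // The original pipeline from the question, trimmed to avoid a leading empty token
        String cleaned = line.toLowerCase().replaceAll("[^a-z ]", "").trim();
        return cleaned.isEmpty() ? new ArrayList<>()
                                 : new ArrayList<>(Arrays.asList(cleaned.split("\\s+")));
    }

    public static void main(String[] args) {
        String line = "The quick, brown FOX jumps -- over 2 lazy dogs!";
        System.out.println(manualSplit(line).equals(regexSplit(line))); // prints true
    }
}
```

Once the two agree, you can benchmark them against each other on the real file with a clear conscience.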
