
How can I speed up my Java text file parser?

I am reading about 600 text files (about 400 MB in total), parsing each file individually and adding all the terms to a map so I can know the frequency of each word across the 600 files.

My parser function includes the following steps (in order):

  • find the text between two tags, which is the relevant text to read in each file.
  • lowercase all the text.
  • split the string with String.split, using multiple delimiters.
  • build an ArrayList of words like "aaa-aa", add them to the split result, and remove "aaa" and "aa" from the String[]. (I did this because I wanted "-" to be a delimiter, but I also wanted "aaa-aa" to be a single word, not "aaa" and "aa".)
  • put the words from the String[] into a Map = new HashMap ... (word, frequency).
  • print everything.

It takes about 8 minutes and 48 seconds on a dual-core 2.2 GHz machine with 2 GB of RAM. I would like advice on how to speed this process up. Should I expect it to be this slow? And if possible, how can I find out (in NetBeans) which functions take the most time to execute?

Unique words found: 398752.

CODE:

File file = new File(dir);
String[] files = file.list();

for (int i = 0; i < files.length; i++) {
    BufferedReader br = new BufferedReader(
        new InputStreamReader(
            new BufferedInputStream(
                new FileInputStream(dir + files[i])), encoding));
    try {
        String line;
        while ((line = br.readLine()) != null) {
            parsedString = parseString(line); // parse the string
            m = stringToMap(parsedString, m);
        }
    } finally {
        br.close();
    }
}

EDIT: Check this:

[profiler screenshot]

I don't know what to conclude.


EDIT 2: 80% OF THE TIME IS SPENT IN THIS FUNCTION

    public String[] parseString(String sentence) {
        // separators: ,:;'"\/<>()[]*~^ºª+&%$ etc.
        String[] parts = sentence.toLowerCase().split("[,\\s\\-:\\?\\!\\«\\»\\'\\´\\`\\\"\\.\\\\\\/()<>*º;+&ª%\\[\\]~^]");

        // save the hyphenated words, aaa-bbb, as Map<aaa, bbb>
        Map<String, String> o = new HashMap<String, String>();

        Pattern pattern = Pattern.compile("(?<![A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû-])[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+-[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+(?![A-Za-z-])");
        Matcher matcher = pattern.matcher(sentence);

        // Find all matches like "aaa-bb" or "bbb-cc" and put them in the map,
        // so the hyphenated words can later be added to the original map and
        // the single halves ("aaa", "aa") removed.
        while (matcher.find()) {
            String[] tempo = matcher.group().split("-");
            o.put(tempo[0], tempo[1]);
        }
        //System.out.println("words: " + o);

        ArrayList<String> temp = new ArrayList<String>();
        temp.addAll(Arrays.asList(parts));

        for (Map.Entry<String, String> entry : o.entrySet()) {
            String key = entry.getKey();
            String value = entry.getValue();
            temp.add(key + "-" + value);
            if (temp.indexOf(key) != -1) {
                temp.remove(temp.indexOf(key));
            }
            if (temp.indexOf(value) != -1) {
                temp.remove(temp.indexOf(value));
            }
        }

        String[] strArray = new String[temp.size()];
        temp.toArray(strArray);
        return strArray;
    }

600 files, each about 0.5 MB.

EDIT 3: The pattern is no longer compiled each time a line is read. The new images are:

[profiler screenshot]

2: [profiler screenshot]

Be sure to increase your heap size, if you haven't already, using -Xmx. For this app, the impact may be striking.
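For example (the class name and heap size here are illustrative; pick a heap that fits your machine):

```shell
# Give the JVM a 1 GiB maximum heap instead of the default
java -Xmx1024m WordFrequencyCounter
```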

The parts of your code that are likely to have the largest performance impact are the ones that are executed the most - which are the parts you haven't shown.

Update after memory screenshot

Look at all those Pattern$6 objects in the screenshot. I think you're recompiling the pattern a lot - maybe for every line. That would take a lot of time.

Update 2 - after code added to question.

Yup - two patterns are compiled on every line: the explicit one, and also the "-" in the split (much cheaper, of course). I wish split() had never been added to String without a variant that takes a compiled Pattern as an argument. I see some other things that could be improved, but nothing else as big as the compile. Just compile the pattern once, outside this function, perhaps as a static class member.
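A minimal sketch of that fix; the pattern here is a simplified stand-in for the one in the question, and the class and method names are made up for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Parser {
    // Compiled once, when the class is loaded - not on every call
    private static final Pattern HYPHEN_WORD =
        Pattern.compile("\\p{L}+-\\p{L}+"); // simplified stand-in pattern

    public static int countHyphenWords(String sentence) {
        Matcher m = HYPHEN_WORD.matcher(sentence);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countHyphenWords("aaa-aa and bbb-cc plus plain words")); // 2
    }
}
```

The same idea applies to the split: pre-compile the delimiter pattern once and call `DELIMS.split(line)` instead of `line.split(...)`.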

If you aren't already doing it, use BufferedInputStream and BufferedReader to read the files. Double-buffering like that is measurably better than using BufferedInputStream or BufferedReader alone. E.g.:

BufferedReader rdr = new BufferedReader(
    new InputStreamReader(
        new BufferedInputStream(
            new FileInputStream(aFile)
        )
        /* add an encoding arg here (e.g., ', "UTF-8"') if appropriate */
    )
);

If you post relevant parts of your code, there'd be a chance we could comment on how to improve the processing.

EDIT:

Based on your edit, here are a couple of suggestions:

  1. Compile the pattern once and save it as a static variable, rather than compiling it every time you call parseString.
  2. Store the values of temp.indexOf(key) and temp.indexOf(value) when you first compute them, and use the stored values instead of calling indexOf a second time.
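Suggestion 2 could look like this (one list scan per word instead of two; the sample data is made up):

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOfDemo {
    public static void main(String[] args) {
        List<String> temp = new ArrayList<>();
        temp.add("aaa");
        temp.add("aa");
        temp.add("bbb");

        // Cache the index instead of calling indexOf twice
        String key = "aaa";
        int keyIdx = temp.indexOf(key);
        if (keyIdx != -1) {
            temp.remove(keyIdx);
        }

        // Or, equivalently, List.remove(Object) scans the list only once
        temp.remove("aa");

        System.out.println(temp); // [bbb]
    }
}
```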

Try to use a single regex that has a group matching each word within the tags - then one regex could be run over your entire input and there would be no separate "split" stage.
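A sketch of that approach: instead of splitting on delimiters, match the words themselves, with the hyphen handled inside the pattern. The pattern here is simplified relative to the question's accented-character classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordExtractor {
    // One pattern that matches whole words, optionally hyphenated,
    // so no separate split-and-fix-up stage is needed.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:-\\p{L}+)?");

    public static List<String> words(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(line.toLowerCase());
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Aaa-aa, bbb; cc")); // [aaa-aa, bbb, cc]
    }
}
```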

Otherwise your approach seems reasonable, although I don't understand what you mean by "get the String [] ..." - I thought you were using an ArrayList. In any event, try to minimize object creation, for both construction cost and garbage-collection cost.

Is it just the parsing that's taking so long, or is it the file reading as well?

For the file reading, you can probably speed that up by reading the files on multiple threads. But first step is to figure out whether it's the reading or the parsing that's taking all the time so you can address the right issue.
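A rough sketch of reading the files on a thread pool. The directory handling, UTF-8 encoding, `*.txt` glob, and the `\W+` tokenizer are all assumptions for the example, not taken from the question:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelCounter {

    // Simple \W+ tokenizer for brevity; the question's real delimiter
    // set would go in its place.
    static void countLine(String line, ConcurrentHashMap<String, Integer> counts) {
        for (String w : line.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        // ConcurrentHashMap lets all worker threads update one shared count map
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();

        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path file : files) {
                pool.submit(() -> {
                    try {
                        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                            countLine(line, counts);
                        }
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                });
            }
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        System.out.println("unique words: " + counts.size());
    }
}
```

Whether this helps depends on where the time goes: if the job is CPU-bound in parsing, a pool sized to the core count is the right shape; if it is I/O-bound on a single disk, more threads may not buy much.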

Run the code through the NetBeans profiler and find out where it is taking the most time (right-click the project and select Profile), so you make sure you're not wasting effort on the wrong part.

Nothing in the code that you have shown us is an obvious source of performance problems. The problem is likely to be something to do with the way that you are parsing the lines or extracting the words and putting them into the map. If you want more advice you need to post the code for those methods, and the code that declares / initializes the map.

My general advice would be to profile the application and see where the bottlenecks are, and use that information to figure out what needs to be optimized.

@Ed Staub's advice is also sound. Running an application with a heap that is too small can result in serious performance problems.

It looks like it's spending most of its time in regular expressions. I would first try writing the code without using a regular expression, and then use multiple threads if the process still appears to be CPU-bound.
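A regex-free tokenizer can be a plain character scan. This sketch treats letters and hyphens as word characters, which is a simplification of the question's delimiter set (it also keeps leading or trailing hyphens, which the question's pattern would reject):

```java
import java.util.ArrayList;
import java.util.List;

public class NoRegexTokenizer {

    // Split on anything that is not a letter or a hyphen, without regex
    static List<String> tokenize(String line) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isLetter(c) || c == '-') {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Aaa-aa, bbb; cc")); // [aaa-aa, bbb, cc]
    }
}
```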

For the counter, I would look at using TObjectIntHashMap to reduce the overhead of counting. I would use only one map, rather than creating an array of word counts which is then used to build another map - that could be a significant waste of time.
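TObjectIntHashMap is from the GNU Trove library. With only JDK collections, the "count directly into one map, no intermediate String[]" idea looks like this (sample words made up):

```java
import java.util.HashMap;
import java.util.Map;

public class Counter {
    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        String[] words = {"aaa", "bb", "aaa"};

        // Count straight into a single map as words are produced,
        // instead of building a String[] first and mapping it afterwards
        for (String w : words) {
            freq.merge(w, 1, Integer::sum);
        }

        System.out.println(freq.get("aaa")); // 2
    }
}
```

Trove's primitive-int values avoid the Integer boxing that HashMap<String, Integer> pays for, which is the overhead the suggestion is about.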

Pre-compile the pattern instead of compiling it on every pass through the method, and get rid of the double buffering: use new BufferedReader(new FileReader(...)).
