
Slower than expected Java regex performance

I have been tasked with reading large CSV files (300k+ records) and applying regexp patterns to each record. I have always been a PHP developer and never really tried any other language, but decided I should take the plunge and attempt this in Java, which I assumed would be much faster.

In fact, just reading the CSV file line by line was 3x faster in Java. However, when I applied the regexp requirements, the Java implementation proved to take 10-20% longer than the PHP script.

It is quite possible that I did something wrong in Java, because I just learned it as I went today. Below are the two scripts; any advice would be greatly appreciated. I would really like not to give up on Java for this particular project.

PHP CODE

<?php
$bgtime=time();
$patterns =array(
    "/SOME REGEXP/",
    "/SOME REGEXP/",                    
    "/SOME REGEXP/",    
    "/SOME REGEXP/" 
);   

$fh = fopen('largeCSV.txt','r');
while($currentLineString = fgetcsv($fh, 10000, ","))
{
    foreach($patterns AS $pattern)
    {
        preg_match_all($pattern, $currentLineString[6], $matches);
    }
}
fclose($fh);
print "Execution Time: ".(time()-$bgtime);

?>

JAVA CODE

import au.com.bytecode.opencsv.CSVReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.ArrayList;

public class testParser
{
    public static void main(String[] args)
    {
        long start = System.currentTimeMillis();


        String[] rawPatterns = {
                    "SOME REGEXP",
                    "SOME REGEXP",                    
                    "SOME REGEXP",    
                    "SOME REGEXP"    
        };

        ArrayList<Pattern> compiledPatternList = new ArrayList<Pattern>();        
        for(String patternString : rawPatterns)
        {
            Pattern compiledPattern = Pattern.compile(patternString);
            compiledPatternList.add(compiledPattern);
        }


        try{
            String fileName="largeCSV.txt";
            CSVReader reader = new CSVReader(new FileReader(fileName));

            String[] header = reader.readNext();
            String[] nextLine;
            String description;

            while( (nextLine = reader.readNext()) != null) 
            {
                description = nextLine[6];
                for(Pattern compiledPattern : compiledPatternList)
                {
                    Matcher m = compiledPattern.matcher(description);
                    while(m.find()) 
                    {
                        //System.out.println(m.group(0));
                    }                
                }
            }
        }

        catch(IOException ioe)
        {
            System.out.println("Blah!");
        }

        long end = System.currentTimeMillis();

        System.out.println("Execution time was "+((end-start)/1000)+" seconds.");
    }
}

Using a buffered reader might improve performance quite a bit:

CSVReader reader = new CSVReader(new BufferedReader(new FileReader(fileName)));

I don't see anything glaringly wrong with your code. Try isolating the performance bottleneck using a profiler. I find the NetBeans profiler very user-friendly.

EDIT: Why speculate? Profile the app and get a detailed report of where the time is spent. Then work to resolve the inefficient areas. See http://profiler.netbeans.org/ for more information.

EDIT2: OK, I got bored and profiled this. My code is identical to yours and parsed a CSV file with 1,000 identical lines as follows:

SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP

Here are the results (obviously yours will differ, as my regular expressions are trivial). Even so, it's plain to see that the regex processing is not your main area of concern.

[image: profiler results]

Interestingly, if I apply a BufferedReader, performance improves by a whopping 18% (see below).

[image: profiler results with BufferedReader]

A few points to note here.

  1. You start measuring the time even before the patterns are compiled. Pattern.compile is a relatively expensive operation and may consume more time if the pattern is complex. Why not start measuring after the compilation step?

  2. I'm not sure how efficient the CSVReader class is.

  3. Rather than printing the matched results directly in the main thread (System.out.println is blocking and expensive), you could perhaps delegate printing to a different thread.
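
To illustrate point 1, here is a minimal sketch that times compilation and matching separately. The class name, the two sample patterns, and the generated lines are hypothetical stand-ins for the real regexes and CSV column, and CSVReader is left out so that only the regex work is measured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompileThenTime {

    /** Compiles every raw pattern once, up front. */
    static List<Pattern> compileAll(String[] rawPatterns) {
        List<Pattern> compiled = new ArrayList<>();
        for (String raw : rawPatterns) {
            compiled.add(Pattern.compile(raw));
        }
        return compiled;
    }

    /** Counts all matches of all patterns across all lines. */
    static long countMatches(List<Pattern> patterns, List<String> lines) {
        long total = 0;
        for (String line : lines) {
            for (Pattern p : patterns) {
                Matcher m = p.matcher(line);
                while (m.find()) {
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical patterns and data, not the ones from the question.
        String[] rawPatterns = {"\\d+", "[a-z]+"};
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 300_000; i++) {
            lines.add("item " + i + " costs 42 dollars");
        }

        // Time the compilation step on its own.
        long compileStart = System.nanoTime();
        List<Pattern> patterns = compileAll(rawPatterns);
        long compileNanos = System.nanoTime() - compileStart;

        // Time the matching loop separately.
        long matchStart = System.nanoTime();
        long total = countMatches(patterns, lines);
        long matchNanos = System.nanoTime() - matchStart;

        System.out.println("Compile: " + compileNanos / 1_000_000.0 + " ms");
        System.out.println("Match:   " + matchNanos / 1_000_000.0 + " ms");
        System.out.println("Matches: " + total);
    }
}
```

With only a handful of patterns, the compile time is likely to be small next to the matching time over 300k records, but measuring the two separately shows that directly instead of leaving it to guesswork.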

Several things:

  1. The regex has to be compiled only once, and that should happen at server startup, so it doesn't really matter for performance while it's running.

  2. And most importantly, you're writing a completely invalid benchmark for a long-running Java program. You're most certainly loading several classes while benchmarking, and overall you are testing only the interpreter's performance and NOT the JIT, which will obviously result in much worse performance. See this excellent post for how to write a valid benchmark in Java. Most certainly this will remedy all alleged performance problems in this case.
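
As a rough illustration of the warm-up effect, the sketch below runs the same matching loop several times in one JVM and reports each round separately. The class name, the pattern, and the data are hypothetical, and this is not a rigorous benchmark; a dedicated harness such as JMH is generally the more reliable route.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WarmupTiming {

    /** Counts every match of the pattern across all lines. */
    static int countMatches(Pattern pattern, String[] lines) {
        int count = 0;
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            while (m.find()) {
                count++;
            }
        }
        return count;
    }

    /** Runs the same workload several times and records each round's duration. */
    static long[] timeRounds(Pattern pattern, String[] lines, int rounds) {
        long[] nanos = new long[rounds];
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            int matches = countMatches(pattern, lines);
            nanos[r] = System.nanoTime() - start;
            // Use the result so the work cannot be discarded as dead code.
            if (matches < 0) {
                throw new IllegalStateException("negative match count");
            }
        }
        return nanos;
    }

    public static void main(String[] args) {
        // Hypothetical pattern and data, not the ones from the question.
        Pattern pattern = Pattern.compile("\\d+");
        String[] lines = new String[300_000];
        for (int i = 0; i < lines.length; i++) {
            lines[i] = "record " + i + " value " + (i * 7);
        }

        long[] nanos = timeRounds(pattern, lines, 10);
        for (int r = 0; r < nanos.length; r++) {
            System.out.println("Round " + r + ": " + nanos[r] / 1_000_000.0 + " ms");
        }
    }
}
```

If the later rounds come out noticeably faster than the first, the single-pass timing in the question was probably measuring class loading and interpreted code as much as the regex engine itself.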

I would recommend:

  • as somebody else has suggested, profile to see where the actual bottleneck is;
  • tell us what the actual regexes are: it may be that you're using some specific subpattern that isn't very efficient in Java's implementation.

It's quite possible that parts of PHP's regex engine are more optimised than Java's for specific expression types, and/or there's a way to optimise the actual expression that you're using.
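
Short of a full profiler run, one way to narrow this down is to time each pattern on its own. The following is a minimal sketch with a hypothetical class name and made-up patterns; it assumes the pattern strings are distinct, since they are used as map keys.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PerPatternTiming {

    /** Returns the time in nanoseconds spent matching each pattern over all lines. */
    static Map<String, Long> timeEachPattern(String[] rawPatterns, String[] lines) {
        // LinkedHashMap keeps the patterns in the order they were given.
        Map<String, Long> nanosByPattern = new LinkedHashMap<>();
        for (String raw : rawPatterns) {
            Pattern p = Pattern.compile(raw);
            long start = System.nanoTime();
            for (String line : lines) {
                Matcher m = p.matcher(line);
                while (m.find()) {
                    // Matches are discarded; only the elapsed time matters here.
                }
            }
            nanosByPattern.put(raw, System.nanoTime() - start);
        }
        return nanosByPattern;
    }

    public static void main(String[] args) {
        // Hypothetical patterns and data, not the ones from the question.
        String[] rawPatterns = {"\\d+", "[a-z]+", "(\\w+)\\s+(\\w+)"};
        String[] lines = new String[100_000];
        for (int i = 0; i < lines.length; i++) {
            lines[i] = "order " + i + " shipped to customer " + (i % 97);
        }

        for (Map.Entry<String, Long> e : timeEachPattern(rawPatterns, lines).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue() / 1_000_000.0 + " ms");
        }
    }
}
```

A pattern that takes far longer than the others is the one worth posting or rewriting. The JIT warm-up caveat still applies, so patterns timed first may look slower than they really are.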
