
Slow spark application - java

I am trying to create a Spark application that takes a dataset of (lat, long, timestamp) points and increments a cell's count if the point falls inside a grid cell. The grid is made up of 3D cells, with longitude and latitude on the horizontal axes and time as the z-axis.

Now I have completed the application and it does what it's supposed to, but it takes hours to scan the whole dataset (~9 GB). My cluster consists of 3 nodes with 4 cores and 8 GB of RAM each, and I am currently using 6 executors with 1 core and 2 GB each.

I am guessing that I can optimize the code quite a bit, but is there a big mistake in my code that causes this delay?

    //Create a JavaPairRDD with tuple elements. For each String line of lines we split the string
    //and assign the latitude, longitude and timestamp of each line to sdx, sdy and sdt. Then we check if the data point of
    //that line is contained in a cell of the centroids list. If it is, then a new tuple is returned
    //with key the latitude, longitude and timestamp (split by ",") of that cell and value 1.

    JavaPairRDD<String, Integer> pairs = lines.mapToPair(x -> {


        String sdx = x.split(" ")[2];
        String sdy = x.split(" ")[3];
        String sdt = x.split(" ")[0];

        double dx = Double.parseDouble(sdx);
        double dy = Double.parseDouble(sdy);
        int dt = Integer.parseInt(sdt);

        List<Integer> t = brTime.getValue();
        List<Point2D.Double> p = brCoo.getValue();

        double dist = brDist.getValue();
        int dur = brDuration.getValue();

        for(int timeCounter=0; timeCounter<t.size(); timeCounter++) {
            for ( int cooCounter=0; cooCounter < p.size(); cooCounter++) {

                double cx = p.get(cooCounter).getX();
                double cy = p.get(cooCounter).getY();
                int ct = t.get(timeCounter);

                String scx = Double.toString(cx);
                String scy = Double.toString(cy);
                String sct = Integer.toString(ct);

                if (dx > (cx-dist) && dx <= (cx+dist)) {
                    if (dy > (cy-dist) && dy <= (cy+dist)) {
                        if (dt > (ct-dur) && dt <= (ct+dur)) {

                            return new Tuple2<String, Integer>(scx+","+scy+","+sct,1);
                        }
                    }
                }
            }
        }
        return new Tuple2<String, Integer>("Out Of Bounds",1);
    });

One of the biggest factors that can contribute to costs when running a Spark map like this is data access from outside the RDD context. In your case there are at least four such variables being read for every record: brTime, brCoo, brDist, and brDuration. It also appears that you're parsing each line with three separate String#split calls rather than splitting once and reusing the result. Finally, scx, scy, and sct are all computed on every loop iteration, even though they're only returned if their numeric counterparts pass the series of checks, which means wasted CPU cycles and extra GC.
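For illustration, here is a rough sketch of the same mapToPair with those points addressed: the line is split once, the broadcast values are read up front, and the key string is only built for the cell that actually matches. It assumes the same broadcast variables and input format as your snippet, and is untested:

    JavaPairRDD<String, Integer> pairs = lines.mapToPair(x -> {
        // Split the line once instead of once per field.
        String[] fields = x.split(" ");
        double dx = Double.parseDouble(fields[2]);
        double dy = Double.parseDouble(fields[3]);
        int dt = Integer.parseInt(fields[0]);

        List<Integer> t = brTime.getValue();
        List<Point2D.Double> p = brCoo.getValue();
        double dist = brDist.getValue();
        int dur = brDuration.getValue();

        for (int ct : t) {
            for (Point2D.Double cell : p) {
                double cx = cell.getX();
                double cy = cell.getY();

                if (dx > (cx - dist) && dx <= (cx + dist)
                        && dy > (cy - dist) && dy <= (cy + dist)
                        && dt > (ct - dur) && dt <= (ct + dur)) {
                    // Build the key string only for the matching cell.
                    return new Tuple2<String, Integer>(cx + "," + cy + "," + ct, 1);
                }
            }
        }
        return new Tuple2<String, Integer>("Out Of Bounds", 1);
    });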

Without actually reviewing the job plan, it's tough to say whether the above will make performance reach an acceptable level. Check out your history server application logs and see if there are any stages which are eating up your time - once you've identified a culprit there, that's what actually needs optimizing.

I tried mapPartitionsToPair and also moved the calculations of scx, scy and sct so that they are only computed if the point passes the conditions. The speed of the application has improved dramatically, down to only 17 minutes! I believe that mapPartitionsToPair was the biggest factor. Thanks a lot Mks and bsplosion!

Try to use mapPartitions; it's faster, see this example link. Another thing to do is to move that part of the code outside the timeCounter loop.
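For reference, here is a rough sketch of what that could look like with mapPartitionsToPair. It assumes Spark 2.x (where the function returns an Iterator) and the same broadcast variables and line format as in the question, and is untested:

    JavaPairRDD<String, Integer> pairs = lines.mapPartitionsToPair(iter -> {
        // Read the broadcast values once per partition instead of once per record.
        List<Integer> t = brTime.getValue();
        List<Point2D.Double> p = brCoo.getValue();
        double dist = brDist.getValue();
        int dur = brDuration.getValue();

        List<Tuple2<String, Integer>> out = new ArrayList<>(); // java.util.ArrayList
        while (iter.hasNext()) {
            String[] fields = iter.next().split(" ");
            double dx = Double.parseDouble(fields[2]);
            double dy = Double.parseDouble(fields[3]);
            int dt = Integer.parseInt(fields[0]);

            String key = "Out Of Bounds";
            outer:
            for (int ct : t) {
                for (Point2D.Double cell : p) {
                    double cx = cell.getX();
                    double cy = cell.getY();
                    if (dx > (cx - dist) && dx <= (cx + dist)
                            && dy > (cy - dist) && dy <= (cy + dist)
                            && dt > (ct - dur) && dt <= (ct + dur)) {
                        // Build the key string only for the matching cell.
                        key = cx + "," + cy + "," + ct;
                        break outer;
                    }
                }
            }
            out.add(new Tuple2<String, Integer>(key, 1));
        }
        return out.iterator();
    });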
