
Java Spark flatMap seems to be losing items in ArrayList

I'm iterating over billions of rows in Cassandra with the Spark/Cassandra driver and pulling out data to run statistics on. To accomplish this I run a FOR loop over each row of data, and if it falls within the criteria of a bucket of data I'm calling a "channel", I add it to an ArrayList as a K,V pair of channel,power.

[[Channel,Power]]

The channels should be static, based on the iteration increment of the FOR loop. For example, if my channel range is 0 through 10 with an increment of 2, then the channels would be 0, 2, 4, 6, 8, 10.

The FOR loop runs on the current row of data, checks whether the data falls within the channel, and if so adds it to the ArrayList in the format [[Channel,Power]].

It then proceeds to the next row and does the same. Once it has gone over all the rows, it increments to the next channel and repeats the process.

The issue is that billions of rows qualify for the same channel, so I'm not sure whether I should be using an ArrayList and flatMap or something else, since my results come out slightly different each time I run it and the channels are not static as they should be.
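
For what it's worth, flatMap itself does not lose elements: every item in the returned collection becomes an element of the resulting RDD, duplicates included. A minimal sketch, assuming a JavaSparkContext named sc and the Iterable-returning FlatMapFunction style used in my code below (Spark 1.x):

// Illustration only: flatMap keeps every element the function returns, duplicates included
JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2));
JavaRDD<Integer> dup = nums.flatMap(new FlatMapFunction<Integer, Integer>() {
    @Override
    public Iterable<Integer> call(Integer n) {
        return Arrays.asList(n, n); // emit two copies of each input element
    }
});
// dup.collect() -> [1, 1, 2, 2]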

A small sample of data [[Channel,Power]] would be:

[[2,5]]
[[2,10]]
[[2,5]]
[[2,15]]
[[2,5]]

Notice that there may be duplicate items that need to remain, since I run min, max, and average stats on each of these channels.

Channel 2: Min 5, Max 15, Avg 8
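
Those numbers check out against the sample: a quick plain-Java sanity check (illustrative only, not part of the Spark job) over the five channel-2 power values gives the same min, max, and average:

List<Float> powers = Arrays.asList(5f, 10f, 5f, 15f, 5f); // channel 2 samples from above
float min = Collections.min(powers);                       // 5.0
float max = Collections.max(powers);                       // 15.0
double avg = powers.stream().mapToDouble(Float::doubleValue).average().orElse(0); // 8.0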

My Code is as follows:

JavaRDD<MeasuredValue> rdd = javaFunctions(sc)
        .cassandraTable("SparkTestB", "Measured_Value", mapRowTo(MeasuredValue.class))
        .select("Start_Frequency", "Bandwidth", "Power");

JavaRDD<Value> valueRdd = rdd.flatMap(new FlatMapFunction<MeasuredValue, Value>() {
    @Override
    public Iterable<Value> call(MeasuredValue row) throws Exception {
        long start_frequency = row.getStart_frequency();
        float power = row.getPower();
        long bandwidth = row.getBandwidth();

        // Define variables
        long channel, channel_end, increment;

        // Initialize variables
        channel_end = 10;
        increment = 2;

        List<Value> list = new ArrayList<>();
        // Create Channel Power Buckets
        for (channel = 0; channel <= channel_end; ) {
            if ((channel >= start_frequency) && (channel <= (start_frequency + bandwidth))) {
                list.add(new Value(channel, power));
            } // end if
            channel += increment;
        } // end for
        return list;
    }
});

sqlContext.createDataFrame(valueRdd, Value.class).groupBy(col("channel"))
        .agg(min("power"), max("power"), avg("power"))
        .write().mode(SaveMode.Append)
        .option("table", "results")
        .option("keyspace", "model")
        .format("org.apache.spark.sql.cassandra").save();

My classes are as follows for the reflection:

public class Value implements Serializable {
    public Value(Long channel, Float power) {
        this.channel = channel;
        this.power = power;
    }
    Long channel;
    Float power;

    public void setChannel(Long channel) {
        this.channel = channel;
    }
    public void setPower(Float power) {
        this.power = power;
    }
    public Long getChannel() {
        return channel;
    }
    public Float getPower() {
        return power;
    }

    @Override
    public String toString() {
        return "[" +channel +","+power+"]";
    }
}

public static class MeasuredValue implements Serializable {
        public MeasuredValue() { }

        public long start_frequency;
        public long getStart_frequency() { return start_frequency; }
        public void setStart_frequency(long start_frequency) { this.start_frequency = start_frequency; }

        public long bandwidth ;
        public long getBandwidth() { return bandwidth; }
        public void setBandwidth(long bandwidth) { this.bandwidth = bandwidth; }

        public float power;    
        public float getPower() { return power; }
        public void setPower(float power) { this.power = power; }

    }

I discovered that the discrepancies were due to my channelization algorithm. I replaced it with the following to solve the problem.

        // Create Channel Power Buckets
        channel = 0; // channel must start at 0 so the buckets line up with the channel range
        for (; channel <= channel_end; channel += increment) {
            // Initial bucket
            while ((start_frequency >= channel) && (start_frequency < (channel + increment))) {
                list.add(new Value(channel, power));
                channel += increment;
            }
            // Buckets to accommodate the bandwidth
            while ((channel <= channel_end) && (channel >= start_frequency) && ((start_frequency + bandwidth) >= channel)) {
                list.add(new Value(channel, power));
                channel += increment;
            }
        }
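
For anyone who wants to check the corrected bucketing outside of Spark, here is a minimal standalone sketch; the sample row (start_frequency = 3, bandwidth = 4, power = 5) is made up purely for illustration. With channels 0 through 10 in steps of 2 it emits channel 2 (the bucket containing the start frequency) plus channels 4 and 6 (the buckets covered by the bandwidth), and the output is identical on every run:

import java.util.ArrayList;
import java.util.List;

public class BucketCheck {
    // Same logic as the corrected loop above, extracted into a plain method
    static List<String> buckets(long start_frequency, long bandwidth, float power) {
        long channel = 0, channel_end = 10, increment = 2;
        List<String> list = new ArrayList<>();
        for (; channel <= channel_end; channel += increment) {
            // Initial bucket: the channel that contains start_frequency
            while ((start_frequency >= channel) && (start_frequency < (channel + increment))) {
                list.add("[" + channel + "," + power + "]");
                channel += increment;
            }
            // Buckets to accommodate the bandwidth
            while ((channel <= channel_end) && (channel >= start_frequency)
                    && ((start_frequency + bandwidth) >= channel)) {
                list.add("[" + channel + "," + power + "]");
                channel += increment;
            }
        }
        return list;
    }

    public static void main(String[] args) {
        System.out.println(buckets(3, 4, 5f)); // prints [[2,5.0], [4,5.0], [6,5.0]]
    }
}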
