
Java Spark flatMap seems to be losing items in ArrayList

I'm iterating over billions of rows in Cassandra with the Spark/Cassandra driver and pulling out data to run statistics on. To accomplish this I run a FOR loop over each row of data, and if the row falls within the criteria of a bucket of data I'm calling a "channel", I add it to an ArrayList as a K,V pair of channel,power.

[[Channel,Power]]

The channels should be static, based on the iteration increment of the for loop. For example, if my channel range is 0 through 10 with an increment of 2, then the channels would be 0, 2, 4, 6, 8, 10.

The FOR loop runs on the current row of data and checks whether the data falls within the channel; if so, it adds the data to the ArrayList in the format [[Channel,Power]].

Then it proceeds to the next row and does the same. Once it has gone over all the rows, it increments to the next channel and repeats the process.

The issue is that billions of rows qualify for the same channel, so I'm not sure if I should be using an ArrayList and flatMap or something else, since my results are slightly different each time I run it and the channels are not static as they should be.
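For what it's worth, flatMap by itself does not drop or de-duplicate elements; it simply concatenates every list returned by the function. A minimal plain-`java.util.stream` analogy (not Spark itself, but the flatMap contract is the same) illustrates this:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlatMapKeepsDuplicates {
    public static void main(String[] args) {
        // Each "row" emits two identical [channel,power] pairs; flatMap
        // concatenates all of them and never de-duplicates.
        List<String> rows = Arrays.asList("row1", "row2", "row3");
        List<String> pairs = rows.stream()
                .flatMap(r -> Stream.of("[2,5]", "[2,5]"))
                .collect(Collectors.toList());
        System.out.println(pairs.size());  // 6: duplicates preserved
    }
}
```

So if items appear to go missing, the cause is more likely the per-row bucketing logic than flatMap itself.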

A small sample of the data [[Channel,Power]] would be:

[[2,5]]
[[2,10]]
[[2,5]]
[[2,15]]
[[2,5]]

Notice that there may be duplicate items that need to remain, since I run min, max, and average stats on each of these channels.

Channel 2: Min 5, Max 15, Avg 8
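The per-channel stats above can be reproduced from the sample data with the JDK's stream statistics (a local sketch, independent of Spark):

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;

public class ChannelStats {
    public static void main(String[] args) {
        // Power readings for channel 2 from the sample above
        List<Double> powers = Arrays.asList(5.0, 10.0, 5.0, 15.0, 5.0);
        DoubleSummaryStatistics stats = powers.stream()
                .mapToDouble(Double::doubleValue)
                .summaryStatistics();
        // Prints: Channel 2: Min 5, Max 15, Avg 8
        System.out.printf("Channel 2: Min %.0f, Max %.0f, Avg %.0f%n",
                stats.getMin(), stats.getMax(), stats.getAverage());
    }
}
```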

My code is as follows:

JavaRDD<MeasuredValue> rdd = javaFunctions(sc)
            .cassandraTable("SparkTestB", "Measured_Value", mapRowTo(MeasuredValue.class))
            .select("Start_Frequency", "Bandwidth", "Power");
    JavaRDD<Value> valueRdd = rdd.flatMap(new FlatMapFunction<MeasuredValue, Value>(){
      @Override
      public Iterable<Value> call(MeasuredValue row) throws Exception {
        long start_frequency = row.getStart_frequency();
        float power = row.getPower();
        long bandwidth = row.getBandwidth();

        // Define Variable
        long channel,channel_end, increment; 

        // Initialize Variables
        channel_end = 10;
        increment = 2;

        List<Value> list = new ArrayList<>();
        // Create Channel Power Buckets
        for(channel = 0; channel <= channel_end; ){
          if( (channel >= start_frequency) && (channel <= (start_frequency + bandwidth)) ) {
            list.add(new Value(channel, power));
          } // end if
          channel+=increment;
        } // end for 
        return list; 
      }
    });

     sqlContext.createDataFrame(valueRdd, Value.class).groupBy(col("channel"))
     .agg(min("power"), max("power"), avg("power"))
     .write().mode(SaveMode.Append)     
     .option("table", "results")
     .option("keyspace", "model")
     .format("org.apache.spark.sql.cassandra").save();

My classes are as follows, for the reflection:

public class Value implements Serializable {
    public Value(Long channel, Float power) {
        this.channel = channel;
        this.power = power;
    }
    Long channel;
    Float power;

    public void setChannel(Long channel) {
        this.channel = channel;
    }
    public void setPower(Float power) {
        this.power = power;
    }
    public Long getChannel() {
        return channel;
    }
    public Float getPower() {
        return power;
    }

    @Override
    public String toString() {
        return "[" +channel +","+power+"]";
    }
}

public static class MeasuredValue implements Serializable {
        public MeasuredValue() { }

        public long start_frequency;
        public long getStart_frequency() { return start_frequency; }
        public void setStart_frequency(long start_frequency) { this.start_frequency = start_frequency; }

        public long bandwidth ;
        public long getBandwidth() { return bandwidth; }
        public void setBandwidth(long bandwidth) { this.bandwidth = bandwidth; }

        public float power;    
        public float getPower() { return power; }
        public void setPower(float power) { this.power = power; }

    }

I discovered that the discrepancies were due to my channelization algorithm. I replaced it with the following to solve the problem.

        // Create Channel Power Buckets
        for(; channel <= channel_end; channel+=increment ){ 
            //Initial Bucket
            while((start_frequency >= channel) && (start_frequency < (channel + increment))){
                list.add(new Value(channel, power));
                channel+=increment;
            }
            //Buckets to accommodate the bandwidth
            while ((channel <= channel_end) && (channel >= start_frequency) && (start_frequency + bandwidth) >= channel){
                list.add(new Value(channel, power));                           
                channel+=increment;
            }                   
        }  
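To sanity-check the corrected loop, the bucketing logic can be pulled into a standalone method and run on sample values outside Spark. This is a sketch: the channel range 0..10 and increment 2 are the figures from the question, and `channelsFor` is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.List;

public class Channelizer {
    // Corrected bucketing logic from the answer, extracted so it can
    // be checked locally. Returns the channels a signal starting at
    // startFrequency with the given bandwidth falls into.
    static List<Long> channelsFor(long startFrequency, long bandwidth) {
        long channelEnd = 10, increment = 2, channel = 0;
        List<Long> channels = new ArrayList<>();
        for (; channel <= channelEnd; channel += increment) {
            // Initial bucket: the channel whose window contains the start
            while (startFrequency >= channel && startFrequency < channel + increment) {
                channels.add(channel);
                channel += increment;
            }
            // Buckets to accommodate the bandwidth
            while (channel <= channelEnd && channel >= startFrequency
                    && startFrequency + bandwidth >= channel) {
                channels.add(channel);
                channel += increment;
            }
        }
        return channels;
    }

    public static void main(String[] args) {
        System.out.println(channelsFor(3, 4)); // signal spanning 3..7 -> [2, 4, 6]
        System.out.println(channelsFor(4, 4)); // signal spanning 4..8 -> [4, 6, 8]
    }
}
```

Unlike the original loop, a given channel can no longer be visited twice (once by the outer loop and again after the inner increments), which is what made the output vary between runs.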
