
Java Spark flatMap seems to be losing items in ArrayList

I am going through billions of rows in Cassandra using the Spark/Cassandra driver and pulling out data to run statistics on. To do this I run a FOR loop over each row of data, and if it meets the criteria for a class of data I call a "channel", I add it to an ArrayList as a K,V pair of Channel, Power:

[[Channel,Power]]

The channels should be static, based on the increment the FOR loop steps by. For example, if my channel range is 0 to 10 with an increment of 2, the channels would be 0, 2, 4, 6, 8, 10.

The FOR loop runs against the current row of data and checks whether the data falls within a channel; if it does, it is added to the ArrayList Data in the format [[Channel,Power]].

It then moves on to the next row and does the same. Once all rows have been gone through, it increments to the next channel and repeats the process.

The problem is that billions of rows qualify for the same channel, so I am not sure whether I should be using an ArrayList with flatMap or some other function, because the results are slightly different every time I run it and the channels are not static like they should be.
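For what it's worth, flatMap itself preserves every element of the collection returned from call(), duplicates included. Below is a minimal, self-contained sketch, not the original job, that expands three made-up input rows into six [Channel,Power]-style strings; it targets the older Spark Java API used in the code further down, where call() returns an Iterable (newer Spark versions expect an Iterator instead):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class FlatMapSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("flatMapSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Three input "rows"; each one emits the same two channels.
        JavaRDD<Integer> rows = sc.parallelize(Arrays.asList(1, 2, 3));
        JavaRDD<String> pairs = rows.flatMap(new FlatMapFunction<Integer, String>() {
            @Override
            public Iterable<String> call(Integer row) {
                List<String> out = new ArrayList<>();
                out.add("[2," + row + "]");
                out.add("[4," + row + "]");
                return out;
            }
        });

        // Prints 6 elements: every emitted pair survives, duplicates and all.
        System.out.println(pairs.collect());
        sc.stop();
    }
}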

A small sample of the data [[Channel,Power]] would be:

[[2,5]]
[[2,10]]
[[2,5]]
[[2,15]]
[[2,5]]

Note that some of the duplicates need to be kept, since I run min, max and average statistics on each of these channels.

Channel 2: min 5, max 15, avg 8 (the average of 5, 10, 5, 15 and 5 is 40/5 = 8)

My code is as follows:

JavaRDD<MeasuredValue> rdd = javaFunctions(sc).cassandraTable("SparkTestB", "Measured_Value", mapRowTo(MeasuredValue.class))
            .select("Start_Frequency","Bandwidth","Power");
    JavaRDD<Value> valueRdd = rdd.flatMap(new FlatMapFunction<MeasuredValue, Value>(){
      @Override
      public Iterable<Value> call(MeasuredValue row) throws Exception {
        long start_frequency = row.getStart_frequency();
        float power = row.getPower();
        long bandwidth = row.getBandwidth();

        // Define Variable
        long channel,channel_end, increment; 

        // Initialize Variables
        channel_end = 10;
        increment = 2;

        List<Value> list = new ArrayList<>();
        // Create Channel Power Buckets
        for(channel = 0; channel <= channel_end; ){
          if( (channel >= start_frequency) && (channel <= (start_frequency + bandwidth)) ) {
            list.add(new Value(channel, power));
          } // end if
          channel+=increment;
        } // end for 
        return list; 
      }
    });

     sqlContext.createDataFrame(valueRdd, Value.class).groupBy(col("channel"))
     .agg(min("power"), max("power"), avg("power"))
     .write().mode(SaveMode.Append)     
     .option("table", "results")
     .option("keyspace", "model")
     .format("org.apache.spark.sql.cassandra").save();
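As a sanity check, the same groupBy/agg step can be run locally against the five sample pairs listed above; this sketch assumes a local JavaSparkContext created for testing and reuses the Value class shown below. Channel 2 should come out as min 5.0, max 15.0, avg 8.0:

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.max;
import static org.apache.spark.sql.functions.min;

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;

public class AggregationSketch {
    public static void run(JavaSparkContext sc) {
        SQLContext sqlContext = new SQLContext(sc.sc());

        // The five sample [Channel,Power] pairs from above, all in channel 2.
        JavaRDD<Value> valueRdd = sc.parallelize(Arrays.asList(
                new Value(2L, 5f), new Value(2L, 10f), new Value(2L, 5f),
                new Value(2L, 15f), new Value(2L, 5f)));

        sqlContext.createDataFrame(valueRdd, Value.class)
                .groupBy(col("channel"))
                .agg(min("power"), max("power"), avg("power"))
                .show();   // expected row: 2 | 5.0 | 15.0 | 8.0
    }
}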

My classes are as follows, for reflection:

public class Value implements Serializable {
    public Value(Long channel, Float power) {
        this.channel = channel;
        this.power = power;
    }
    Long channel;
    Float power;

    public void setChannel(Long channel) {
        this.channel = channel;
    }
    public void setPower(Float power) {
        this.power = power;
    }
    public Long getChannel() {
        return channel;
    }
    public Float getPower() {
        return power;
    }

    @Override
    public String toString() {
        return "[" +channel +","+power+"]";
    }
}

public static class MeasuredValue implements Serializable {
        public MeasuredValue() { }

        public long start_frequency;
        public long getStart_frequency() { return start_frequency; }
        public void setStart_frequency(long start_frequency) { this.start_frequency = start_frequency; }

        public long bandwidth ;
        public long getBandwidth() { return bandwidth; }
        public void setBandwidth(long bandwidth) { this.bandwidth = bandwidth; }

        public float power;    
        public float getPower() { return power; }
        public void setPower(float power) { this.power = power; }

    }

I found that the discrepancy had to do with my channelization algorithm. I solved the problem by replacing it with the following.

        // Create Channel Power Buckets
        for(; channel <= channel_end; channel+=increment ){ 
            //Initial Bucket
            while((start_frequency >= channel) && (start_frequency < (channel + increment))){
                list.add(new Value(channel, power));
                channel+=increment;
            }
            //Buckets to Accommodate the Bandwidth
            while ((channel <= channel_end) && (channel >= start_frequency) && (start_frequency + bandwidth) >= channel){
                list.add(new Value(channel, power));                           
                channel+=increment;
            }                   
        }  
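For reference, the corrected bucketing loop can also be exercised outside Spark with hypothetical sample values (start_frequency = 3, bandwidth = 4, power = 5), so the channel assignment can be checked deterministically; a row covering frequencies 3 through 7 should land in channels 2, 4 and 6:

import java.util.ArrayList;
import java.util.List;

public class ChannelizationCheck {
    public static void main(String[] args) {
        // Hypothetical sample row, chosen only for illustration.
        long start_frequency = 3;
        long bandwidth = 4;
        float power = 5f;
        long channel = 0, channel_end = 10, increment = 2;

        List<String> list = new ArrayList<>();
        for (; channel <= channel_end; channel += increment) {
            // Initial bucket: the channel interval containing start_frequency
            while ((start_frequency >= channel) && (start_frequency < (channel + increment))) {
                list.add("[" + channel + "," + power + "]");
                channel += increment;
            }
            // Buckets to accommodate the bandwidth
            while ((channel <= channel_end) && (channel >= start_frequency)
                    && (start_frequency + bandwidth) >= channel) {
                list.add("[" + channel + "," + power + "]");
                channel += increment;
            }
        }
        // Prints [[2,5.0], [4,5.0], [6,5.0]]
        System.out.println(list);
    }
}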
