
Apache Spark SQL - Group By

I have the following JSON data coming from RabbitMQ:

{"DeviceId":"MACH-101","TimeStamp":"29-06-2017 15:21:30","data":{"RunStatus":1}}
{"DeviceId":"MACH-101","TimeStamp":"29-06-2017 15:21:35","data":{"RunStatus":3}}
{"DeviceId":"MACH-101","TimeStamp":"29-06-2017 15:21:40","data":{"RunStatus":2}}
{"DeviceId":"MACH-101","TimeStamp":"29-06-2017 15:21:45","data":{"RunStatus":3}}
{"DeviceId":"MACH-101","TimeStamp":"29-06-2017 15:21:50","data":{"RunStatus":2}}

{"DeviceId":"MACH-102","TimeStamp":"29-06-2017 15:21:35","data":{"RunStatus":1}}
{"DeviceId":"MACH-102","TimeStamp":"29-06-2017 15:21:45","data":{"RunStatus":3}}
{"DeviceId":"MACH-102","TimeStamp":"29-06-2017 15:21:50","data":{"RunStatus":2}}
{"DeviceId":"MACH-102","TimeStamp":"29-06-2017 15:21:55","data":{"RunStatus":3}}
{"DeviceId":"MACH-102","TimeStamp":"29-06-2017 15:22:00","data":{"RunStatus":2}}

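I am trying to get the duration of each RunStatus a device was in. For the data above, the durations for device MACH-101 would look like this:

In RunStatus 1 the device was for 5 seconds (30-35)
In RunStatus 2 the device was for 5 seconds (40-45)
In RunStatus 3 the device was for 10 seconds (35-40 + 45-50)

The same logic applies to the second device's data.

Below is the Apache Spark SQL query I am trying, but it does not give the desired result. Please suggest some alternatives; I don't mind doing this in a non-SQL way either.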

public static void main(String[] args) {

        try {

            mconf = new SparkConf();
            mconf.setAppName("RabbitMqReceiver");
            mconf.setMaster("local[*]");

            jssc = new JavaStreamingContext(mconf,Durations.seconds(10));

            SparkSession spksess = SparkSession
                    .builder()
                    .master("local[*]")
                    .appName("RabbitMqReceiver2")
                    .getOrCreate();

            SQLContext sqlctxt = new SQLContext(spksess);

            JavaDStream<String> strmData = jssc.receiverStream(new mqreceiver(StorageLevel.MEMORY_AND_DISK_2()));

            JavaDStream<String> machineData = strmData.window(Durations.minutes(1),Durations.seconds(10));

            sqlctxt.udf().register("custdatediff", new UDF2<String, String, String>() {

                @Override public String call(String argdt1,String argdt2) {

                        DateTimeFormatter formatter = DateTimeFormat.forPattern("dd-MM-yyyy HH:mm:ss");
                        DateTime dt1 = formatter.parseDateTime(argdt1);
                        DateTime dt2 = formatter.parseDateTime(argdt2);

                        Seconds retsec = org.joda.time.Seconds.secondsBetween(dt2, dt1);
                        return retsec.toString();

                 }
            },DataTypes.StringType);

            machineData.foreachRDD(new VoidFunction<JavaRDD<String>>() {

                @Override
                public void call(JavaRDD<String> rdd) {
                    if(!rdd.isEmpty()){

                        Dataset<Row> df = sqlctxt.jsonRDD(rdd);
                        df.createOrReplaceTempView("DeviceData");

                        // I don't want to group by timestamp, but the query requires that I pass it.

                        Dataset<Row> searchResult = sqlctxt.sql("select t1.DeviceId,t1.data.runstatus,"
                                + " custdatediff(CAST((t1.timestamp) as STRING),CAST((t2.timestamp) as STRING)) as duration from DeviceData t1"
                                + " join DeviceData t2 on t1.DeviceId = t2.DeviceId group by t1.DeviceId,t1.data.runstatus,t1.timestamp,t2.timestamp");

                        searchResult.show();

                    }
                }
            });

            jssc.start();

            jssc.awaitTermination();

        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

A sample result of running the above code/SQL is shown below:

 +--------+---------+--------+
 |DeviceId|runstatus|duration|
 +--------+---------+--------+
 | NTC-167|        2|    PT0S|
 | NTC-168|        2|    PT0S|
 | NTC-168|        2|  PT-10S|
 | NTC-168|        2|  PT-15S|
 | NTC-168|        1|   PT10S|
 | NTC-168|        1|    PT0S|
 | NTC-168|        1|   PT-5S|
 | NTC-168|        1|   PT15S|
 | NTC-168|        1|    PT5S|
 | NTC-168|        1|    PT0S|
 +--------+---------+--------+

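As you can see, the statuses are repeated, and among the repeated rows one has the correct result. The query I wrote also forces me to group by timestamp. I suspect the result might be correct if I could avoid grouping by timestamp, but I am not sure about this.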

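You can try using DataFrames with window functions. Using "lead" as a window function, you can compare the current row's timestamp with the next row's timestamp and find the difference for each device and run status. Something like this: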

 import org.apache.spark.sql.expressions.Window
 import org.apache.spark.sql.functions.lead
 val windowSpec_wk = Window.partitionBy(df1("DeviceID")).orderBy(df1("timestamp"))
 val df2 = df1.withColumn("period", lead(df1("timestamp"), 1).over(windowSpec_wk))
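
To make the pairing that "lead" performs concrete, here is a minimal sketch in plain Java (no Spark) that emulates it on the sample data from the question. It is only an illustration, not the answer's implementation: the class RunStatusDurations, the Reading holder and the method names are hypothetical, and it assumes that the gap between a row and the next row of the same device is attributed to the current row's RunStatus, with the last row of each device skipped because it has no successor.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RunStatusDurations {

    // Same timestamp pattern as the messages in the question.
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("dd-MM-yyyy HH:mm:ss");

    // Hypothetical holder for one parsed message (not part of the original code).
    static final class Reading {
        final String deviceId;
        final LocalDateTime timestamp;
        final int runStatus;

        Reading(String deviceId, String timestamp, int runStatus) {
            this.deviceId = deviceId;
            this.timestamp = LocalDateTime.parse(timestamp, FORMAT);
            this.runStatus = runStatus;
        }
    }

    // Returns deviceId -> (runStatus -> total seconds spent in that status).
    static Map<String, Map<Integer, Long>> secondsPerStatus(List<Reading> readings) {
        // Equivalent of partitionBy(DeviceId): split the readings per device.
        Map<String, List<Reading>> byDevice = new TreeMap<>();
        for (Reading r : readings) {
            byDevice.computeIfAbsent(r.deviceId, k -> new ArrayList<>()).add(r);
        }

        Map<String, Map<Integer, Long>> result = new TreeMap<>();
        for (Map.Entry<String, List<Reading>> entry : byDevice.entrySet()) {
            List<Reading> rows = entry.getValue();
            // Equivalent of orderBy(timestamp) inside the partition.
            rows.sort(Comparator.comparing((Reading r) -> r.timestamp));

            Map<Integer, Long> totals = new TreeMap<>();
            // Equivalent of lead(timestamp, 1): pair each row with the next one.
            // The last row has no next row and contributes nothing.
            for (int i = 0; i + 1 < rows.size(); i++) {
                Reading current = rows.get(i);
                Reading next = rows.get(i + 1);
                long seconds = Duration.between(current.timestamp, next.timestamp).getSeconds();
                totals.merge(current.runStatus, seconds, Long::sum);
            }
            result.put(entry.getKey(), totals);
        }
        return result;
    }

    // The ten sample messages from the question.
    static List<Reading> sampleReadings() {
        List<Reading> readings = new ArrayList<>();
        readings.add(new Reading("MACH-101", "29-06-2017 15:21:30", 1));
        readings.add(new Reading("MACH-101", "29-06-2017 15:21:35", 3));
        readings.add(new Reading("MACH-101", "29-06-2017 15:21:40", 2));
        readings.add(new Reading("MACH-101", "29-06-2017 15:21:45", 3));
        readings.add(new Reading("MACH-101", "29-06-2017 15:21:50", 2));
        readings.add(new Reading("MACH-102", "29-06-2017 15:21:35", 1));
        readings.add(new Reading("MACH-102", "29-06-2017 15:21:45", 3));
        readings.add(new Reading("MACH-102", "29-06-2017 15:21:50", 2));
        readings.add(new Reading("MACH-102", "29-06-2017 15:21:55", 3));
        readings.add(new Reading("MACH-102", "29-06-2017 15:22:00", 2));
        return readings;
    }

    public static void main(String[] args) {
        // prints {MACH-101={1=5, 2=5, 3=10}, MACH-102={1=10, 2=5, 3=10}}
        System.out.println(secondsPerStatus(sampleReadings()));
    }
}
```

For MACH-101 this reproduces the 5, 5 and 10 seconds listed in the question. In Spark, the corresponding last step would presumably be to subtract the current timestamp from the "period" column (for example after converting both with unix_timestamp and the dd-MM-yyyy HH:mm:ss pattern) and then sum the differences with a groupBy on DeviceId and RunStatus, so no grouping by timestamp is needed.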
