
Java Spark Streaming JSON parsing

I have started learning Spark Streaming on the Spark engine and am very new to data analytics and Spark. I just want to create a small IoT application in which I want to forecast future data.

I have Tiva hardware which sends real-time sensor JSON data as follows:

[{"t":1478091719000,"sensors":[{"s":"s1","d":"+253.437"},{"s":"s2","d":"+129.750"},{"s":"s3","d":"+45.500"},{"s":"s4","d":"+255.687"},{"s":"s5","d":"+290.062"},{"s":"s6","d":"+281.500"},{"s":"s7","d":"+308.250"},{"s":"s8","d":"+313.812"}]}]

Here `t` is the unix timestamp at which the data was posted, and `sensors` is an array of sensors, each with a sensor id (`s`) and its data (`d`).

What I want to do is consume this data with Spark Streaming, create objects from it, and then pass all the data through Spark's MLlib (machine learning) or an equivalent library to forecast future data.

I want a general idea about:

  1. Whether this will be possible with all the technology choices I have decided to use?
  2. How can I consume the nested JSON? I tried using SQLContext but got no success.
  3. General guidelines to achieve what I am trying to do here.

Here is the code which I am using to consume messages from Kafka:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;
    import scala.Tuple2;

    SparkConf conf = new SparkConf().setAppName("DattusSpark").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(2000));

    // Kafka direct-stream configuration
    Map<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put("metadata.broker.list", "kafkaserver_address:9092");
    Set<String> topics = Collections.singleton("RAH");

    JavaPairInputDStream<String, String> directKafkaStream =
            KafkaUtils.createDirectStream(ssc, String.class, String.class,
                    StringDecoder.class, StringDecoder.class, kafkaParams, topics);

    // Extract the message value (the JSON payload) from each (key, value) pair
    JavaDStream<String> json = directKafkaStream.map(new Function<Tuple2<String, String>, String>() {
        public String call(Tuple2<String, String> message) throws Exception {
            return message._2();
        }
    });

    json.foreachRDD(rdd -> {
        rdd.foreach(record -> System.out.println(record));
    });

    ssc.start();
    ssc.awaitTermination();

PS: I want to do this in Java so as to maintain linearity and good performance.

Since you are using Spark 2.0, you can read the JSON from the SparkSession:

    json.foreachRDD(rdd -> {
        Dataset<Row> df = spark.read().json(rdd);
        // process the JSON with this DataFrame
    });

Or you can convert the RDD to an RDD of Row, and then use the createDataFrame method.

    json.foreachRDD(rdd -> {
        // rowRDD: the JSON strings converted to Rows; schema: the matching StructType
        Dataset<Row> df = spark.createDataFrame(rowRDD, schema);
        // process the JSON with this DataFrame
    });

Nested JSON processing is possible from the DataFrame; you can follow this article.
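For the sample payload above, a minimal standalone sketch of flattening the nested `sensors` array with `explode` (assuming Spark 2.2+, where `read().json(Dataset<String>)` is available, and spark-sql on the classpath; the class name is illustrative):

```java
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

public class ExplodeSensors {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ExplodeSensors").master("local[2]").getOrCreate();

        // One JSON record with the same shape as the Tiva payload (shortened to two sensors)
        String json = "{\"t\":1478091719000,\"sensors\":"
                + "[{\"s\":\"s1\",\"d\":\"+253.437\"},{\"s\":\"s2\",\"d\":\"+129.750\"}]}";
        Dataset<String> raw = spark.createDataset(
                Collections.singletonList(json), Encoders.STRING());
        Dataset<Row> df = spark.read().json(raw);

        // explode() turns each element of the sensors array into its own row,
        // giving one (timestamp, sensor id, value) row per sensor
        Dataset<Row> flat = df
                .select(col("t"), explode(col("sensors")).as("sensor"))
                .select(col("t"), col("sensor.s"), col("sensor.d"));

        flat.show(false);
        spark.stop();
    }
}
```

The same `select`/`explode` pipeline can be applied to the DataFrame built inside `foreachRDD` above.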

Also, once you convert your JSON to a DataFrame, you can use it in any Spark module (like Spark SQL or ML).

Answers to your questions:

1) Whether this will be possible with all the technology choices I have decided to use?

`Ans: Yes, it can be done, and it is quite a normal use-case for Spark.`

2) How can I consume the nested JSON? I tried using SQLContext but got no success.

`Ans: Nested JSON with SQLContext is a little tricky. You may want to use Jackson or some other JSON library.`
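As a sketch of the Jackson route, the payload can be mapped onto plain Java classes (assuming `jackson-databind` is on the classpath; the class and variable names here are illustrative, chosen to match the payload's fields):

```java
import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;

public class SensorParser {
    // POJOs matching the payload: [{"t":...,"sensors":[{"s":"s1","d":"+253.437"}, ...]}]
    public static class Reading {
        public long t;               // unix timestamp in milliseconds
        public List<Sensor> sensors; // nested array of per-sensor values
    }

    public static class Sensor {
        public String s; // sensor id, e.g. "s1"
        public String d; // reading as a signed decimal string, e.g. "+253.437"
    }

    public static void main(String[] args) throws Exception {
        String json = "[{\"t\":1478091719000,\"sensors\":"
                + "[{\"s\":\"s1\",\"d\":\"+253.437\"}]}]";

        // The top-level JSON value is an array, so bind to Reading[]
        ObjectMapper mapper = new ObjectMapper();
        Reading[] readings = mapper.readValue(json, Reading[].class);

        // The "+253.437" strings parse directly as doubles
        double value = Double.parseDouble(readings[0].sensors.get(0).d);
        System.out.println(readings[0].t + " "
                + readings[0].sensors.get(0).s + "=" + value);
    }
}
```

A parser like this could run inside the `map` over the Kafka stream, turning each message into typed objects before any feature extraction.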

3) General guidelines to achieve what I am trying to do here.

Ans: Consuming messages through Kafka seems fine, but only a limited set of machine learning algorithms is supported through streaming.

If you want to use other machine learning algorithms or third-party libraries, perhaps you should consider model creation as a batch job that emits the model at the end. The streaming job should then load the model, take in the stream of data, and only run predictions.
