
How to extract RDD content and put it in a DataFrame using Spark (Scala)

What I am trying to do is simply extract some information from an RDD and put it in a DataFrame, using Spark (Scala).

So far, what I've done is create a streaming pipeline that connects to a Kafka topic and puts the content of the topic in an RDD:

val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "test",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("vittorio")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    val row = stream.map(record => record.value)
    row.foreachRDD { (rdd: RDD[String], time: Time) =>


      rdd.collect.foreach(println)

      val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
      import spark.implicits._
      val DF = rdd.toDF()

      DF.show()
    }

    ssc.start()             // Start the computation
    ssc.awaitTermination()

  }

  object SparkSessionSingleton {

    @transient  private var instance: SparkSession = _

    def getInstance(sparkConf: SparkConf): SparkSession = {
      if (instance == null) {
        instance = SparkSession
          .builder
          .config(sparkConf)
          .getOrCreate()
      }
      instance
    }
  }

Now, the content of my RDD is:

{"event":"bank.legal.patch","ts":"2017-04-15T15:18:32.469+02:00","svc":"dpbank.stage.tlc-1","request":{"ts":"2017-04-15T15:18:32.993+02:00","aw":"876e6d71-47c4-40f6-8c49-5dbd7b8e246b","end_point":"/bank/v1/legal/mxHr+bhbNqEwFvXGn4l6jQ==","method":"PATCH","app_instance":"e73e93d9-e70d-4873-8f98-b00c6fe4d036-1491406011","user_agent":"Dry/1.0.st/Android/5.0.1/Sam-SM-N910C","user_id":53,"user_ip":"151.14.81.82","username":"7cV0Y62Rud3MQ==","app_id":"db2ffeac6c087712530981e9871","app_name":"DrApp"},"operation":{"scope":"mdpapp","result":{"http_status":200}},"resource":{"object_id":"mxHr+bhbNqEwFvXGn4l6jQ==","request_attributes":{"legal_user":{"sharing_id":"mxHr+bhbNqEwFvXGn4l6jQ==","ndg":"","taxcode":"IQ7hUUphxFBXnI0u2fxuCg==","status":"INCOMPLETE","residence":{"city":"CAA","address":"Via Batto 44","zipcode":"926","country_id":18,"city_id":122},"business_categories":[5],"company_name":"4Gzb+KJk1XAQ==","vat_number":"162340159"}},"response_attributes":{"legal_user":{"sharing_id":"mGn4l6jQ==","taxcode":"IQ7hFBXnI0u2fxuCg==","status":"INCOMPLETE","residence":{"city":"CATA","address":"Via Bllo 44","zipcode":"95126","country_id":128,"city_id":12203},"business_categories":[5],"company_name":"4GnU/Nczb+KJk1XAQ==","vat_number":"12960159"}}},"class":"DPAPI"}

and doing val DF = rdd.toDF() shows:

+--------------------+
|               value|
+--------------------+
|{"event":"bank.le...|
+--------------------+

What I would like to achieve is a DataFrame that keeps being populated as new RDDs arrive from the stream. A sort of union method, but I'm not sure that is the correct way, because I'm not sure all the RDDs will have the same schema.

For example, this is what I would like to achieve:

+--------------------+------------+----------+-----+
|                 _id|     user_ip|    status|_type|
+--------------------+------------+----------+-----+
|AVtJFVOUVxUyIIcAklfZ|151.14.81.82|INCOMPLETE|DPAPI|
|AVtJFVOUVxUyIIcAklfZ|151.14.81.82|INCOMPLETE|DPAPI|
+--------------------+------------+----------+-----+

Thanks!

If your RDD is

{"event":"bank.legal.patch","ts":"2017-04-15T15:18:32.469+02:00","svc":"dpbank.stage.tlc-1","request":{"ts":"2017-04-15T15:18:32.993+02:00","aw":"876e6d71-47c4-40f6-8c49-5dbd7b8e246b","end_point":"/bank/v1/legal/mxHr+bhbNqEwFvXGn4l6jQ==","method":"PATCH","app_instance":"e73e93d9-e70d-4873-8f98-b00c6fe4d036-1491406011","user_agent":"Dry/1.0.st/Android/5.0.1/Sam-SM-N910C","user_id":53,"user_ip":"151.14.81.82","username":"7cV0Y62Rud3MQ==","app_id":"db2ffeac6c087712530981e9871","app_name":"DrApp"},"operation":{"scope":"mdpapp","result":{"http_status":200}},"resource":{"object_id":"mxHr+bhbNqEwFvXGn4l6jQ==","request_attributes":{"legal_user":{"sharing_id":"mxHr+bhbNqEwFvXGn4l6jQ==","ndg":"","taxcode":"IQ7hUUphxFBXnI0u2fxuCg==","status":"INCOMPLETE","residence":{"city":"CAA","address":"Via Batto 44","zipcode":"926","country_id":18,"city_id":122},"business_categories":[5],"company_name":"4Gzb+KJk1XAQ==","vat_number":"162340159"}},"response_attributes":{"legal_user":{"sharing_id":"mGn4l6jQ==","taxcode":"IQ7hFBXnI0u2fxuCg==","status":"INCOMPLETE","residence":{"city":"CATA","address":"Via Bllo 44","zipcode":"95126","country_id":128,"city_id":12203},"business_categories":[5],"company_name":"4GnU/Nczb+KJk1XAQ==","vat_number":"12960159"}}},"class":"DPAPI"}

Then you can use sqlContext's read.json to read the RDD into a valid DataFrame, and then select only the needed fields:

val df = sqlContext.read.json(rdd) // rdd is already an RDD[String], so there is no need to parallelize it

df.select($"request.user_id"as("user_id"),
          $"request.user_ip"as("user_ip"),
          $"request.app_id"as("app_id"),
          $"resource.request_attributes.legal_user.status"as("status"),
          $"class")
  .show(false)

This should result in the following DataFrame:

+-------+------------+---------------------------+----------+-----+
|user_id|user_ip     |app_id                     |status    |class|
+-------+------------+---------------------------+----------+-----+
|53     |151.14.81.82|db2ffeac6c087712530981e9871|INCOMPLETE|DPAPI|
+-------+------------+---------------------------+----------+-----+

You can get the required columns as you wish using the above method. I hope the answer is helpful.
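
If you are on Spark 2.x, the same read-and-select can be slotted straight into the question's foreachRDD. A minimal sketch, assuming the SparkSessionSingleton from the question; rdd.toDS() is used because read.json on an RDD[String] is deprecated in newer 2.x releases:

    row.foreachRDD { (rdd: RDD[String], time: Time) =>
      val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
      import spark.implicits._

      // parse the JSON lines of this micro-batch into a structured DataFrame
      val df = spark.read.json(rdd.toDS())

      // flatten the nested structure, keeping only the fields of interest
      df.select($"request.user_id".as("user_id"),
                $"request.user_ip".as("user_ip"),
                $"request.app_id".as("app_id"),
                $"resource.request_attributes.legal_user.status".as("status"),
                $"class")
        .show(false)
    }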

You can union the current DataFrame with an existing one:

First, create an empty DataFrame when the program starts:

val df = // here create DF with required schema
df.createOrReplaceTempView("savedDF")
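
A minimal sketch of that first step, assuming the four columns from the desired output in the question (the column names and types are illustrative):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // illustrative schema matching the desired output columns
    val schema = StructType(Seq(
      StructField("_id", StringType),
      StructField("user_ip", StringType),
      StructField("status", StringType),
      StructField("_type", StringType)
    ))

    // empty DataFrame with that schema, registered so later batches can union into it
    val df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
    df.createOrReplaceTempView("savedDF")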

Now, in foreachRDD:

// here we are in foreachRDD
val df = // create DataFrame from RDD
val existingCachedDF = spark.table("savedDF") // get reference to existing DataFrame
val union = existingCachedDF.union(df)
union.createOrReplaceTempView("savedDF")

A good idea would be to checkpoint the DataFrame every few micro-batches, to keep its logical plan from growing very long.
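
A sketch of what that checkpointing could look like, assuming Spark 2.1+ (where Dataset.checkpoint is available) and a checkpoint directory set via spark.sparkContext.setCheckpointDir; the every-10-batches counter is just an illustration:

    // inside foreachRDD, after building `union`
    batchCount += 1                                  // var batchCount = 0L declared outside foreachRDD
    val saved =
      if (batchCount % 10 == 0) union.checkpoint()   // truncate the lineage every 10 micro-batches
      else union
    saved.createOrReplaceTempView("savedDF")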

The other idea is to use Structured Streaming, which will replace Spark Streaming.
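
For reference, a rough sketch of the Structured Streaming equivalent, assuming Spark 2.2+ with the spark-sql-kafka-0-10 package on the classpath; the schema only covers the fields selected above:

    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._
    import spark.implicits._

    // partial schema, only the fields we want to extract (illustrative)
    val schema = new StructType()
      .add("class", StringType)
      .add("request", new StructType()
        .add("user_id", LongType)
        .add("user_ip", StringType)
        .add("app_id", StringType))

    val parsed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "vittorio")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")
      .select(from_json($"json", schema).as("data"))
      .select($"data.request.user_id", $"data.request.user_ip",
              $"data.request.app_id", $"data.class")

    parsed.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()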
