Convert RDD of Array[Array[String]] to DataFrame

I have a dataset in RDD format, where each entry is an Array[Array[String]]. Each entry is an array of key/value pairs, and a given entry may not contain all possible keys.

An example of a possible entry is [[K1, V1], [K2, V2], [K3, V3], [K5, V5], [K7, V7]], and another might be [[K1, V1], [K3, V3], [K21, V21]].

What I hope to achieve is to bring this RDD into a DataFrame format. K1, K2, etc. always represent the same String in every row (i.e. K1 is always "type" and K2 is always "color"), and I want to use these as the columns. The values V1, V2, etc. differ from row to row, and I want to use them to populate the values of those columns.

I'm not sure how to achieve this, so I would appreciate any help/pointers.
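In other words, each entry is effectively a key-to-value map, with the keys becoming column names and absent keys leaving the corresponding columns empty. A minimal sketch of that view, using the first example entry above (plain Scala, no Spark involved):

// One entry from the example above, viewed as a key -> value map.
// Keys that are absent (e.g. K21 here) should simply end up as empty/null columns.
val entry: Array[Array[String]] =
  Array(Array("K1", "V1"), Array("K2", "V2"), Array("K3", "V3"), Array("K5", "V5"), Array("K7", "V7"))

val asMap: Map[String, String] = entry.map(kv => kv(0) -> kv(1)).toMap
// asMap: Map(K1 -> V1, K2 -> V2, K3 -> V3, K5 -> V5, K7 -> V7)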

You can do something like this:

import java.util.UUID

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

val l1: Array[Array[String]] = Array(
  Array("K1", "V1"),
  Array("K2", "V2"),
  Array("K3", "V3"),
  Array("K5", "V5"),
  Array("K7", "V7"))

val l2: Array[Array[String]] = Array(
  Array("K1", "V1"),
  Array("K3", "V3"),
  Array("K21", "V21"))

val spark = SparkSession.builder().master("local").getOrCreate()
val sc = spark.sparkContext

// Flatten each entry into (id, key, value) rows. The random UUID tags all
// key/value pairs belonging to the same entry, so they can be grouped back
// into a single row later.
val rdd = sc.parallelize(Array(l1, l2)).flatMap(x => {
  val id = UUID.randomUUID().toString
  x.map(y => Row(id, y(0), y(1)))
})

val schema = new StructType()
  .add("id", "string")
  .add("key", "string")
  .add("value", "string")

// Pivot the keys into columns, producing one output row per original entry.
val df = spark
  .createDataFrame(rdd, schema)
  .groupBy("id")
  .pivot("key")
  .agg(last("value"))
  .drop("id")

df.printSchema()
df.show(false)

The schema and output look something like this:

root
 |-- K1: string (nullable = true)
 |-- K2: string (nullable = true)
 |-- K21: string (nullable = true)
 |-- K3: string (nullable = true)
 |-- K5: string (nullable = true)
 |-- K7: string (nullable = true)

+---+----+----+---+----+----+
|K1 |K2  |K21 |K3 |K5  |K7  |
+---+----+----+---+----+----+
|V1 |null|V21 |V3 |null|null|
|V1 |V2  |null|V3 |V5  |V7  |
+---+----+----+---+----+----+

Note: this will produce null in the missing places, as shown in the output. pivot essentially transposes the dataset based on the values of a column. Hope this answers your question!
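If the full set of keys is known up front (as the question suggests, e.g. K1 always means "type"), you can also pass the values to pivot explicitly; Spark then skips the extra pass it otherwise runs to discover the distinct keys, and the resulting columns come out in the order you list them. A small sketch reusing the rdd and schema from above; the key list here is just the one from the example data, so treat it as an assumption:

// Explicit pivot values: no extra job to compute distinct keys,
// and columns appear in exactly this order.
val knownKeys = Seq("K1", "K2", "K3", "K5", "K7", "K21")  // assumed full key set

val df2 = spark
  .createDataFrame(rdd, schema)
  .groupBy("id")
  .pivot("key", knownKeys)
  .agg(last("value"))
  .drop("id")
  .na.fill("")  // optional: replace the nulls produced by missing keys with empty strings

df2.show(false)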
