How to populate a Map[String, DataFrame] as a column in a DataFrame in Scala
I have a Map[String, DataFrame]. I want to combine all the data inside that Map into a single DataFrame. Can a DataFrame have a column of Map datatype?
def sample(dfs: Map[String, DataFrame]): DataFrame = {
  .........
}
Example:

DF1
id    name  age
1     aaa   23
2     bbb   34

DF2
game   time  score
ludo   10    20
rummy  30    40
I pass the above two DFs as a Map to the function. The data of each DataFrame should then go into a single column of the output DataFrame, in JSON format.
out DF

+-------------------------------------------------------------------------------------+
| column1                                                                             |
+-------------------------------------------------------------------------------------+
| [{"id":"1","name":"aaa","age":"23"},{"id":"2","name":"bbb","age":"34"}]             |
| [{"game":"ludo","time":"10","score":"20"},{"game":"rummy","time":"30","score":"40"}]|
+-------------------------------------------------------------------------------------+
Here is a solution specific to your use-case:
import org.apache.spark.sql._

def sample(dfs: Map[String, DataFrame])(implicit spark: SparkSession): DataFrame =
  dfs.values
    .reduceOption(_ union _)
    .getOrElse(spark.emptyDataFrame)

The Spark session is only needed to return an empty DataFrame when the map itself is empty. (Folding from spark.emptyDataFrame as the accumulator would fail, because union requires both sides to have the same number of columns; reduceOption sidesteps that. Note this approach assumes all the DataFrames share a schema.)
Alternatively, if you can guarantee the Map is non-empty:
def sample(dfs: Map[String, DataFrame]): DataFrame =
  dfs.values
    .reduce(_ union _)
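As an aside, the difference between the two variants is the usual fold-vs-reduce trade-off, which can be illustrated with plain Scala collections (illustrative only, no Spark needed):

```scala
val xs = List(1, 2, 3)

// reduce uses the first element as the seed, so it needs a non-empty collection.
val total = xs.reduce(_ + _)

// On an empty collection reduce throws; reduceOption returns None instead.
val empty = List.empty[Int].reduceOption(_ + _)

// foldLeft takes an explicit seed, so it tolerates empty input.
val folded = List.empty[Int].foldLeft(0)(_ + _)

println((total, empty, folded))
// (6,None,0)
```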
You are asking to generate one row per DataFrame. Be careful: if one of the DataFrames is so large that it cannot be held in a single executor, this code will break.
Let's first generate the data and the map dfs of type Map[String, DataFrame].
import spark.implicits._

val df1 = Seq((1, "aaa", 23), (2, "bbb", 34)).toDF("id", "name", "age")
val df2 = Seq(("ludo", 10, 20), ("rummy", 10, 40)).toDF("game", "time", "score")
val dfs = Map("df0" -> df1, "df1" -> df2)
Then, for each DataFrame of the map, we generate two columns. big_map associates each column name of the DataFrame to its value (cast to string so the map has a consistent value type). df simply contains the name of the DataFrame. We then union all the DataFrames with reduce and group by df (that's the part where every single DataFrame ends up entirely in one row, and therefore in one executor).
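The key/value interleaving that Spark's map(...) function expects (key1, value1, key2, value2, …) can be sketched with plain Scala collections (illustrative only, no Spark needed; the names below are stand-ins for df.columns and a single row):

```scala
// Stand-ins for df.columns and one row's values, everything as strings.
val columns = Seq("id", "name", "age")
val row = Map("id" -> "1", "name" -> "aaa", "age" -> "23")

// Mirrors df.columns.flatMap(c => Seq(lit(c), col(c).cast("string"))):
// each column contributes its name followed by its value.
val interleaved = columns.flatMap(c => Seq(c, row(c)))

println(interleaved)
// List(id, 1, name, aaa, age, 23)
```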
import org.apache.spark.sql.functions._

dfs
  .toSeq
  .map { case (name, df) =>
    df
      .select(map(
        df.columns.flatMap(c => Seq(lit(c), col(c).cast("string"))): _*
      ) as "big_map")
      .withColumn("df", lit(name))
  }
  .reduce(_ union _)
  .groupBy("df")
  .agg(collect_list(col("big_map")) as "column1")
  .show(false)
+---+-----------------------------------------------------------------------------------+
|df |column1 |
+---+-----------------------------------------------------------------------------------+
|df0|[{id -> 1, name -> aaa, age -> 23}, {id -> 2, name -> bbb, age -> 34}] |
|df1|[{game -> ludo, time -> 10, score -> 20}, {game -> rummy, time -> 10, score -> 40}]|
+---+-----------------------------------------------------------------------------------+
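Note that the output above contains Spark map values rather than the JSON strings shown in the question's expected output (in Spark itself you would typically reach for to_json over a struct of the columns for that). The per-row rendering can be sketched with plain Scala, using a hypothetical stand-in for DF1's collected rows:

```scala
// Hypothetical stand-in for DF1's rows: each row as an insertion-ordered
// list of (column, value) pairs, with every value cast to string.
val df1Rows: Seq[Seq[(String, String)]] = Seq(
  Seq("id" -> "1", "name" -> "aaa", "age" -> "23"),
  Seq("id" -> "2", "name" -> "bbb", "age" -> "34")
)

// Render one row as a JSON object, then the whole DataFrame as a JSON array.
def rowToJson(row: Seq[(String, String)]): String =
  row.map { case (k, v) => s""""$k":"$v"""" }.mkString("{", ",", "}")

val column1 = df1Rows.map(rowToJson).mkString("[", ",", "]")

println(column1)
// [{"id":"1","name":"aaa","age":"23"},{"id":"2","name":"bbb","age":"34"}]
```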