How to populate a Map[String, DataFrame] as a column in a DataFrame in Scala
I have a Map[String, DataFrame]. I want to combine all the data inside that Map into a single DataFrame. Can a DataFrame have a column of Map datatype?
def sample(dfs: Map[String, DataFrame]): DataFrame = {
  .........
}
Example:

DF1
id    name  age
1     aaa   23
2     bbb   34

DF2
game   time  score
ludo   10    20
rummy  30    40
I pass the above two DFs as a Map to the function. The data of each DataFrame should then go into a single column of the output DataFrame, in JSON format.
out DF

+-------------------------------------------------------------------------------------+
| column1                                                                             |
+-------------------------------------------------------------------------------------+
| [{"id":"1","name":"aaa","age":"23"},{"id":"2","name":"bbb","age":"34"}]             |
| [{"game":"ludo","time":"10","score":"20"},{"game":"rummy","time":"30","score":"40"}]|
+-------------------------------------------------------------------------------------+
Here is a solution specific to your use-case:
import org.apache.spark.sql._

def sample(dfs: Map[String, DataFrame])(implicit spark: SparkSession): DataFrame =
  dfs.values
    .reduceOption(_ union _)
    .getOrElse(spark.emptyDataFrame)

The Spark session is only needed to return an empty DataFrame when the map itself is empty. (Folding from spark.emptyDataFrame as the accumulator would fail, because union requires both sides to have the same number of columns; reduceOption sidesteps that. Note this approach assumes all the DataFrames share a schema.)
Alternatively, if you can guarantee the Map is non-empty:
def sample(dfs: Map[String, DataFrame]): DataFrame =
  dfs.values
    .reduce(_ union _)
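As an aside, the difference between the two variants is the usual fold-vs-reduce trade-off, which can be illustrated with plain Scala collections (illustrative only, no Spark needed):

```scala
val xs = List(1, 2, 3)

// reduce uses the first element as the seed, so it needs a non-empty collection.
val total = xs.reduce(_ + _)

// On an empty collection reduce throws; reduceOption returns None instead.
val empty = List.empty[Int].reduceOption(_ + _)

// foldLeft takes an explicit seed, so it tolerates empty input.
val folded = List.empty[Int].foldLeft(0)(_ + _)

println((total, empty, folded))
// (6,None,0)
```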
You are asking to generate one row per DataFrame. Be careful: if one of the DataFrames is so large that it cannot be held in a single executor, this code will break.
Let's first generate the data and the map dfs of type Map[String, DataFrame].
import spark.implicits._

val df1 = Seq((1, "aaa", 23), (2, "bbb", 34)).toDF("id", "name", "age")
val df2 = Seq(("ludo", 10, 20), ("rummy", 10, 40)).toDF("game", "time", "score")
val dfs = Map("df0" -> df1, "df1" -> df2)
Then, for each DataFrame of the map, we generate two columns. big_map associates each column name of the DataFrame to its value (cast to string so the map has a consistent value type). df simply contains the name of the DataFrame. We then union all the DataFrames with reduce and group by df (that's the part where every single DataFrame ends up entirely in one row, and therefore in one executor).
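The key/value interleaving that Spark's map(...) function expects (key1, value1, key2, value2, …) can be sketched with plain Scala collections (illustrative only, no Spark needed; the names below are stand-ins for df.columns and a single row):

```scala
// Stand-ins for df.columns and one row's values, everything as strings.
val columns = Seq("id", "name", "age")
val row = Map("id" -> "1", "name" -> "aaa", "age" -> "23")

// Mirrors df.columns.flatMap(c => Seq(lit(c), col(c).cast("string"))):
// each column contributes its name followed by its value.
val interleaved = columns.flatMap(c => Seq(c, row(c)))

println(interleaved)
// List(id, 1, name, aaa, age, 23)
```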
import org.apache.spark.sql.functions._

dfs
  .toSeq
  .map { case (name, df) =>
    df
      .select(map(
        df.columns.flatMap(c => Seq(lit(c), col(c).cast("string"))): _*
      ) as "big_map")
      .withColumn("df", lit(name))
  }
  .reduce(_ union _)
  .groupBy("df")
  .agg(collect_list(col("big_map")) as "column1")
  .show(false)
+---+-----------------------------------------------------------------------------------+
|df |column1 |
+---+-----------------------------------------------------------------------------------+
|df0|[{id -> 1, name -> aaa, age -> 23}, {id -> 2, name -> bbb, age -> 34}] |
|df1|[{game -> ludo, time -> 10, score -> 20}, {game -> rummy, time -> 10, score -> 40}]|
+---+-----------------------------------------------------------------------------------+
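Note that the output above contains Spark map values rather than the JSON strings shown in the question's expected output (in Spark itself you would typically reach for to_json over a struct of the columns for that). The per-row rendering can be sketched with plain Scala, using a hypothetical stand-in for DF1's collected rows:

```scala
// Hypothetical stand-in for DF1's rows: each row as an insertion-ordered
// list of (column, value) pairs, with every value cast to string.
val df1Rows: Seq[Seq[(String, String)]] = Seq(
  Seq("id" -> "1", "name" -> "aaa", "age" -> "23"),
  Seq("id" -> "2", "name" -> "bbb", "age" -> "34")
)

// Render one row as a JSON object, then the whole DataFrame as a JSON array.
def rowToJson(row: Seq[(String, String)]): String =
  row.map { case (k, v) => s""""$k":"$v"""" }.mkString("{", ",", "}")

val column1 = df1Rows.map(rowToJson).mkString("[", ",", "]")

println(column1)
// [{"id":"1","name":"aaa","age":"23"},{"id":"2","name":"bbb","age":"34"}]
```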