
How to populate Map[String, DataFrame] as a column in a DataFrame in Scala

I have a Map[String, DataFrame]. I want to combine all the data inside that Map into a single DataFrame. Can a DataFrame have a column of Map datatype?

def sample(dfs: Map[String, DataFrame]): DataFrame =
{
.........
}
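To the side question: yes, a DataFrame column can be of Spark's MapType. A minimal sketch, assuming a hypothetical local SparkSession just for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, map}

// Hypothetical local session for illustration only
val spark = SparkSession.builder.master("local[1]").appName("map-column").getOrCreate()
import spark.implicits._

// A column built with functions.map has MapType in the schema
val df = Seq(("aaa", 23)).toDF("name", "age")
  .withColumn("as_map",
    map(lit("name"), $"name",
        lit("age"),  $"age".cast("string")))

df.printSchema()
```

All values passed to `map` must share one type, hence the cast of `age` to string.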

Example:

DF1

id name age
1  aaa  23
2  bbb  34

DF2

game  time  score
ludo  10    20
rummy 30    40 

I pass the above two DFs as a Map to the function. Then put the data of each DataFrame into a single column of the output DataFrame, in JSON format.

out DF

+--------------------------------------------------------------------------------------+
| column1                                                                              |
+--------------------------------------------------------------------------------------+
| [{"id":"1","name":"aaa","age":"23"},{"id":"2","name":"bbb","age":"34"}]              |
| [{"game":"ludo","time":"10","score":"20"},{"game":"rummy","time":"30","score":"40"}] |
+--------------------------------------------------------------------------------------+
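For a single DataFrame, Spark's built-in toJSON already produces one JSON object per row; joining the collected strings gives the array shape shown above. A sketch, assuming a hypothetical local session (note that numeric fields come out unquoted, unlike the all-string example above):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session for illustration only
val spark = SparkSession.builder.master("local[1]").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "aaa", 23), (2, "bbb", 34)).toDF("id", "name", "age")

// One JSON object per row, collected to the driver and joined into an array
val asJsonArray = "[" + df1.toJSON.collect().mkString(",") + "]"
```

Collecting to the driver only works while each DataFrame fits in driver memory.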

Here is a solution specific to your use-case:

import org.apache.spark.sql._

def sample(dfs : Map[String, DataFrame])(implicit spark: SparkSession): DataFrame =
  dfs
    .values
    // Note: union requires all DataFrames to share one schema; folding onto
    // the zero-column emptyDataFrame will itself fail the schema check, so
    // when the Map may be empty prefer:
    //   dfs.values.reduceOption(_ union _).getOrElse(spark.emptyDataFrame)
    .foldLeft(spark.emptyDataFrame)((acc, df) => acc.union(df))

The Spark session is required to create the empty DataFrame accumulator to fold on.
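The behavioural difference between the two variants only shows up for the empty case. A plain-Scala analogy (hypothetical stand-ins: List[Int] plays the role of DataFrame and ++ the role of union):

```scala
// Hypothetical stand-in: List[Int] for DataFrame, ++ for union
val dfs: Map[String, List[Int]] = Map("df0" -> List(1, 2), "df1" -> List(3))

// foldLeft starts from an empty accumulator (like spark.emptyDataFrame),
// so it also works when the map is empty
val folded = dfs.values.foldLeft(List.empty[Int])((acc, xs) => acc ++ xs)

// reduce needs at least one element; on an empty map it throws
// java.lang.UnsupportedOperationException
val reduced = dfs.values.reduce((acc, xs) => acc ++ xs)

// The empty case: foldLeft returns the accumulator unchanged
val onEmpty = Map.empty[String, List[Int]].values
  .foldLeft(List.empty[Int])((acc, xs) => acc ++ xs)
```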

Alternatively, if you can guarantee the Map is non-empty:

def sample(dfs : Map[String, DataFrame]): DataFrame =
  dfs
    .values
    .reduce((acc, df) => acc.union(df))

You are asking to generate one row per DataFrame. Be careful: if one of the DataFrames is large enough that it cannot be contained in one single executor, this code will break.

Let's first generate the data and the map dfs of type Map[String, DataFrame].

import spark.implicits._ // for toDF on Seq

val df1 = Seq((1, "aaa", 23), (2, "bbb", 34)).toDF("id", "name", "age")
val df2 = Seq(("ludo", 10, 20), ("rummy", 10, 40)).toDF("game", "time", "score")
val dfs = Map("df0" -> df1, "df1" -> df2)

Then, for each DataFrame of the map, we generate two columns: big_map associates each column name of the DataFrame to its value (cast to string to have a consistent type), and df simply contains the name of the DataFrame. We then union all the DataFrames with reduce and group by name (that's the part where every single DataFrame ends up entirely in one row, and therefore on one executor).

import org.apache.spark.sql.functions._
import spark.implicits._ // for the 'big_map Symbol syntax

dfs
    .toSeq
    .map{ case (name, df) => df
        .select(map(
             df.columns.flatMap(c => Seq(lit(c), col(c).cast("string"))) : _*
        ) as "big_map")
        .withColumn("df", lit(name))}
    .reduce(_ union _)
    .groupBy("df")
    .agg(collect_list('big_map) as "column1")
    .show(false)
+---+-----------------------------------------------------------------------------------+
|df |column1                                                                            |
+---+-----------------------------------------------------------------------------------+
|df0|[{id -> 1, name -> aaa, age -> 23}, {id -> 2, name -> bbb, age -> 34}]             |
|df1|[{game -> ludo, time -> 10, score -> 20}, {game -> rummy, time -> 10, score -> 40}]|
+---+-----------------------------------------------------------------------------------+
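If, as in the question, column1 should be a JSON string rather than a Spark array of maps, the same pipeline can finish with to_json over the collected maps (to_json accepts arrays of maps since Spark 2.4). A sketch under the same assumptions, with a hypothetical local session:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, lit, map, to_json}

// Hypothetical local session for illustration only
val spark = SparkSession.builder.master("local[1]").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "aaa", 23), (2, "bbb", 34)).toDF("id", "name", "age")
val df2 = Seq(("ludo", 10, 20), ("rummy", 10, 40)).toDF("game", "time", "score")
val dfs = Map("df0" -> df1, "df1" -> df2)

val out = dfs
  .toSeq
  .map { case (name, df) =>
    df.select(map(df.columns.flatMap(c => Seq(lit(c), col(c).cast("string"))): _*) as "big_map")
      .withColumn("df", lit(name))
  }
  .reduce(_ union _)
  .groupBy("df")
  // to_json renders the collected maps as one JSON array string per group
  .agg(to_json(collect_list(col("big_map"))) as "column1")

out.show(false)
```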

