
Process multiple dataframes in parallel Scala

I am a newbie to Scala/Spark. I have a dataframe like the one below that I need to split into different chunks based on a group ID and process independently, in parallel.

+----+-------+-----+-------+
|user|feature|value|groupID|
+----+-------+-----+-------+
|   1|    100|    1|      A|
|   2|    20B|    0|      B|
|   3|    30A|    1|      B|
|   4|    40A|    1|      B| 
|   5|    50A|    1|      A|
|   6|    10A|    0|      B|
|   7|    200|    1|      A|
|   8|    30B|    1|      B|
|   9|    400|    0|      A|
|  10|    50C|    0|      A|
+----+-------+-----+-------+

Step 1: I need to split it into two different dataframes like the ones below; I can use a filter for this. But I am not sure whether, given the large number of dataframes this will produce, I should save them to ADLS as Parquet files or keep them in memory (see the sketch after these two tables).

+----+-------+-----+-------+
|user|feature|value|groupID|
+----+-------+-----+-------+
|   1|    100|    1|      A|
|   5|    50A|    1|      A|
|   7|    200|    1|      A|
|   9|    400|    0|      A|
|  10|    50C|    0|      A|
+----+-------+-----+-------+

+----+-------+-----+-------+
|user|feature|value|groupID|
+----+-------+-----+-------+
|   2|    20B|    0|      B|
|   3|    30A|    1|      B|
|   4|    40A|    1|      B| 
|   6|    10A|    0|      B|
|   8|    30B|    1|      B|
+----+-------+-----+-------+
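On the Parquet-versus-memory question: if the goal is simply to persist each group separately, one alternative is to write the dataframe once, partitioned by the group column, rather than materialising one dataframe per group. A minimal sketch, assuming a hypothetical ADLS path (df and spark come from the session, as in the code further down):

import org.apache.spark.sql.DataFrame
import spark.implicits._ // already in scope in spark-shell

val adlsPath = "abfss://container@account.dfs.core.windows.net/groups" // hypothetical path

// One write job produces one sub-directory per group (groupID=A/, groupID=B/, ...),
// with no per-group dataframes held on the driver.
df.write
  .mode("overwrite")
  .partitionBy("groupID")
  .parquet(adlsPath)

// Any single group can be read back later; Spark prunes to that partition only.
val groupA: DataFrame = spark.read.parquet(adlsPath).filter($"groupID" === "A")

This sidesteps the dynamic-number-of-dataframes problem entirely when the per-group processing can itself be expressed as a read-filter-process job.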

Step 2: Process each dataframe independently, in parallel, and get independently processed dataframes back.

To give some context:

  • The number of groupIds will be high, so they cannot be hardcoded.

  • The processing of each dataframe would ideally happen in parallel.

I am asking for a brief idea of how to proceed: I have seen .par.foreach, but it is not clear to me how to apply it to a dynamic number of dataframes, how to store them independently, or whether it is the most efficient way.

Check the code below.

scala> df.show(false)
+----+-------+-----+-------+
|user|feature|value|groupID|
+----+-------+-----+-------+
|1   |100    |1    |A      |
|2   |20B    |0    |B      |
|3   |30A    |1    |B      |
|4   |40A    |1    |B      |
|5   |50A    |1    |A      |
|6   |10A    |0    |B      |
|7   |200    |1    |A      |
|8   |30B    |1    |B      |
|9   |400    |0    |A      |
|10  |50C    |0    |A      |
+----+-------+-----+-------+

Get the distinct groupID values from the dataframe.

scala> val groupIds = df.select($"groupID").distinct.as[String].collect // Get distinct group ids.
groupIds: Array[String] = Array(B, A)

Use .par for parallel processing. You need to add your logic inside map.

scala> groupIds.par.map(groupid => df.filter($"groupID" === lit(groupid))).foreach(_.show(false)) // Add your logic to save (or otherwise process) each dataframe inside the map function, not foreach; as an example, foreach here just shows each dataframe's content.
+----+-------+-----+-------+
|user|feature|value|groupID|
+----+-------+-----+-------+
|2   |20B    |0    |B      |
|3   |30A    |1    |B      |
|4   |40A    |1    |B      |
|6   |10A    |0    |B      |
|8   |30B    |1    |B      |
+----+-------+-----+-------+

+----+-------+-----+-------+
|user|feature|value|groupID|
+----+-------+-----+-------+
|1   |100    |1    |A      |
|5   |50A    |1    |A      |
|7   |200    |1    |A      |
|9   |400    |0    |A      |
|10  |50C    |0    |A      |
+----+-------+-----+-------+
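For example, to save each group instead of just showing it, the write goes inside map. A minimal sketch, assuming a hypothetical output base path (on Scala 2.13+, .par additionally needs import scala.collection.parallel.CollectionConverters._):

import org.apache.spark.sql.SaveMode

val basePath = "abfss://container@account.dfs.core.windows.net/processed" // hypothetical path

df.cache() // avoid recomputing the source dataframe for every group

val written = groupIds.par.map { groupid =>
  val out = s"$basePath/groupID=$groupid"
  df.filter($"groupID" === groupid)
    .write
    .mode(SaveMode.Overwrite)
    .parquet(out)
  out // map returns the path written for each group
}

Note that .par runs the loop on the JVM's default ForkJoinPool (sized to the number of cores), so several Spark jobs are submitted concurrently and the Spark scheduler interleaves them.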
