
Avoiding multiple joins in a specific case with multiple columns with the same domain in dataframes of apache spark sql

I was asked to do something in Apache Spark SQL (Java API), through DataFrames, that I think would be really expensive if done the naive way (I'm still working on the naive approach, but I believe it would cost a lot since it would need at least 4 sorts of joins).

I have the following DataFrame:

+----+----+----+----+----+----------+------+
|  C1|  C2|  C3|  C4|  C5|UNIQUE KEY|points|
+----+----+----+----+----+----------+------+
|   A|   A|null|null|null|      1234|     2|
|   A|null|null|   H|null|      1235|     3|
|   A|   B|null|null|null|      1236|     3|
|   B|null|null|null|   E|      1237|     1|
|   C|null|null|   G|null|      1238|     1|
|   F|null|   C|   E|null|      1239|     2|
|null|null|   D|   E|   G|      1240|     1|
+----+----+----+----+----+----------+------+

C1, C2, C3, C4 and C5 have the same domain of values, UNIQUE KEY is a unique key, and points is an integer that should be counted only once per distinct value among the C columns of a row (e.g., the first row A,A,null,null,null,key,2 is equivalent to A,null,null,null,null,key,2 or A,A,A,A,null,key,2).

I was asked to "for each existing C value, get the total number of points".

So the output should be:

+----+------+
|  C1|points|
+----+------+
|   A|     8|
|   B|     4|
|   C|     3|
|   D|     1|
|   E|     4|
|   F|     2| 
|   G|     2|
|   H|     3|
+----+------+

I was going to split the DataFrame into multiple small ones (one column for a C column and one column for the points) through simple .select("C1","points"), .select("C2","points") and so on, roughly as sketched below. But I believe that would really cost a lot if the amount of data is big; I believe there should be some sort of trick through map-reduce, but I couldn't find one myself since I'm still new to all this. I think I'm missing some concepts on how to apply map-reduce.
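Roughly, what I had in mind would look something like this in the Java API (just a sketch, using the column names from the example above; the duplicate drop is there so a value appearing in several C columns of the same row is only counted once):

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Project each C column together with the unique key and the points,
// stack the pieces with union, then aggregate.
Dataset<Row> stacked = null;
for (String c : new String[]{"C1", "C2", "C3", "C4", "C5"}) {
    Dataset<Row> piece = df.select(col(c).as("C"), col("UNIQUE KEY").as("key"), col("points"));
    stacked = (stacked == null) ? piece : stacked.union(piece);
}

Dataset<Row> naive = stacked
    .filter(col("C").isNotNull())
    .dropDuplicates("C", "key")          // count each value only once per row
    .groupBy(col("C"))
    .agg(sum(col("points")).as("points"));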

I also thought about using the function explode: putting [C1, C2, C3, C4, C5] together in an array column, then using explode so I get 5 rows for each original row, and then just grouping by key... but I believe this would increase the amount of data at some point, and if we are talking about GBs this may not be feasible... I hope you can find the trick that I'm looking for.

Thanks for your time.

Using explode would probably be the way to go here. It won't increase the amount of data, and it will be a lot more computationally efficient than using multiple joins (note that a single join by itself is an expensive operation).

In this case, you can convert the columns to an array, keeping only the unique values within each row. This array can then be exploded and all nulls filtered away. At that point, a simple groupBy and sum will give you the wanted result.

In Scala:

import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax

// note: array_distinct requires Spark 2.4+
df.select(explode(array_distinct(array("C1", "C2", "C3", "C4", "C5"))).as("C1"), $"points")
  .filter($"C1".isNotNull)
  .groupBy($"C1")
  .agg(sum($"points").as("points"))
  .sort($"C1") // not really necessary

This will give you the wanted result:

+----+------+
|  C1|points|
+----+------+
|   A|     8|
|   B|     4|
|   C|     3|
|   D|     1|
|   E|     4|
|   F|     2| 
|   G|     2|
|   H|     3|
+----+------+
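Since the question mentions the Java API, roughly the same thing expressed through the Java DataFrame API might look like the following (a sketch, again assuming Spark 2.4 or later so that array_distinct is available):

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Collapse the C columns into one array per row, keep only distinct values,
// explode into one row per value, drop nulls, then sum the points per value.
Dataset<Row> result = df
    .select(
        explode(array_distinct(array(col("C1"), col("C2"), col("C3"), col("C4"), col("C5")))).as("C1"),
        col("points"))
    .filter(col("C1").isNotNull())
    .groupBy(col("C1"))
    .agg(sum(col("points")).as("points"))
    .sort(col("C1")); // not really necessary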
