Spark (Scala): how to access a specific row in a DataFrame by "key" and modify it

Question

I have two DataFrames. One looks like this:

+------------------------------------------------------------+
|docs                                                        |
+------------------------------------------------------------+
|{doc1.txt -> 1, doc2.txt -> 3, doc3.txt -> 5, doc4.txt -> 1}|
|{doc1.txt -> 2, doc2.txt -> 2, doc3.txt -> 4}               |
|{doc1.txt -> 3, doc2.txt -> 2, doc4.txt -> 2}               |
+------------------------------------------------------------+

and the other like this:

+--------------+----------+
|      Document|doc_length|
+--------------+----------+
|      doc1.txt|         0|
|      doc2.txt|         0|
|      doc3.txt|         0|
|      doc3.txt|         0|
|      doc4.txt|         0|
+--------------+----------+

In this example the documents are sorted, but in my use case I cannot expect them to be sorted.

Now I want to iterate over the first DataFrame and update the values in the second one as I go. I have a loop like this:

df1.foreach(r =>
      for (keyValPair <- r(0).asInstanceOf[Map[String, Long]]) {
        // Something needs to happen here
      } )

In every iteration I want to take the key of the key-value pair, use it to select a specific row in the second DataFrame, and then add the value to doc_length.

EDIT: Later I may want to do other, more complex math here than just adding up all the values, which is why I tried to use the structure above.

My final output of df2.show() would look like this:

+--------------+----------+
|      Document|doc_length|
+--------------+----------+
|      doc1.txt|         6|
|      doc2.txt|         7|
|      doc3.txt|         9|
|      doc4.txt|         3|
+--------------+----------+

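This seems like it shouldn't be too hard, but I can't figure out how to access specific rows of a DataFrame by using a particular column as the key and then change them.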

Answer 1

You can explode the map column and group by key to sum up the lengths:

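import org.apache.spark.sql.functions.{col, explode, sum}

// explode turns each map entry into a row with the columns "key" and "value"
val df2 = df1.select(explode(col("docs")))
    .groupBy(col("key").as("document"))
    .agg(sum("value").as("doc_length"))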

df2.show
+--------+----------+
|document|doc_length|
+--------+----------+
|doc1.txt|         6|
|doc4.txt|         3|
|doc3.txt|         9|
|doc2.txt|         7|
+--------+----------+
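
DataFrames are immutable, so rows of the second DataFrame cannot be changed in place from inside a foreach. If the existing Document/doc_length table has to be kept (including documents that never appear in any map), one possible approach is to left-join the aggregated totals onto it. The following is only a sketch under assumptions: the names DocLengths, updateLengths, totals and total are made up for illustration, doc5.txt is a hypothetical document with no counts, and a local SparkSession is used just to make the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, explode, lit, sum}

object DocLengths {
  // Hypothetical helper: returns a new DataFrame in which doc_length has been
  // increased by the summed map values for each Document.
  def updateLengths(df1: DataFrame, df2: DataFrame): DataFrame = {
    // One row per map entry (columns "key" and "value"), then one total per key
    val totals = df1.select(explode(col("docs")))
      .groupBy(col("key"))
      .agg(sum("value").as("total"))

    // The left join keeps every row of df2; documents without a total get 0 added
    df2.join(totals, df2("Document") === totals("key"), "left")
      .select(
        col("Document"),
        (col("doc_length") + coalesce(col("total"), lit(0L))).as("doc_length"))
  }

  def main(args: Array[String]): Unit = {
    // Local session, for illustration only
    val spark = SparkSession.builder()
      .master("local[*]").appName("doc-lengths").getOrCreate()
    import spark.implicits._

    // The map column from the question
    val df1 = Seq(
      Map("doc1.txt" -> 1L, "doc2.txt" -> 3L, "doc3.txt" -> 5L, "doc4.txt" -> 1L),
      Map("doc1.txt" -> 2L, "doc2.txt" -> 2L, "doc3.txt" -> 4L),
      Map("doc1.txt" -> 3L, "doc2.txt" -> 2L, "doc4.txt" -> 2L)
    ).toDF("docs")

    // The lookup table from the question, plus the hypothetical doc5.txt
    val df2 = Seq("doc1.txt", "doc2.txt", "doc3.txt", "doc4.txt", "doc5.txt")
      .map(d => (d, 0L))
      .toDF("Document", "doc_length")

    updateLengths(df1, df2).orderBy("Document").show()
    spark.stop()
  }
}
```

If more than a plain sum is needed later, sum can be swapped for another aggregate function from org.apache.spark.sql.functions (avg or max, for example), or several aggregates can be passed to agg at once; the join step stays the same.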

Solution 1: 1, accepted, 2021-06-14 12:19:59
