Is there a way to modify each grouped dataset as a whole in Spark?
I have this dataset and I would like a more flexible way to group the data and edit each group. For example, I want to remove the second Random_Text from each group of names in this dataset and concatenate the rest of the text:
Take this random dataset as an example:
+-------+-----------+
| Names|Random_Text|
+-------+-----------+
|Michael| Hello|
| Jim| Good|
| Bob| How|
|Michael| Good|
|Michael| Morning|
| Bob| Are|
| Bob| You|
| Bob| Doing|
| Jim| Bye|
+-------+-----------+
I want the dataset to end up looking like this:
+-------+-------------+
| Names| Random_Text|
+-------+-------------+
|Michael|Hello Morning|
| Jim| Good|
| Bob|How You Doing|
+-------+-------------+
I think I need to define some kind of custom UserDefinedAggregateFunction, but I can't figure out what that would look like in Java. I have looked through the documentation but couldn't find anything concrete that makes sense in Java: https://spark.apache.org/docs/3.0.2/api/java/org/apache/spark/sql/functions.html https://docs.databricks.com/udf/aggregate-scala.html
Dataset<Row> random_text = dtf.groupBy(col("Names")).apply(???)
Dataset<Row> random_text = dtf.groupBy(col("Names")).agg(???)
You can use the window function row_number to identify the second Random_Text in each group and then filter it out.
Required imports:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.*;
import static org.apache.spark.sql.functions.*;
Code:
Dataset<Row> df = // input;

df.withColumn("rn",
        // number the rows of each Names group; note that ordering by the
        // partition column itself leaves the "second" row within a group arbitrary
        row_number().over(Window.partitionBy("Names").orderBy("Names")))
    .where("rn <> 2")   // drop the second row of every group
    .groupBy("Names")
    .agg(concat_ws(" ", collect_list("Random_Text")).as("Random_Text"))
    .show();
+-------+-------------+
| Names| Random_Text|
+-------+-------------+
| Jim| Good|
|Michael|Hello Morning|
| Bob|How You Doing|
+-------+-------------+