Spark Java API: dataset manipulation?

Question

I'm new to the Spark Java API. My dataset contains two columns (account, Lib). I want to display the accounts that have more than one distinct Lib value. My dataset, ds1, looks like this:

+---------+------------+
|  account|    Lib     |
+---------+------------+
| 222222  |  bbbb      |
| 222222  |  bbbb      |
| 222222  |  bbbb      |
|         |            |
| 333333  |  aaaa      |
| 333333  |  bbbb      |
| 333333  |  cccc      |
|         |            |
| 444444  |  dddd      |
| 444444  |  dddd      |
| 444444  |  dddd      |
|         |            |
| 555555  |  vvvv      |
| 555555  |  hhhh      |
| 555555  |  vvvv      |
+---------+------------+

I want to get ds2 like this:

+---------+------------+
|  account|    Lib     |
+---------+------------+
|         |            |
| 333333  |  aaaa      |
| 333333  |  bbbb      |
| 333333  |  cccc      |
|         |            |
| 555555  |  vvvv      |
| 555555  |  hhhh      |
+---------+------------+

Answer 1

If the groups are small, you can use window functions (Scala):

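import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Count the distinct Lib values per account without collapsing the rows.
// approx_count_distinct is used because countDistinct is not supported over a window.
df
  .withColumn("cnt", approx_count_distinct("Lib").over(Window.partitionBy("account")))
  .where(col("cnt") > 1)
  .drop("cnt") // drop the helper column so only account and Lib remain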

If the groups are large, aggregate first and join back:

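// Left semi join: keep only the rows of df whose account appears
// in the aggregated side (accounts with more than one distinct Lib).
df.join(
  df
    .groupBy("account")
    .agg(countDistinct("Lib").alias("cnt"))
    .where(col("cnt") > 1),
  Seq("account"),
  "leftsemi"
)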
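
Both snippets above keep every matching row, so the repeated (555555, vvvv) row from ds1 would still appear twice, whereas the expected ds2 lists it once. In Spark a trailing `.distinct()` on the result should take care of that. To check the intended logic against the sample data without a cluster, here is a minimal plain-Java sketch of the same steps (group by account, count distinct Lib, semi-join filter, deduplicate). It is not Spark code; the class `DistinctLibFilter` and the method `accountsWithSeveralLibs` are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DistinctLibFilter {

    // Hypothetical helper: each row is a two-element list [account, lib].
    public static List<List<String>> accountsWithSeveralLibs(List<List<String>> rows) {
        // Step 1 (groupBy + countDistinct): collect the distinct Lib values per account.
        Map<String, Set<String>> libsByAccount = new HashMap<>();
        for (List<String> row : rows) {
            libsByAccount
                .computeIfAbsent(row.get(0), k -> new HashSet<>())
                .add(row.get(1));
        }

        // Step 2 (semi join + distinct): keep rows whose account has more than
        // one distinct Lib; the LinkedHashSet drops repeated rows and keeps order.
        Set<List<String>> kept = new LinkedHashSet<>();
        for (List<String> row : rows) {
            if (libsByAccount.get(row.get(0)).size() > 1) {
                kept.add(row);
            }
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        // Sample data copied from ds1 in the question (separator rows omitted).
        List<List<String>> ds1 = Arrays.asList(
            Arrays.asList("222222", "bbbb"),
            Arrays.asList("222222", "bbbb"),
            Arrays.asList("222222", "bbbb"),
            Arrays.asList("333333", "aaaa"),
            Arrays.asList("333333", "bbbb"),
            Arrays.asList("333333", "cccc"),
            Arrays.asList("444444", "dddd"),
            Arrays.asList("444444", "dddd"),
            Arrays.asList("444444", "dddd"),
            Arrays.asList("555555", "vvvv"),
            Arrays.asList("555555", "hhhh"),
            Arrays.asList("555555", "vvvv"));

        for (List<String> row : accountsWithSeveralLibs(ds1)) {
            System.out.println(row.get(0) + " " + row.get(1));
        }
    }
}
```

Run on the sample data, this prints the five rows of ds2 (three for 333333, two for 555555), which suggests that the groupBy/countDistinct filter plus a deduplication step is what the question is asking for.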
