
Remove both duplicate rows

Good day, colleagues. I have a big dataset (about 237,000,000 rows) with a lot of columns. I need to delete every row whose combination of userId and VTS appears more than once, removing all copies rather than keeping one of them. For example:

userId Vts moreColumn1 moreColumn2
10     150     2           3              -delete
11     160     1           6
10     150     0           1              -delete

I am not good with SQL. I have tried different variants from the Internet, but none of them worked.

UPDATE:

Thanks for the answers! I forgot to say that I use Java. Here is my optimized Java code:

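// Needs: import static java.util.Arrays.asList; import scala.collection.JavaConversions;
// Count rows per (userId, VTS), keep the keys seen exactly once,
// then join back to recover the remaining columns.
Dataset<Row> result = viewingDataset.groupBy("userId", "VTS")
                .count()
                .where("count = 1")
                .drop("count")
                .join(viewingDataset, JavaConversions.asScalaBuffer(asList("userId", "VTS")));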

You can aggregate with count, filter the result, and join back:

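// `$"..."` needs `import spark.implicits._` (already in scope in spark-shell)
df.groupBy("userId", "Vts").count   // one row per key, with its count
  .where($"count" === 1)            // keep keys that occur exactly once
  .drop("count")
  .join(df, Seq("userId", "Vts"))   // join back to recover the other columns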

It is possible to get the same result with window functions, but that is less robust if the data is skewed and, on average, much more expensive.
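
To make the count, filter and join-back steps concrete, here is a small in-memory sketch in plain Java, without Spark. The `Row` class and the `dropAllDuplicates` helper are hypothetical names made up for this illustration, and the sample rows are the three from the question. It only illustrates the semantics and is not meant for a dataset of 237 million rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DropAllDuplicates {
    // Hypothetical row type mirroring the columns shown in the question.
    static final class Row {
        final int userId, vts, moreColumn1, moreColumn2;

        Row(int userId, int vts, int moreColumn1, int moreColumn2) {
            this.userId = userId;
            this.vts = vts;
            this.moreColumn1 = moreColumn1;
            this.moreColumn2 = moreColumn2;
        }

        // The (userId, VTS) pair that decides whether a row is a duplicate.
        List<Integer> key() {
            return Arrays.asList(userId, vts);
        }
    }

    static List<Row> dropAllDuplicates(List<Row> rows) {
        // groupBy + count: a single counter per key, no full rows stored.
        Map<List<Integer>, Integer> counts = new HashMap<>();
        for (Row r : rows) {
            counts.merge(r.key(), 1, Integer::sum);
        }
        // where count = 1, then join back: keep rows whose key occurred once.
        List<Row> result = new ArrayList<>();
        for (Row r : rows) {
            if (counts.get(r.key()) == 1) {
                result.add(r);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(10, 150, 2, 3),
                new Row(11, 160, 1, 6),
                new Row(10, 150, 0, 1));
        for (Row r : dropAllDuplicates(rows)) {
            // prints "11 160 1 6"
            System.out.println(r.userId + " " + r.vts + " " + r.moreColumn1 + " " + r.moreColumn2);
        }
    }
}
```

The first pass keeps only one integer per key, which is roughly the role the aggregation plays before the join.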

You can achieve what you want with window functions:

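import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
// `$"..."` needs `import spark.implicits._` (already in scope in spark-shell)

// Attach to every row the number of rows sharing its (userId, VTS) pair,
// keep the rows where that number is 1, then drop the helper column.
ds.withColumn("aux", count("*").over(Window.partitionBy($"userId", $"VTS")))
  .where($"aux" === 1)
  .drop("aux")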

partitionBy defines the partitions from the columns you pass as parameters (userId and VTS in your example), and count("*") over that window gives, for each row, the number of rows in its partition. The where clause then keeps only the rows from partitions where the count is 1, i.e. the unique rows.

Result of the partitionBy clause:

ds.withColumn("aux", count("*").over(Window.partitionBy($"userId", $"VTS"))).show

+-------+----+------------+------------+---+
| userId| VTS| moreColumn1| moreColumn2|aux|
+-------+----+------------+------------+---+
|     10| 150|           2|           3|  2|
|     10| 150|           0|           1|  2|
|     11| 160|           1|           6|  1|
+-------+----+------------+------------+---+

Final result:

+-------+----+------------+------------+
| userId| VTS| moreColumn1| moreColumn2|
+-------+----+------------+------------+
|     11| 160|           1|           6|
+-------+----+------------+------------+
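
The window version can be pictured the same way in plain Java, again without Spark. In this sketch rows are `int[]` arrays laid out as {userId, VTS, moreColumn1, moreColumn2}, and `keepUniqueByWindow` is a hypothetical helper name. Note that it gathers every full row of a partition before counting, which is one plausible reading of why a heavily skewed key makes this approach more expensive than keeping a counter per key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WindowCountSketch {
    // Rows are int arrays laid out as {userId, VTS, moreColumn1, moreColumn2}.
    static List<int[]> keepUniqueByWindow(List<int[]> rows) {
        // "partitionBy": collect the full rows of every (userId, VTS) partition.
        Map<List<Integer>, List<int[]>> partitions = new LinkedHashMap<>();
        for (int[] r : rows) {
            partitions.computeIfAbsent(Arrays.asList(r[0], r[1]), k -> new ArrayList<>()).add(r);
        }
        // Result order follows the first appearance of each partition.
        List<int[]> result = new ArrayList<>();
        for (List<int[]> partition : partitions.values()) {
            int aux = partition.size();   // count("*") over the partition
            if (aux == 1) {               // where aux = 1; aux itself is not kept
                result.addAll(partition);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> rows = Arrays.asList(
                new int[] {10, 150, 2, 3},
                new int[] {11, 160, 1, 6},
                new int[] {10, 150, 0, 1});
        for (int[] r : keepUniqueByWindow(rows)) {
            // prints "[11, 160, 1, 6]"
            System.out.println(Arrays.toString(r));
        }
    }
}
```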
