
How to divide dataset in two parts based on filter in Spark-scala

Is it possible to divide a DF into two parts using a single filter operation? For example,

let's say df has the records below:

UID    Col
 1       a
 2       b
 3       c

if I do

val df1 = df.filter($"UID" <=> 2)

can I save the filtered and non-filtered records in different RDDs in a single operation?

 df1 can have records where uid = 2
 df2 can have records with uid 1 and 3 

If you're interested only in saving the data, you can add an indicator column to the DataFrame:

// assumes implicits are in scope for toDF and $ (import sqlContext.implicits._ on 1.6, spark.implicits._ on 2.x+)
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")
val dfWithInd = df.withColumn("ind", $"uid" <=> 2)

and use it as a partition column for the DataFrameWriter with one of the supported formats (as of 1.6 these are Parquet, text, and JSON):

dfWithInd.write.partitionBy("ind").parquet(...)

It will create two separate directories (ind=false, ind=true) on write.
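
As a rough sketch of how those directories map back to the two subsets from the question (the output path below is hypothetical, and sqlContext is the standard Spark 1.6 entry point; in Spark 2.x+ this would be spark.read):

// Hypothetical path: assume dfWithInd.write.partitionBy("ind").parquet("/tmp/df_by_ind")
// Reading a single partition directory recovers one subset; the ind column
// itself is not included, since it only exists in the directory name.
val df1 = sqlContext.read.parquet("/tmp/df_by_ind/ind=true")   // uid = 2
val df2 = sqlContext.read.parquet("/tmp/df_by_ind/ind=false")  // uid = 1 and 3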

In general though, it is not possible to yield multiple RDDs or DataFrames from a single transformation. See How to split a RDD into two or more RDDs?
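
If you need the two subsets as separate DataFrames in the same job, the usual workaround is two complementary filters over a cached parent; a minimal sketch (not part of the original answer, and it is two transformations rather than one):

// Two complementary filters over the same cached DataFrame;
// cache() avoids recomputing or rescanning the source for each branch.
val cached = df.cache()

val df1 = cached.filter($"uid" <=> 2)      // uid = 2
val df2 = cached.filter(!($"uid" <=> 2))   // uid = 1 and 3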


 