How to divide dataset in two parts based on filter in Spark-scala
Is it possible to divide a DataFrame into two parts using a single filter operation? For example, let's say df has the below records:
UID Col
1 a
2 b
3 c
If I do
val df1 = df.filter($"uid" <=> 2)
can I save the filtered and non-filtered records in different RDDs in a single operation?
df1 can have records where uid = 2
df2 can have records with uid 1 and 3
If you're interested only in saving the data, you can add an indicator column to the DataFrame:
// toDF and $ require the SQL implicits in scope (spark.implicits._,
// or sqlContext.implicits._ on 1.6); <=> is null-safe equality
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")
val dfWithInd = df.withColumn("ind", $"uid" <=> 2)
and use it as a partition column for the DataFrameWriter with one of the supported formats (as of 1.6: Parquet, text, and JSON):
dfWithInd.write.partitionBy("ind").parquet(...)
It will create two separate directories (ind=false, ind=true) on write.
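For completeness, a minimal end-to-end sketch; the output path /tmp/split and the SparkSession name spark are assumptions, not from the answer (on 1.6 you would read through sqlContext instead):
// "/tmp/split" is a hypothetical output path
dfWithInd.write.partitionBy("ind").parquet("/tmp/split")
// each indicator value gets its own directory, so the two halves
// can be read back independently
val df1 = spark.read.parquet("/tmp/split/ind=true")   // uid = 2
val df2 = spark.read.parquet("/tmp/split/ind=false")  // uid = 1 and 3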
In general though, it is not possible to yield multiple RDDs or DataFrames from a single transformation. See How to split a RDD into two or more RDDs?
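If you need both halves as DataFrames within the same job, the usual workaround (a sketch, not part of the answer above) is simply to filter twice, caching the parent so it is not recomputed for each branch:
// cache the parent so both filters scan the same materialized data
df.cache()
val df1 = df.filter($"uid" <=> 2)     // records where uid = 2
val df2 = df.filter(!($"uid" <=> 2))  // records where uid = 1 and 3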