This is my DataFrame:
df.groupBy($"label").count.show
+-----+---------+
|label| count|
+-----+---------+
| 0.0|400000000|
| 1.0| 10000000|
+-----+---------+
I am trying to subsample the records with label == 0.0 with the following:
val r = scala.util.Random
val df2 = df.filter($"label" === 1.0 || r.nextDouble > 0.5) // keep 50% of 0.0
My output looks like this:
df2.groupBy($"label").count.show
+-----+--------+
|label| count|
+-----+--------+
| 1.0|10000000|
+-----+--------+
r.nextDouble is evaluated once on the driver when the expression is built, so it is a constant in the expression and the actual evaluation is quite different from what you mean. Depending on the sampled value it is either
scala> r.setSeed(0)
scala> $"label" === 1.0 || r.nextDouble > 0.5
res0: org.apache.spark.sql.Column = ((label = 1.0) OR true)
or
scala> r.setSeed(4096)
scala> $"label" === 1.0 || r.nextDouble > 0.5
res3: org.apache.spark.sql.Column = ((label = 1.0) OR false)
so after simplification the predicate is just:
true
(keeping all the records) or
label = 1.0
(keeping only the records with label 1.0, which is the case you observed), respectively.
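To make this concrete, here is a minimal sketch (assuming the same df in a spark-shell session) of the two queries Spark effectively runs, depending on that single driver-side draw:
import org.apache.spark.sql.functions.lit

// if r.nextDouble happened to be > 0.5 on the driver:
val keepAll  = df.filter($"label" === 1.0 || lit(true))   // equivalent to no filter at all
// if it happened to be <= 0.5:
val keepOnes = df.filter($"label" === 1.0 || lit(false))  // equivalent to df.filter($"label" === 1.0)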
To generate a random number per row you should use the corresponding SQL function:
scala> import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.functions.rand
scala> $"label" === 1.0 || rand > 0.5
res1: org.apache.spark.sql.Column = ((label = 1.0) OR (rand(3801516599083917286) > 0.5))
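For example (a minimal sketch, assuming the same df), the per-row version of the filter would be:
import org.apache.spark.sql.functions.rand

// rand(seed) is evaluated for every row, so roughly half of the 0.0 records survive
val df2 = df.filter($"label" === 1.0 || rand(42) > 0.5)
df2.groupBy($"label").count.show  // the 0.0 count should now be roughly 200000000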
though Spark already provides stratified sampling tools:
df.stat.sampleBy(
  "label",                       // column
  Map(0.0 -> 0.5, 1.0 -> 1.0),   // fractions
  42                             // seed
)
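Note that sampleBy, like the rand-based filter, samples each row independently, so the resulting fractions are approximate rather than exact. A quick sanity check (sampled is a hypothetical name):
val sampled = df.stat.sampleBy("label", Map(0.0 -> 0.5, 1.0 -> 1.0), 42)
sampled.groupBy($"label").count.show  // expect roughly 200000000 for 0.0 and all 10000000 for 1.0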