Filter Spark RDD with PySpark by column name and its numerical value
I am translating a Scala/Spark model into Python/Spark. The problem is that I have an RDD with about 1 million observations and about 33 columns. I am splitting the RDD based on a numerical threshold ('Time'). The Time variable is in numerical format (double), not POSIX.
Here is the Scala source code:
// get the time to split the data.
val splitTime = data.stat.approxQuantile("Time", Array(0.7), 0.001).head
val trainingData = data.filter(s"Time<$splitTime").cache()
val validData = data.filter(s"Time>=$splitTime").cache()
and here is my failed PySpark interpretation:
splitTime = data.approxQuantile("Time", [0.7], 0.001)
trainingData = data.filter(data["Time"] < splitTime)
validData = data.filter(data["Time"] >= splitTime)
The first line works fine. The problem starts when I try to use the threshold on the RDD. I also couldn't decode the Scala format s"...>=$..." around the condition, or its importance in the condition. Internet sources on the meaning of s"...>=$..." are vague.
approxQuantile returns either List[float] (single-column case, as here) or List[List[float]] (multi-column case), so you have to extract the values:
splitTime = data.approxQuantile("Time", [0.7], 0.001)
data.filter(data["Time"] < splitTime[0])
or
(splitTime, ) = data.approxQuantile("Time", [0.7], 0.001)
trainingData = data.filter(data["Time"] < splitTime)
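As for the s"...>=$..." part of the question: in Scala, s"Time<$splitTime" is string interpolation, i.e. it substitutes the value of splitTime into the string, producing a SQL-like condition such as "Time<42.5" that filter then parses. The closest Python equivalent is an f-string. A minimal pure-Python sketch (no Spark session needed; the value 42.5 is a made-up stand-in for the quantile) of both the list extraction and the interpolation:

```python
# approxQuantile("Time", [0.7], 0.001) on a single column returns a
# list with one quantile per requested probability -- simulate that shape:
split_result = [42.5]  # stand-in for the value Spark would return

# Extract the scalar before comparing, either by indexing...
splitTime = split_result[0]
# ...or by tuple unpacking:
(splitTime,) = split_result

# Scala's s"Time<$splitTime" interpolates splitTime into the string.
# The Python analogue is an f-string; DataFrame.filter also accepts
# such SQL expression strings, so data.filter(f"Time < {splitTime}")
# would mirror the Scala code.
condition = f"Time < {splitTime}"
print(condition)  # Time < 42.5
```

So the $ syntax carries no special filtering semantics; it only builds the condition string. In PySpark you can equally pass a Column expression, data["Time"] < splitTime, once splitTime is a scalar rather than a list.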