
Filter Spark RDD with PySpark by column name and its numerical value

I am translating a Scala/Spark model into Python/Spark. The problem is that I have an RDD with about 1 million observations and about 33 columns, and I am splitting it based on a numerical threshold ('Time'). The Time variable is in numerical format (double), not POSIX.

Here is the Scala source code:

// get the time to split the data.
val splitTime = data.stat.approxQuantile("Time", Array(0.7), 0.001).head

val trainingData = data.filter(s"Time<$splitTime").cache()
val validData = data.filter(s"Time>=$splitTime").cache()

and here is my failed PySpark interpretation:

splitTime = data.approxQuantile("Time", [0.7], 0.001)
trainingData = data.filter(data["Time"] < splitTime)
validData = data.filter(data["Time"] >= splitTime)

The first line works fine. The problem starts when I try to use the threshold on the RDD. I also couldn't decode the Scala s"...>=$..." syntax around the condition or its importance to the condition; internet sources on the meaning of s"...>=$..." are vague.

approxQuantile returns either List[float] (the single-column case, as here) or List[List[float]] (the multi-column case), so you have to extract the values:

splitTime = data.approxQuantile("Time", [0.7], 0.001)
data.filter(data["Time"] < splitTime[0])

or

(splitTime, ) = data.approxQuantile("Time", [0.7], 0.001)
trainingData = data.filter(data["Time"] < splitTime)
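
As for the s"Time<$splitTime" syntax in the Scala code: that is just Scala string interpolation. The s prefix substitutes the value of $splitTime into the string, so filter receives a SQL expression such as "Time<123.45". PySpark's DataFrame.filter also accepts a SQL expression string, so a sketch of the equivalent (using Python string formatting in place of Scala interpolation, with splitTime[0] already extracted as above) would be:

# Build the SQL expression string the same way the Scala code does,
# then pass it to filter(); filter() accepts either a Column or a SQL string.
trainingData = data.filter("Time < {0}".format(splitTime[0])).cache()
validData = data.filter("Time >= {0}".format(splitTime[0])).cache()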
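
Putting the pieces together, here is a minimal, self-contained sketch of the whole split. The DataFrame name data, the "Time" column, and the 0.7 / 0.001 parameters come from the question; the toy rows are made up purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small stand-in for the real ~1M-row, 33-column DataFrame from the question.
data = spark.createDataFrame(
    [(float(t), t * 2.0) for t in range(100)],
    ["Time", "Value"],
)

# approxQuantile returns a list, so unpack the single 0.7 quantile of "Time".
(splitTime, ) = data.approxQuantile("Time", [0.7], 0.001)

# Same split as the Scala original, cached for reuse.
trainingData = data.filter(data["Time"] < splitTime).cache()
validData = data.filter(data["Time"] >= splitTime).cache()

print(trainingData.count(), validData.count())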
