
Filter Spark RDD with PySpark by column name and its numerical value

I am translating a Scala/Spark model into Python/Spark. The problem is that I have an RDD with about 1 million observations and about 33 columns, and I am splitting it based on a numerical threshold ('Time'). The Time variable is in numerical format (double), not POSIX.

Here is the Scala source code:

// get the time to split the data.
val splitTime = data.stat.approxQuantile("Time", Array(0.7), 0.001).head

val trainingData = data.filter(s"Time<$splitTime").cache()
val validData = data.filter(s"Time>=$splitTime").cache()

and here is my failed PySpark interpretation:

splitTime = data.approxQuantile("Time", [0.7], 0.001)
trainingData = data.filter(data["Time"] < splitTime)
validData = data.filter(data["Time"] >= splitTime)

The first line works fine. The problem starts when I try to use the threshold on the RDD. I also couldn't decode the Scala s"...>=$..." syntax around the condition or its importance to the condition; internet sources on the meaning of s"...>=$..." are vague.

approxQuantile returns either List[float] (the single-column case, as here) or List[List[float]] (the multi-column case), so you have to extract the values:

splitTime = data.approxQuantile("Time", [0.7], 0.001)
data.filter(data["Time"] < splitTime[0])

or

(splitTime, ) = data.approxQuantile("Time", [0.7], 0.001)
trainingData = data.filter(data["Time"] < splitTime)
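
As for the s"Time<$splitTime" syntax in the Scala code: that is just Scala string interpolation. The s prefix substitutes the value of $splitTime into the string, so filter receives a SQL expression such as "Time<123.45". PySpark's DataFrame.filter also accepts a SQL expression string, so a sketch of the equivalent (using Python string formatting in place of Scala interpolation, with splitTime[0] already extracted as above) would be:

# Build the SQL expression string the same way the Scala code does,
# then pass it to filter(); filter() accepts either a Column or a SQL string.
trainingData = data.filter("Time < {0}".format(splitTime[0])).cache()
validData = data.filter("Time >= {0}".format(splitTime[0])).cache()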
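
Putting the pieces together, here is a minimal, self-contained sketch of the whole split. The DataFrame name data, the "Time" column, and the 0.7 / 0.001 parameters come from the question; the toy rows are made up purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small stand-in for the real ~1M-row, 33-column DataFrame from the question.
data = spark.createDataFrame(
    [(float(t), t * 2.0) for t in range(100)],
    ["Time", "Value"],
)

# approxQuantile returns a list, so unpack the single 0.7 quantile of "Time".
(splitTime, ) = data.approxQuantile("Time", [0.7], 0.001)

# Same split as the Scala original, cached for reuse.
trainingData = data.filter(data["Time"] < splitTime).cache()
validData = data.filter(data["Time"] >= splitTime).cache()

print(trainingData.count(), validData.count())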
