How to check in Python if cell value of pyspark dataframe column in UDF function is none or NaN for implementing forward fill?

I am basically trying to do a forward-fill imputation. Below is the code.

df = spark.createDataFrame([(1,1, None), (1,2, 5), (1,3, None), (1,4, None), (1,5, 10), (1,6, None)], ('session',"timestamp", "id"))

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

PRV_RANK = 0.0
def fun(rank):
    ########How to check if None or Nan?  ###############
    if rank is None or rank is NaN:
        return PRV_RANK
    else:
        PRV_RANK = rank
        return rank        

fuN = F.udf(fun, IntegerType())

df.withColumn("ffill_new", fuN(df["id"])).show()

I am getting strange errors in the logs.

Edit: The question is about how to identify null and NaN values in a Spark dataframe using Python.

Edit: I assume the line of code that checks for NaN and null is what causes this problem. Hence I have given the question this title.
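For the literal check in the title, here is a minimal plain-Python sketch of one way to test a single value for None or NaN. The helper name is_none_or_nan is only for illustration; the isinstance guard is there because math.isnan accepts only real numbers.

import math

def is_none_or_nan(value):
    # None needs an identity test; NaN needs math.isnan, which only
    # accepts floats, hence the isinstance guard.
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)

print(is_none_or_nan(None))          # True
print(is_none_or_nan(float("nan")))  # True
print(is_none_or_nan(5))             # False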

Traceback (most recent call last):
  File "", line 1, in <module>
    df_na.withColumn("ffill_new", forwardFill(df_na["id"])).show()
  File "C:\Spark\python\pyspark\sql\dataframe.py", line 318, in show
    print(self._jdf.showString(n, 20))
  File "C:\Spark\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Spark\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\Spark\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o806.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 47.0 failed 1 times, most recent failure: Lost task 0.0 in stage 47.0 (TID 83, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 174, in main
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 169, in process
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 106, in <lambda>
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 92, in <lambda>
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 70, in <lambda>
  File "", line 5, in forwardfil
UnboundLocalError: local variable 'PRV_RANK' referenced before assignment

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2386)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
    at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2788)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2385)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2392)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2128)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2127)
    at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2818)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2127)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2342)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
    at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 174, in main
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 169, in process
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 106, in <lambda>
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 92, in <lambda>
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 70, in <lambda>
  File "", line 5, in forwardfil
UnboundLocalError: local variable 'PRV_RANK' referenced before assignment

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more
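For reference, the trace above actually fails with UnboundLocalError: because fun assigns to PRV_RANK, Python treats PRV_RANK as a local variable inside the function, so the module-level value is never visible. Below is a minimal sketch of the same UDF with a global declaration and an explicit None/NaN check added; this is only an assumed fix, and DoubleType is used here because PRV_RANK starts as the float 0.0.

import math
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

PRV_RANK = 0.0

def fun(rank):
    global PRV_RANK  # without this, assigning PRV_RANK below makes it a local name
    if rank is None or (isinstance(rank, float) and math.isnan(rank)):
        return PRV_RANK
    PRV_RANK = rank
    return float(rank)

fuN = F.udf(fun, DoubleType())
df.withColumn("ffill_new", fuN(df["id"])).show()

Even with the error gone, this kind of mutable global state is not a reliable forward fill: each Python worker process keeps its own copy of PRV_RANK, and Spark does not guarantee the order in which rows reach the UDF.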

df.withColumn("ffill_new", f.UserDefinedFunction(lambda x: x or 0, IntegerType())(df["id"])).show()
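Note that the lambda above only substitutes 0 for missing ids rather than carrying the previous value forward. For a true forward fill, a sketch using a window function is shown below, assuming rows should be ordered by timestamp within each session; F.last with ignorenulls=True picks the most recent non-null id seen so far.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# All rows from the start of the session up to and including the current row.
w = (Window.partitionBy("session")
           .orderBy("timestamp")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# last(..., ignorenulls=True) returns the latest non-null id in that window.
df.withColumn("ffill_new", F.last("id", ignorenulls=True).over(w)).show()

Rows that appear before the first non-null id in a session stay null; they can be given a default with F.coalesce if needed.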

