How to check in Python whether a PySpark DataFrame column value is None or NaN inside a UDF, in order to implement forward fill?
I am basically trying to do a forward-fill imputation. Below is the code for that.
df = spark.createDataFrame([(1, 1, None), (1, 2, 5), (1, 3, None), (1, 4, None), (1, 5, 10), (1, 6, None)], ("session", "timestamp", "id"))

PRV_RANK = 0.0

def fun(rank):
    ######## How to check if None or NaN? ########
    if rank is None or rank is NaN:
        return PRV_RANK
    else:
        PRV_RANK = rank
        return rank

fuN = F.udf(fun, IntegerType())

df.withColumn("ffill_new", fuN(df["id"])).show()
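In plain Python, the check on the commented line cannot be written as `rank is NaN`: `NaN` is not a built-in name, and NaN values are not identity-comparable anyway. A minimal sketch of a helper that handles both cases (the name `is_missing` is mine, not from the original post):

```python
import math

def is_missing(value):
    """True for None or a float NaN; `value is NaN` is not valid Python."""
    if value is None:
        return True
    # NaN is the only value not equal to itself; math.isnan states that intent clearly.
    return isinstance(value, float) and math.isnan(value)
```

Inside the UDF, `if is_missing(rank): return PRV_RANK` would then replace the failing check.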
I am getting a weird error in the log.
Edit: The question is about how to identify null and NaN values in a Spark DataFrame using Python.

Edit: I am assuming the line of code that checks for NaN and null is causing the issue, so I have titled the question accordingly.
Traceback (most recent call last):
  File "", line 1, in
    df_na.withColumn("ffill_new", forwardFill(df_na["id"])).show()
  File "C:\Spark\python\pyspark\sql\dataframe.py", line 318, in show
    print(self._jdf.showString(n, 20))
  File "C:\Spark\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Spark\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\Spark\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o806.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 47.0 failed 1 times, most recent failure: Lost task 0.0 in stage 47.0 (TID 83, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 174, in main
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 169, in process
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 106, in
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 92, in
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 70, in
  File "", line 5, in forwardfil
UnboundLocalError: local variable 'PRV_RANK' referenced before assignment
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
	at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
	at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2386)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
	at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2788)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2385)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2392)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2128)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2127)
	at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2818)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2127)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2342)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
	at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 174, in main
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 169, in process
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 106, in
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 92, in
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 70, in
  File "", line 5, in forwardfil
UnboundLocalError: local variable 'PRV_RANK' referenced before assignment
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
	at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more
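The UnboundLocalError at the bottom of the trace is plain Python, independent of Spark: assigning to PRV_RANK anywhere inside the function makes the name local to the whole function body, so the earlier read fails before the assignment runs. A `global` declaration fixes that specific error; a sketch (note that even with this fix, mutable state in a Spark UDF is unreliable, because each executor process and partition gets its own copy of the variable):

```python
PRV_RANK = 0.0

def fun(rank):
    # Without this declaration, the assignment below makes PRV_RANK local
    # to fun, and the read in the None branch raises UnboundLocalError.
    global PRV_RANK
    if rank is None:
        return PRV_RANK
    PRV_RANK = rank
    return rank
```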
df.withColumn("ffill_new", f.UserDefinedFunction(lambda x: x or 0, IntegerType())(df["id"])).show()