How to check in Python if cell value of pyspark dataframe column in UDF function is none or NaN for implementing forward fill?

I am basically trying to do forward-fill imputation. Below is the code.

df = spark.createDataFrame([(1,1, None), (1,2, 5), (1,3, None), (1,4, None), (1,5, 10), (1,6, None)], ('session',"timestamp", "id"))

PRV_RANK = 0.0
def fun(rank):
    ########How to check if None or Nan?  ###############
    if rank is None or rank is NaN:
        return PRV_RANK
    else:
        PRV_RANK = rank
        return rank        

fuN= F.udf(fun, IntegerType())

df.withColumn("ffill_new", fuN(df["id"])).show()

I am getting strange errors in the logs.

Edit: The question is about how to identify null and NaN values in a Spark dataframe using Python.
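For reference, a minimal sketch of how null and NaN values are usually located directly in a PySpark DataFrame (the column name id comes from the sample data above; the cast is only there because isnan is defined for float/double columns):

from pyspark.sql import functions as F

# isNull() catches SQL NULLs; NaN only exists in float/double columns, hence the cast before isnan
df.filter(df["id"].isNull() | F.isnan(df["id"].cast("double"))).show()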

Edit: I am assuming that the line of code that checks for NaN and None is causing this issue, so I have given the question its title accordingly.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    df_na.withColumn("ffill_new", forwardFill(df_na["id"])).show()
  File "C:\Spark\python\pyspark\sql\dataframe.py", line 318, in show
    print(self._jdf.showString(n, 20))
  File "C:\Spark\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Spark\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\Spark\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o806.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 47.0 failed 1 times, most recent failure: Lost task 0.0 in stage 47.0 (TID 83, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 174, in main
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 169, in process
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 106, in <lambda>
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 92, in <lambda>
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 70, in <lambda>
  File "<stdin>", line 5, in forwardFill
UnboundLocalError: local variable 'PRV_RANK' referenced before assignment

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2386)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
    at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2788)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2385)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2392)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2128)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2127)
    at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2818)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2127)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2342)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
    at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 174, in main
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 169, in process
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 106, in <lambda>
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 92, in <lambda>
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 70, in <lambda>
  File "<stdin>", line 5, in forwardFill
UnboundLocalError: local variable 'PRV_RANK' referenced before assignment

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
    ... 1 more
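The UnboundLocalError at the bottom of the trace is raised because the function assigns to PRV_RANK without declaring it global, and the NaN test needs math.isnan rather than a bare NaN name. A minimal sketch with both points addressed follows (PRV_RANK and fun are reused from the question, fill_udf is only an illustrative name; relying on a mutable global inside a UDF is still not a dependable forward fill, since Spark does not guarantee row order or a single Python worker):

import math
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

PRV_RANK = 0.0

def fun(rank):
    global PRV_RANK  # without this, the assignment below makes PRV_RANK local and triggers UnboundLocalError
    # rank is None catches SQL NULLs; math.isnan catches floating-point NaN values
    if rank is None or (isinstance(rank, float) and math.isnan(rank)):
        return float(PRV_RANK)
    PRV_RANK = rank
    return float(rank)

fill_udf = F.udf(fun, DoubleType())  # the declared return type must match the Python values actually returned

df.withColumn("ffill_new", fill_udf(df["id"])).show()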

df.withColumn("ffill_new", f.UserDefinedFunction(lambda x: x or 0, IntegerType())(df["id"])).show()
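Note that the one-liner above substitutes 0 for missing values rather than carrying the last seen value forward. A forward fill is usually expressed without a Python UDF at all, using last with ignorenulls=True over a window; a minimal sketch against the sample data (partitioning by session and ordering by timestamp) could look like this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window covering each session from its first row up to the current row
w = (Window.partitionBy("session")
     .orderBy("timestamp")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# last(..., ignorenulls=True) picks the most recent non-null id within that window
df.withColumn("ffill_new", F.last("id", ignorenulls=True).over(w)).show()

Rows before the first non-null id stay null; wrapping the expression in F.coalesce(..., F.lit(0)) would reproduce the 0 default used in the answer above.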

