
User Defined Function breaks pyspark dataframe

My Spark version is 1.3 and I am using PySpark.

I have a large DataFrame called df.

from pyspark import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.parquetFile("events.parquet")

I then select a few columns of the DataFrame and try to count the number of rows. This works fine.

df3 = df.select("start", "end", "mrt")
print(type(df3))
print(df3.count())

I then apply a user defined function to convert one of the columns from a string to a number. This also works fine:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import LongType
CtI = UserDefinedFunction(lambda i: int(i), LongType())
df4 = df2.withColumn("mrt-2", CtI(df2.mrt))

However, if I try to count the number of rows, I get an exception, even though the type shows that it is a DataFrame just like df3.

print(type(df4))
print(df4.count())

My error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-10-53941e183807> in <module>()
      8 df4 = df2.withColumn("mrt-2", CtI(df2.mrt))
      9 print(type(df4))
---> 10 print(df4.count())
     11 df3 = df4.select("start", "end", "mrt-2").withColumnRenamed("mrt-2", "mrt")

/data/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/python/pyspark/sql/dataframe.py in count(self)
    299         2L
    300         """
--> 301         return self._jdf.count()
    302 
    303     def collect(self):

/data/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539 
    540         for temp_arg in temp_args:

/data/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o152.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1379 in stage 12.0 failed 4 times, most recent failure: Lost task 1379.3 in stage 12.0 (TID 27021, va1ccogbds01.lab.ctllabs.io): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/data/0/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/spark-assembly-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar/pyspark/worker.py", line 101, in main
    process()
  File "/data/0/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/spark-assembly-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/data/0/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/spark-assembly-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar/pyspark/serializers.py", line 236, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/data/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/python/pyspark/sql/functions.py", line 119, in <lambda>
  File "<ipython-input-10-53941e183807>", line 7, in <lambda>
TypeError: int() argument must be a string or a number, not 'NoneType'

at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:98)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:94)
at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:743)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:127)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:124)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1210)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1198)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1198)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1400)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1361)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
---------------------------------------------------------------------------

Am I using the user defined function correctly? Any idea why the DataFrame functions don't work on this DataFrame?

From the stack trace, it looks like your column contains a None value, which is breaking the int cast; you could try changing your lambda function to lambda i: int(i) if i else None to handle this situation.
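
As a minimal sketch of that suggestion (reusing the CtI, df2 and mrt-2 names from the question; only the lambda changes, and note that this maps empty strings and 0 to null as well as None):

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import LongType

# Null-safe cast: return None instead of calling int() on a missing value
CtI = UserDefinedFunction(lambda i: int(i) if i else None, LongType())
df4 = df2.withColumn("mrt-2", CtI(df2.mrt))
print(df4.count())  # should no longer fail on null mrt values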

Note that just because df2.withColumn("mrt-2", CtI(df2.mrt)) didn't throw an error doesn't mean that your code is fine: Spark uses lazy evaluation, so it won't actually try to run your code until you call an action such as count or collect.
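
As a rough illustration of that laziness, using the names from the question:

# withColumn is a transformation: it only records the new column in the query plan,
# so the UDF is not executed here and no error is raised yet
df4 = df2.withColumn("mrt-2", CtI(df2.mrt))

# count is an action: it forces the plan (and therefore the UDF) to run on the workers,
# which is where the TypeError from int(None) finally surfaces
print(df4.count())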

Are you using spark-notebook? I used to hit the same error in spark-notebook, but the same code runs well with spark-submit:

spark-submit YOURFILE.py
