
Error when existing function is used as UDF to modify a Spark Dataframe Column

I have a dataframe with a string column containing plain text, and I would like to modify this column using pyspark.sql.functions.udf (or pyspark.sql.functions.UserDefinedFunction?).

I am using Python 2.7, PySpark 1.6.1, and Flask 0.10.1 on OS X 10.11.4.

It seems to work fine when I am using a lambda expression:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@spark.route('/')
def run():
    df = ... # my dataframe
    myUDF = udf(lambda r: len(r), IntegerType())
    df = df.withColumn('new_' + column, myUDF(df[column]))
    return render_template('index.html', data=df.take(1000))

As soon as I try to move the lambda expression into a named function:

def my_function(x):
    return len(x)

@spark.route('/')
def run():
    df = ... # my dataframe
    myUDF = udf(my_function,  IntegerType())
    df = df.withColumn('new_'+column, myUDF(df[column]))
    return render_template('index.html', data=df.take(1000))

I get the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
  File "app/__init__.py", line 19, in <module>
    from app.controllers.main import main
  File "app/controllers/main/__init__.py", line 5, in <module>
    import default, source
  File "app/controllers/main/default.py", line 3, in <module>
    from app.controllers.main.source import file
  File "app/controllers/main/source/__init__.py", line 2, in <module>
    import file, online, database
  File "app/controllers/main/source/database.py", line 1, in <module>
    from app.controllers.spark import sqlContext
  File "app/controllers/spark/__init__.py", line 18, in <module>
    import default, grid #, pivot
  File "app/controllers/spark/default.py", line 2, in <module>
    from app.controllers.spark import spark, sc, sqlContext, grid as gridController
  File "app/controllers/spark/grid.py", line 14, in <module>
    from pyspark.ml import Pipeline
  File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/__init__.py", line 18, in <module>
  File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 23, in <module>
  File "/opt/spark/python/lib/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>
ImportError: No module named numpy

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:398)
    at org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:363)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

NumPy is installed. Removing the mllib imports did not solve the problem.

It normally works if you declare the whole body of my_function inside the body of the run function. Otherwise, I have not yet found a way to call an external function exactly as in your case.
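A minimal sketch of that workaround, keeping the same Flask route and reusing the df, column, spark, and render_template names from the question (the dataframe placeholder is left as-is):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@spark.route('/')
def run():
    df = ... # my dataframe

    # Defining the function inside run (as suggested above) keeps the UDF's
    # pickled closure from dragging the rest of the app's modules onto the workers.
    def my_function(x):
        return len(x)

    myUDF = udf(my_function, IntegerType())
    df = df.withColumn('new_' + column, myUDF(df[column]))
    return render_template('index.html', data=df.take(1000))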
