
"No module named 'pandas'" error occurs when using pyspark pandas_udf with AWS EMR

I ran the code from this page ( https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#co-grouped-map ) using Zeppelin on AWS EMR.

%pyspark
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
df1 = spark.createDataFrame(
    [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)],
    ("time", "id", "v1"))

df2 = spark.createDataFrame(
    [(20000101, 1, "x"), (20000101, 2, "y")],
    ("time", "id", "v2"))

def asof_join(l, r):
    return pd.merge_asof(l, r, on="time", by="id")

df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
    asof_join, schema="time int, id int, v1 double, v2 string").show()
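For reference, here is what `merge_asof` does on plain pandas DataFrames, independent of Spark (a minimal sketch; the sample rows are illustrative, mirroring the frames above):

```python
import pandas as pd

# Sample frames similar to df1/df2 above, restricted to id == 1.
l = pd.DataFrame({"time": [20000101, 20000102], "id": [1, 1], "v1": [1.0, 3.0]})
r = pd.DataFrame({"time": [20000101], "id": [1], "v2": ["x"]})

# For each left row, merge_asof picks the last right row whose "time"
# is <= the left "time", matching within the same "id" group.
# Both frames must already be sorted by the "on" key.
merged = pd.merge_asof(l, r, on="time", by="id")
print(merged)
```

In the Spark version, `applyInPandas` runs this same pandas function on each co-grouped pair of partitions, which is why pandas must be importable on the executors.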

and got "ModuleNotFoundError: No module named 'pandas'" when running the last statement, the `applyInPandas(...).show()` call:

> pyspark.sql.utils.PythonException:   An exception was thrown from
> Python worker in the executor. The below is the Python worker
> stacktrace. Traceback (most recent call last):   File
> "/mnt/yarn/usercache/zeppelin/appcache/application_1765329837897_0004/container_1765329837897_0004_01_000026/pyspark.zip/pyspark/worker.py",
> line 589, in main
>     func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)   File
> "/mnt/yarn/usercache/zeppelin/appcache/application_1765329837897_0004/container_1765329837897_0004_01_000026/pyspark.zip/pyspark/worker.py",
> line 434, in read_udfs
>     arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)   File
> "/mnt/yarn/usercache/zeppelin/appcache/application_1765329837897_0004/container_1765329837897_0004_01_000026/pyspark.zip/pyspark/worker.py",
> line 254, in read_single_udf
>     f, return_type = read_command(pickleSer, infile)   File "/mnt/yarn/usercache/zeppelin/appcache/application_1765329837897_0004/container_1765329837897_0004_01_000026/pyspark.zip/pyspark/worker.py",
> line 74, in read_command
>     command = serializer._read_with_length(file)   File "/mnt/yarn/usercache/zeppelin/appcache/application_1765329837897_0004/container_1765329837897_0004_01_000026/pyspark.zip/pyspark/serializers.py",
> line 172, in _read_with_length
>     return self.loads(obj)   File "/mnt/yarn/usercache/zeppelin/appcache/application_1765329837897_0004/container_1765329837897_0004_01_000026/pyspark.zip/pyspark/serializers.py",
> line 458, in loads
>     return pickle.loads(obj, encoding=encoding)   File "/mnt/yarn/usercache/zeppelin/appcache/application_1765329837897_0004/container_1765329837897_0004_01_000026/pyspark.zip/pyspark/cloudpickle.py",
> line 1110, in subimport
>     __import__(name)
> ModuleNotFoundError: No module named 'pandas'

The library versions are pyspark 3.0.0, Spark 3.0.0, pyarrow 0.15.1, and Zeppelin 0.9.0, and the zeppelin.pyspark.python config property is set to python3.

Since pandas was not installed in the stock EMR environment, I installed it with the command "sudo python3 -m pip install pandas". I have confirmed that running "import pandas" in Zeppelin works fine.

However, when I use pandas_udf from PySpark, I get an error that pandas cannot be found. Why is this, and how can I fix it?

Adding "sudo python3 -m pip install pandas" to the shell script used as a bootstrap action solved this. Running pip by hand only installs pandas on the node where the command is executed, while a bootstrap action runs on every node of the cluster, so the executors get pandas too.
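A minimal bootstrap-action script along those lines (the S3 path in the usage example is a placeholder; pinning pyarrow is optional and just mirrors the version listed above):

```shell
#!/bin/bash
# EMR bootstrap action: runs on every node (master, core, task) before
# applications start, so pandas is available to all Spark executors.
set -euo pipefail

sudo python3 -m pip install pandas pyarrow==0.15.1
```

Upload the script to S3 and reference it when creating the cluster, e.g. `aws emr create-cluster ... --bootstrap-actions Path=s3://your-bucket/install-pandas.sh`.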
