
PySpark Environment Setup for Pandas UDF

-EDIT-

This simple example shows only 3 records, but I need to do this for billions of records, so I need to use a Pandas UDF rather than converting the Spark DF to a pandas DF and using a simple apply.

Input Data

[screenshot of input data]

Desired Output

[screenshot of desired output]

-END EDIT-

I've been banging my head against a wall trying to solve this, and I'm hoping someone can help. I'm trying to convert latitude/longitude values in a PySpark dataframe to Uber's H3 hex system. This is a pretty straightforward use of the function h3.geo_to_h3(lat=lat, lng=lon, resolution=7). However, I keep having issues with my PySpark cluster.

I'm setting up my PySpark cluster as described in the Databricks article here, using the following commands:

  1. conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas h3 numpy python=3.7 conda-pack
  2. conda init --all, then close and reopen the terminal window
  3. conda activate pyspark_conda_env
  4. conda pack -f -o pyspark_conda_env.tar.gz

I include the tar.gz file I created when building my Spark session in my Jupyter notebook, like so: spark = SparkSession.builder.master("yarn").appName("test").config("spark.yarn.dist.archives","<path>/pyspark_conda_env.tar.gz#environment").getOrCreate()
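For completeness, the fuller session setup I'd expect with this conda-pack workflow looks roughly like the sketch below; the two PYSPARK_PYTHON settings are my assumption (pointing the YARN application master and the executors at the interpreter inside the unpacked archive), not something confirmed in the original setup:

# Sketch only: same builder as above, plus (assumed) configs pointing YARN's
# application master and the executors at the Python shipped in the archive,
# which is unpacked on the nodes under the alias "environment".
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .appName("test")
    .config("spark.yarn.dist.archives", "<path>/pyspark_conda_env.tar.gz#environment")
    .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "./environment/bin/python")
    .config("spark.executorEnv.PYSPARK_PYTHON", "./environment/bin/python")
    .getOrCreate()
)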

I have my pandas UDF set up like this. I was able to get it working on a single-node Spark cluster, but I'm now having trouble on a cluster with multiple worker nodes:

# create a UDF to convert lat/lon to an H3 hex at resolution 7
import pandas as pd
import numpy as np
import h3
from pyspark.sql import functions as f

def convert_to_h3(lat: float, lon: float) -> str:
    # called row-wise from the pandas UDF below, so lat/lon arrive as scalars
    if lat is None or lon is None or np.isnan(lat) or np.isnan(lon):
        return None
    else:
        return h3.geo_to_h3(lat=lat, lng=lon, resolution=7)

@f.pandas_udf('string', f.PandasUDFType.SCALAR)
def udf_convert_to_h3(lat: pd.Series, lon: pd.Series) -> pd.Series:
    # build a pandas DataFrame from the two input Series and convert row by row
    df = pd.DataFrame({'lat': lat, 'lon': lon})
    df['h3_res7'] = df.apply(lambda x: convert_to_h3(x['lat'], x['lon']), axis=1)
    return df['h3_res7']
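As an aside, a shorter equivalent could be written in the Spark 3 type-hint style without the intermediate DataFrame. This is just a sketch of mine (the udf_convert_to_h3_v2 name is made up), and the error below was produced with the version above, not this one:

# Minimal alternative sketch (assumption, not the original code): build the
# output Series directly from the two input Series with a comprehension.
@f.pandas_udf('string')
def udf_convert_to_h3_v2(lat: pd.Series, lon: pd.Series) -> pd.Series:
    return pd.Series(
        [None if (la is None or lo is None or np.isnan(la) or np.isnan(lo))
         else h3.geo_to_h3(lat=la, lng=lo, resolution=7)
         for la, lo in zip(lat, lon)]
    )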

After creating the new column with the pandas UDF and trying to view it:

trip_starts = trip_starts.withColumn('h3_res7', udf_convert_to_h3(f.col('latitude'), f.col('longitude')))

I get the following error:

21/07/15 20:05:22 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 139 for reason Container marked as failed: container_1626376534301_0015_01_000158 on host: ip-xx-xxx-xx-xxx.aws.com. Exit status: -100. Diagnostics: Container released on a *lost* node.

I'm not sure what to do here, as I've tried scaling down the number of records to a more manageable number and am still running into this issue. Ideally, I would like to figure out how to use the PySpark environments as described in the Databricks blog post I linked, rather than running a bootstrap script when spinning up the cluster, since company policies make bootstrap scripts more difficult to run.

I ended up solving this by repartitioning my data into smaller partitions, with fewer records in each partition. This solved the problem for me; a sketch of the fix is shown below.
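A minimal sketch of that fix (the partition count of 2000 is just a placeholder value, not the number actually used):

# Repartition into more, smaller partitions before applying the UDF;
# 2000 is an illustrative value only.
trip_starts = trip_starts.repartition(2000)
trip_starts = trip_starts.withColumn('h3_res7', udf_convert_to_h3(f.col('latitude'), f.col('longitude')))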
