
Is there a faster way to convert a column of a pyspark dataframe into a python list? (collect() is very slow)

I am trying to store a column of a pyspark dataframe in a python list using the collect function, eg

list_a = [row[column_name] for row in dataset_name.collect()]

but this is a very slow process and takes more than 10 seconds for a dataframe of 3 columns and 27 rows.

Is there a faster way to do so?

I tried caching the data before this step. With caching, the above query executes in about 2 seconds, but the cache step itself takes around 7-8 seconds, so my goal of reducing time is not fulfilled.

Also, my code is structured such that I need to rebuild the dataframe every time before this step, and therefore cache it again, so caching the dataframe does not help much with the overall time. A sketch of the caching pattern follows below.
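For reference, a minimal sketch of the caching pattern described above, assuming dataset_name is the pyspark dataframe from the question (cache() is lazy, so it only pays off once the dataframe has been materialized by an action):

# cache() only marks the dataframe for caching; it is materialized
# by the first action (this is the slow 7-8 second step)
dataset_name = dataset_name.cache()
dataset_name.count()  # force materialization

# subsequent actions reuse the cached data and run much faster
list_a = [row[column_name] for row in dataset_name.collect()]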

Thanks in advance!

Your code can be slightly optimized by only collecting one column of data:

list_a = [row[column_name] for row in dataset_name.select(column_name).collect()]
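Depending on your data, other common patterns for pulling a single column into a list may also be worth benchmarking. A sketch, using the same dataset_name / column_name placeholders as the question:

# variant 1: flatten the single-column rows via the underlying RDD
list_a = dataset_name.select(column_name).rdd.flatMap(lambda x: x).collect()

# variant 2: go through pandas (requires pandas on the driver; can be
# fast when Arrow-based conversion is enabled in the Spark session)
list_a = dataset_name.select(column_name).toPandas()[column_name].tolist()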

The select-and-collect code is cleaner if you use quinn :

import quinn

list_a = quinn.column_to_list(df, col_name)

collect() transfers all the data to the driver node and is expensive. You can only make it faster by collecting less data (eg dataset_name.select(column_name).distinct().collect() would typically be faster).
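A minimal sketch of the distinct() variant, which deduplicates on the executors so that less data crosses the network to the driver:

# deduplicate before collecting if repeated values are not needed
list_a = [row[column_name] for row in dataset_name.select(column_name).distinct().collect()]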

Spark is optimized for distributing datasets across a cluster and running computations in parallel. The distributed nature of Spark makes computations that collect results on a single node comparatively slow.
