Is there a faster way to convert a column of a PySpark DataFrame into a Python list? (collect() is very slow)
I am trying to store a column of a PySpark DataFrame in a Python list using the collect() function, e.g.:
list_a = [row[column_name] for row in dataset_name.collect()]
But this is a very slow process: it takes more than 10 seconds for a DataFrame of 3 columns and 27 rows.
Is there a faster way to do this?
I tried caching the data before this step. With caching, the above query executes in 2 seconds, but the cache step itself takes around 7-8 seconds, so my goal of reducing the total time is not fulfilled.
My code also has to rebuild the DataFrame every time before this step, so I would need to cache it again each time. Caching the DataFrame therefore does not help much in reducing the time.
Thanks in advance!
Your code can be slightly optimized by only collecting one column of data:
list_a = [row[column_name] for row in dataset_name.select(column_name).collect()]
This code is cleaner if you use quinn:
import quinn
list_a = quinn.column_to_list(df, col_name)
collect() transfers all the data to the driver node and is expensive. You can only make it faster by collecting less data (e.g. dataset_name.select(column_name).distinct().collect() would typically be faster, if you only need the distinct values).
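The "collect less data" idea can be sketched as a small helper. This is only an illustration: collect_column and its distinct flag are hypothetical names, not part of PySpark or quinn, and df is assumed to be a pyspark.sql.DataFrame.

```python
from typing import Any, List


def collect_column(df: Any, column_name: str, distinct: bool = False) -> List[Any]:
    """Return the values of one column as a Python list.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame.
    Only `select`, `distinct` and `collect` are called on it.
    """
    # Project down to a single column so only that column reaches the driver.
    projected = df.select(column_name)
    if distinct:
        # Drops duplicate values. This changes the result and typically
        # adds a shuffle, so use it only when unique values are wanted.
        projected = projected.distinct()
    # Each collected Row has exactly one field, so index it by position.
    return [row[0] for row in projected.collect()]
```

With the names from the question, this would be called as list_a = collect_column(dataset_name, column_name). Passing distinct=True returns each value once, in no guaranteed order.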
Spark is optimized for distributing datasets across a cluster and running computations in parallel. The distributed nature of Spark makes computations that collect results on a single node comparatively slow.
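Because Spark evaluates transformations lazily, the time observed at collect() may include the work of building the DataFrame as well as the transfer to the driver. To see how long the whole action takes, a rough timing sketch like the following may help. The timed helper is a hypothetical name and uses only the Python standard library.

```python
import time
from typing import Any, Callable, Tuple


def timed(action: Callable[[], Any]) -> Tuple[Any, float]:
    """Run `action` once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    return result, elapsed
```

For example, rows, seconds = timed(lambda: dataset_name.select(column_name).collect()) measures the full action, including any upstream computation that Spark had deferred until that point.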