
Is there a faster way to convert a column of a pyspark dataframe into a python list? (collect() is very slow)

I am trying to store a column of a pyspark dataframe in a python list using the collect function, eg

list_a = [row[column_name] for row in dataset_name.collect()]

but this is a very slow process and takes more than 10 seconds for a dataframe of 3 columns and 27 rows.

Is there a faster way to do so?

I tried caching the data before this step. With caching, the above query executes in about 2 seconds, but the cache step itself takes around 7-8 seconds, so my goal of reducing time is not fulfilled.

Also, my code is structured such that I need to rebuild the dataframe every time before this step, and therefore cache it again, so caching the dataframe does not help much with the overall time. A sketch of the caching pattern follows below.
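For reference, a minimal sketch of the caching pattern described above, assuming dataset_name is the pyspark dataframe from the question (cache() is lazy, so it only pays off once the dataframe has been materialized by an action):

# cache() only marks the dataframe for caching; it is materialized
# by the first action (this is the slow 7-8 second step)
dataset_name = dataset_name.cache()
dataset_name.count()  # force materialization

# subsequent actions reuse the cached data and run much faster
list_a = [row[column_name] for row in dataset_name.collect()]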

Thanks in advance!

Your code can be slightly optimized by only collecting one column of data:

list_a = [row[column_name] for row in dataset_name.select(column_name).collect()]
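Depending on your data, other common patterns for pulling a single column into a list may also be worth benchmarking. A sketch, using the same dataset_name / column_name placeholders as the question:

# variant 1: flatten the single-column rows via the underlying RDD
list_a = dataset_name.select(column_name).rdd.flatMap(lambda x: x).collect()

# variant 2: go through pandas (requires pandas on the driver; can be
# fast when Arrow-based conversion is enabled in the Spark session)
list_a = dataset_name.select(column_name).toPandas()[column_name].tolist()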

The select-and-collect code is cleaner if you use quinn :

import quinn

list_a = quinn.column_to_list(df, col_name)

collect() transfers all the data to the driver node and is expensive. You can only make it faster by collecting less data (eg dataset_name.select(column_name).distinct().collect() would typically be faster).
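A minimal sketch of the distinct() variant, which deduplicates on the executors so that less data crosses the network to the driver:

# deduplicate before collecting if repeated values are not needed
list_a = [row[column_name] for row in dataset_name.select(column_name).distinct().collect()]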

Spark is optimized for distributing datasets across a cluster and running computations in parallel. The distributed nature of Spark makes computations that collect results on a single node comparatively slow.
