How to convert a pyspark dataframe column to a numpy array

Question

I am trying to convert a pyspark dataframe column with approximately 90 million rows into a numpy array.

I need the array as input for the scipy.optimize.minimize function.

I have tried both converting to Pandas and using collect(), but these methods are very time-consuming.

I am new to PySpark. If there is a faster and better approach, please help.

Thanks

This is what my dataframe looks like:

+----------+
|Adolescent|
+----------+
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
+----------+

Answer 1

#1

One way or another, you will have to call .collect(), which brings the data to the driver. To create a numpy array from the pyspark dataframe, you can use:

import numpy as np
adoles = np.array(df.select("Adolescent").collect())  # append .reshape(-1) for a 1-D array
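The reshape mentioned in the comment matters because collect() on a single selected column returns one Row per record, so the resulting array has shape (n, 1). The sketch below illustrates this with plain tuples standing in for Row objects (Rows behave like tuples); the helper name rows_to_1d_array is hypothetical and not part of PySpark or numpy.

```python
import numpy as np

def rows_to_1d_array(rows, dtype=np.float64):
    """Flatten single-column rows (as returned by collect()) into a 1-D array."""
    arr = np.array(rows, dtype=dtype)  # shape (n_rows, 1) for one selected column
    return arr.reshape(-1)             # shape (n_rows,)

# Stand-in for df.select("Adolescent").collect(); real Row objects are tuple-like.
rows = [(0.0,), (0.0,), (1.5,)]
adoles = rows_to_1d_array(rows)
print(adoles.shape)
```

A 1-D array is usually the more convenient shape to pass on to functions such as scipy.optimize.minimize.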

#2

You can convert it to a pandas dataframe using toPandas(), and then convert that to a numpy array using .values.

pdf = df.toPandas()
adoles = pdf["Adolescent"].values  # index the pandas dataframe, not the Spark one

Or simply:或者简单地说：

adoles = df.select("Adolescent").toPandas().values #.reshape(-1) for 1-D array
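Since the question is about speed, one option that may help with the toPandas() route is Spark's Arrow-based transfer. The following is a minimal sketch, assuming Spark 3.x with pyarrow installed (in Spark 2.3/2.4 the setting was named spark.sql.execution.arrow.enabled); the function name column_to_numpy is hypothetical, and spark and df stand for an existing SparkSession and DataFrame.

```python
def column_to_numpy(spark, df, column):
    """Pull one column of a Spark DataFrame to the driver as a 1-D numpy array."""
    # Assumption: Spark 3.x config key; requires pyarrow on the driver and executors.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    # Select only the needed column before converting, to limit the data moved.
    pdf = df.select(column).toPandas()
    # Indexing the pandas DataFrame by column name yields a Series, so the
    # result is already 1-D and needs no reshape.
    return pdf[column].to_numpy()

# Hypothetical usage with an existing SparkSession `spark` and DataFrame `df`:
#   adoles = column_to_numpy(spark, df, "Adolescent")
```

The data still has to fit in driver memory; Arrow only changes how it is transferred, so any speed-up should be measured rather than assumed.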

#3

For distributed arrays, you can try Dask Arrays.

I haven't tested this, but I assume it works the same way as numpy (there might be inconsistencies):

import dask.array as da
adoles = da.array(df.select("Adolescent").collect()) #.reshape(-1) for 1-D array

Answer 2

Another way is to convert the selected column to an RDD, flatten it by extracting the value of each Row (you can abuse .keys() for this), and then convert the result to a numpy array:

import numpy as np
x = df.select("colname").rdd.map(lambda r: r[0]).collect()  # python list
np.array(x)  # numpy array
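A variation on this RDD approach is to build one numpy array per partition on the executors and concatenate them on the driver, so the driver receives a few arrays instead of one Python object per row. This is only a sketch: the helper names partition_to_array and concat_partitions are hypothetical, the column is assumed to contain no nulls, and whether it is actually faster depends on the data and the cluster.

```python
import numpy as np

def partition_to_array(rows):
    """Turn one partition's iterator of single-column rows into a float64 array."""
    # Yield exactly one array per partition, as mapPartitions expects an iterable.
    yield np.fromiter((row[0] for row in rows), dtype=np.float64)

def concat_partitions(arrays):
    """Join the per-partition arrays into a single 1-D array."""
    arrays = list(arrays)
    return np.concatenate(arrays) if arrays else np.empty(0, dtype=np.float64)

# Hypothetical usage with an existing DataFrame `df`:
#   parts = df.select("colname").rdd.mapPartitions(partition_to_array).collect()
#   x = concat_partitions(parts)

# Local stand-in: two "partitions" of tuple-like rows.
partitions = ([(0.0,), (1.0,)], [(2.0,)])
parts = [next(partition_to_array(iter(p))) for p in partitions]
x = concat_partitions(parts)
print(x)
```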

2 answers

Answer 1: score 2, accepted, 2019-09-30 08:01:45
Answer 2: score 0, 2021-10-25 03:48:23
