How to convert a pyspark dataframe column to numpy array
I am trying to convert a pyspark dataframe column with approximately 90 million rows into a numpy array.
I need the array as input for the scipy.optimize.minimize function.
I have tried both converting to Pandas and using collect(), but these methods are very time-consuming.
I am new to PySpark. If there is a faster and better approach, please help.
Thanks
This is what my dataframe looks like:
+----------+
|Adolescent|
+----------+
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
+----------+
You will have to call .collect() one way or another. To create a numpy array from the pyspark dataframe, you can use:
import numpy as np
adoles = np.array(df.select("Adolescent").collect()) #.reshape(-1) for 1-D array
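To see why the .reshape(-1) in the comment matters, here is a minimal sketch of the shapes involved. It uses plain tuples as a stand-in for the Row objects returned by collect() (Row is a tuple subclass), purely so it runs without a Spark session; the sample values are made up.

```python
import numpy as np

# Assumption: plain tuples stand in for the pyspark Row objects that
# collect() returns for a single selected column.
rows = [(0.0,), (0.0,), (1.5,)]

arr_2d = np.array(rows)       # one row per record, one column per selected field
arr_1d = arr_2d.reshape(-1)   # flatten to a 1-D vector

print(arr_2d.shape)  # (3, 1)
print(arr_1d.shape)  # (3,)
```

A function that expects a flat parameter or data vector will usually want the 1-D form.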
You can convert it to a pandas dataframe using toPandas(), and then convert it to a numpy array using .values:
pdf = df.toPandas()
adoles = pdf["Adolescent"].values
Or simply:
adoles = df.select("Adolescent").toPandas().values #.reshape(-1) for 1-D array
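Since the question mentions that toPandas() is slow, one thing that may help is Arrow-based transfer. The sketch below is not from the original answer: it assumes Spark 3.x with pyarrow installed, that `spark` is an existing SparkSession, and the helper name `column_to_numpy` is hypothetical. The whole column still has to fit in driver memory.

```python
def column_to_numpy(spark, df, colname):
    """Hypothetical helper: pull one column to the driver as a 1-D numpy array.

    Assumes `spark` is an active SparkSession (Spark 3.x) and pyarrow is installed.
    """
    # Ask Spark to use Arrow for toPandas(); if Arrow cannot be used,
    # Spark may fall back to the slower non-Arrow path.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    # Select only the needed column before collecting to keep the transfer small.
    pdf = df.select(colname).toPandas()
    return pdf[colname].to_numpy()

# Usage (needs a running Spark session, so it is not executed here):
# adoles = column_to_numpy(spark, df, "Adolescent")
```

Whether this is fast enough for 90 million rows depends on the cluster and driver, so it is worth timing on a sample first.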
For distributed arrays, you can try Dask Arrays.
I haven't tested this, but I assume it would work the same as numpy (there might be inconsistencies):
import dask.array as da
adoles = da.array(df.select("Adolescent").collect()) #.reshape(-1) for 1-D array
Another way is to convert the selected column to an RDD, flatten it by extracting the value of each Row (you can abuse .keys()), then convert to a numpy array:
x = df.select("colname").rdd.map(lambda r: r[0]).collect() # python list
np.array(x) # numpy array
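A variation on the same idea, as a sketch only: np.fromiter can fill a 1-D array straight from an iterator of single-field rows, which avoids building the intermediate Python list. The helper name `rows_to_array` is hypothetical, and a generator of plain tuples stands in for the rows so the snippet runs without Spark. In principle the iterator from df.select("colname").toLocalIterator() could be passed instead, though that is untested here and not necessarily faster.

```python
import numpy as np

def rows_to_array(rows, dtype=np.float64):
    """Build a 1-D array from an iterable of single-field rows."""
    # r[0] extracts the only field of each row, as in the lambda above.
    return np.fromiter((r[0] for r in rows), dtype=dtype)

# Assumption: a generator of tuples stands in for an iterator of Row objects.
fake_rows = ((float(i),) for i in range(5))
x = rows_to_array(fake_rows)
print(x.shape, x.dtype)
```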