
How can I convert a Spark dataframe column to a Numpy array efficiently?

I have a Spark dataframe with around 1 million rows. I am using pyspark and have to apply the Box-Cox transformation from the scipy library on each column of the dataframe. But the boxcox function only accepts a 1-d numpy array as input. How can I do this efficiently?

Is the numpy array distributed on Spark, or does it collect all the elements to the single node on which the driver program is running?

Suppose df is my dataframe with a column C1; then I want to perform an operation similar to this:

stats.boxcox(df.select("C1"))

Dataframes/RDDs in Spark abstract away how the processing is distributed.

To do what you require, I think a UDF can be very useful. Here you can see an example of its use:

Functions from Python packages for udf() of Spark dataframe
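
For illustration, here is a minimal sketch of that idea. Since stats.boxcox needs the whole column to estimate the lambda parameter, one option is to estimate lambda once from a sample collected to the driver, and then apply the transform elementwise with the fixed lambda through a UDF, so the transformation itself stays distributed. The sampling fraction and the output column name are illustrative, and the column is assumed to be strictly positive and non-null:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from scipy import stats, special

# Estimate the Box-Cox lambda on a sample small enough to collect to the driver
sample = df.select("C1").sample(False, 0.1).toPandas()["C1"].values
_, lmbda = stats.boxcox(sample)  # requires strictly positive values

# scipy.special.boxcox applies the transform elementwise for a fixed lambda
boxcox_udf = udf(lambda x: float(special.boxcox(x, lmbda)), DoubleType())
df = df.withColumn("C1_BCT", boxcox_udf("C1"))
df.show(2)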

I have a workaround that solves the issue, but I am not sure it is the optimal solution in terms of performance, as you are switching between pyspark and pandas dataframes:

import pandas as pd
from scipy import stats

dfpd = df.toPandas()  # collect the Spark dataframe to the driver as pandas
colName = 'YOUR_COLUMN_NAME'
colBCT_Name = colName + '_BCT'
print(colBCT_Name)

maxVal = dfpd[colName].max()
minVal = dfpd[colName].min()
print(maxVal)
print(minVal)

# Shift the column so every value is positive, as box-cox requires
col_bct, lmbda = stats.boxcox(dfpd[colName] - minVal + 1)
# Rescale the transformed values
col_bct = col_bct * lmbda / ((maxVal + 1) ** lmbda - 1)
dfpd[colBCT_Name] = pd.Series(col_bct)
df = sqlContext.createDataFrame(dfpd)
df.show(2)
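
Note that toPandas() collects the whole dataframe onto the driver, so this workaround is only feasible when the data fits in the driver's memory.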
