How to avoid hardcoding the column selection in a data frame in Apache Spark | Scala
I have the following data frame, and I need to run logistic regression on it using Spark ML:
uid  a  b  c  label  d
1    0  1  3  0      2
2    3  0  0  1      0
While using the ML package, I learned that I need to provide the data in the following format:
label  features
0      [0,1,3,2]
1      [3,0,0,0]
Now I came across VectorAssembler to create the features column, and to use it I need to do something like:
val assembler = new VectorAssembler()
  .setInputCols(Array("a", "b", "c", "d"))
  .setOutputCol("features")
Is there any way I can avoid hardcoding the individual feature column names?
It depends on your data. If you know that you will always have a certain set of columns that are not part of your feature vector (uid and label), and you can assume that all other columns are, you can do it like this:
// df is your data frame
val assembler = new VectorAssembler()
  .setInputCols(df.columns.diff(Array("uid", "label")))
  .setOutputCol("features")
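Putting it together, a minimal end-to-end sketch might look like the following. It assumes an existing SparkSession named `spark` and recreates the sample data from the question; the variable names (`featureCols`, `assembled`) are illustrative:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Recreate the sample data frame from the question
val df = spark.createDataFrame(Seq(
  (1, 0, 1, 3, 0, 2),
  (2, 3, 0, 0, 1, 0)
)).toDF("uid", "a", "b", "c", "label", "d")

// Every column except uid and label is treated as a feature
val featureCols = df.columns.diff(Array("uid", "label"))
// featureCols is Array("a", "b", "c", "d")

val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

// Produces the label/features shape shown above
val assembled = assembler.transform(df).select("label", "features")

// Fit logistic regression on the assembled data
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
val model = lr.fit(assembled)
```

Note that `Array.diff` preserves the original column order of `df.columns`, so the feature vector layout stays deterministic as long as the input schema does.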