How to avoid hardcoding the column selection in a data frame in Apache Spark | Scala
I have the following data frame, and I need to run logistic regression on it using Spark ML:
uid  a  b  c  label  d
1    0  1  3  0      2
2    3  0  0  1      0
While using the ML package, I learned that I need to provide the data in the following format:
label  features
0      [0,1,3,2]
1      [3,0,0,0]
Now I came across VectorAssembler to create the features column, and to use it I need to do something like:
val assembler = new VectorAssembler()
  .setInputCols(Array("a", "b", "c", "d"))
  .setOutputCol("features")
Is there any way I can avoid hardcoding the individual feature column names?
It depends on your data. If you know that you will always have a certain set of columns that are not part of your feature vector (uid and label), and you can assume that all other columns are, you can do it like this:
// df is your data frame
val assembler = new VectorAssembler()
  .setInputCols(df.columns.diff(Array("uid", "label")))
  .setOutputCol("features")
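Putting it together, a minimal end-to-end sketch might look like the following. It assumes an existing SparkSession named `spark` and recreates the sample data from the question; the variable names (`featureCols`, `assembled`) are illustrative:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Recreate the sample data frame from the question
val df = spark.createDataFrame(Seq(
  (1, 0, 1, 3, 0, 2),
  (2, 3, 0, 0, 1, 0)
)).toDF("uid", "a", "b", "c", "label", "d")

// Every column except uid and label is treated as a feature
val featureCols = df.columns.diff(Array("uid", "label"))
// featureCols is Array("a", "b", "c", "d")

val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

// Produces the label/features shape shown above
val assembled = assembler.transform(df).select("label", "features")

// Fit logistic regression on the assembled data
val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
val model = lr.fit(assembled)
```

Note that `Array.diff` preserves the original column order of `df.columns`, so the feature vector layout stays deterministic as long as the input schema does.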