Selecting columns of Dataframe in Spark Scala
If you want to select the first column of a dataframe, this can be done:
df.select(df.columns(0))
df.columns(0) returns a string, so by giving the name of the column, select is able to get the column correctly.
Now, suppose I want to select the first 3 columns of the dataset. This is what I would intuitively do:
df.select(df.columns.split(0,3):_*)
To my understanding, the _* operator would pass the array of strings as a vararg, and it would be the same as passing (df.columns(0), df.columns(1), df.columns(2)) to the select statement. However, this doesn't work and it is necessary to do this:
import org.apache.spark.sql.functions.col
df.select(df.columns.split(0,3).map(i => col(i)):_*)
What is going on?
I think in the question you meant slice instead of split.
And as for your question, df.columns.slice(0,3):_* is meant to be passed to functions with *-parameters (varargs), i.e. if you call select(columns:_*) then there must be a function defined with varargs that accepts that element type, e.g. def select(cols: String*).
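To make the role of :_* concrete, here is a minimal sketch in plain Scala, with no Spark involved. The names SplatSketch and joinAll are hypothetical and exist only for illustration; the point is that a sequence argument is expanded into the repeated parameter exactly as if the elements had been written out one by one.

```scala
object SplatSketch {
  // A method with a repeated (vararg) parameter; inside the body, parts is a Seq[String].
  def joinAll(parts: String*): String = parts.mkString(",")

  val names: Array[String] = Array("a", "b", "c", "d")

  // Passing the elements one by one...
  val direct: String = joinAll("a", "b", "c")

  // ...is equivalent to expanding a slice of the array with : _*
  val splatted: String = joinAll(names.slice(0, 3): _*)
}
```

This only works because joinAll declares String* and the array holds Strings; the element type of the sequence has to match the repeated parameter.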
But there can only be one such function defined; no overloading is possible here, because String* and Column* both erase to Seq on the JVM. Example of why two different functions with the same vararg-parameter declaration cannot coexist (in the REPL the second definition simply replaces the first; inside a class the compiler rejects the pair):
def select(cols: String*): String = "string"
select() // returns "string"
def select(cols: Column*): Int = 3
select() // now returns 3
And in Spark, that one function is defined not for Strings but for Columns:
def select(cols: Column*)
For Strings, the method is declared like this:
def select(col: String, cols: String*)
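The String variant takes one fixed argument before the repeated one, which is why an array of names cannot be expanded into it with :_* alone: the expansion can only fill the repeated parameter, and the leading col would be left without a value. The sketch below models the two signatures in plain Scala so it can be run without Spark. Column and OverloadSketch here are hypothetical stand-ins that only mimic the shape of the real API, not Spark's own classes.

```scala
// Hypothetical stand-in for org.apache.spark.sql.Column, for illustration only.
final case class Column(name: String)

object OverloadSketch {
  // Same shape as select(cols: Column*)
  def select(cols: Column*): Seq[String] = cols.map(_.name)

  // Same shape as select(col: String, cols: String*)
  def select(col: String, cols: String*): Seq[String] = col +: cols

  val firstThree: Array[String] = Array("a", "b", "c", "d").slice(0, 3)

  // select(firstThree: _*) would not compile: the expansion would have to
  // match Column*, and the String overload needs a leading argument first.

  // Option 1: convert every name to a Column, then expand.
  val viaColumns: Seq[String] = select(firstThree.map(Column(_)): _*)

  // Option 2: supply the head as the fixed argument and expand only the tail.
  val viaStrings: Seq[String] = select(firstThree.head, firstThree.tail: _*)
}
```

With a real DataFrame the second option corresponds to calling select with the head of the name array followed by the expanded tail; it assumes the array is non-empty, since head fails on an empty array.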
I suggest you stick to Columns, like you do now, but with some syntactic sugar:
df.select(df.columns.slice(0,3).map(col):_*)
Or if there's a need to pass column names as Strings, then you can use selectExpr:
df.selectExpr(df.columns.slice(0,3):_*)