Selecting columns of Dataframe in Spark Scala

If you want to select the first column of a dataframe, this can be done:

df.select(df.columns(0))

df.columns(0) returns a string, so given the name of the column, select is able to get the column correctly.

Now, suppose I want to select the first 3 columns of the dataset. This is what I would intuitively do:

df.select(df.columns.split(0,3):_*)

To my understanding, the _* operator passes the array of strings as varargs, so this would be the same as passing (df.columns(0), df.columns(1), df.columns(2)) to the select statement. However, this doesn't work, and it is necessary to do this:

import org.apache.spark.sql.functions.col
df.select(df.columns.split(0,3).map(i => col(i)):_*)

What is going on?

I think in the question you meant slice instead of split.

As for your question, df.columns.slice(0,3):_* is meant to be passed to functions with *-parameters (varargs), i.e. if you call select(columns:_*) then there must be a function defined with varargs, e.g. def select(cols: String*).
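To make the :_* mechanics concrete, here is a minimal sketch in plain Scala that needs no Spark. The join method and the VarargsDemo object are hypothetical stand-ins, not part of any library. They only illustrate that a String* parameter accepts either separate arguments or a sequence expanded with :_*.

```scala
object VarargsDemo {
  // Hypothetical helper with a varargs (String*) parameter.
  // Inside the body, `parts` is seen as a Seq[String].
  def join(parts: String*): String = parts.mkString(",")

  val names: Array[String] = Array("a", "b", "c", "d")

  // Passing the arguments one by one...
  val direct: String = join("a", "b", "c")

  // ...is equivalent to expanding a sequence with `: _*`.
  val expanded: String = join(names.slice(0, 3): _*)
}
```

Both values hold the same string, because the compiler hands the method a sequence in either case.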

But only one such function can be defined; overloading is not possible here, because a String* parameter and a Column* parameter both erase to Seq, so the two methods would have the same signature after type erasure. Here is what happens if you define two functions whose only parameter is a vararg (entered in the REPL, where the second definition simply replaces the first):

def select(cols: String*): String = "string"
select() // returns "string"
def select(cols: Column*): Int = 3
select() // now returns 3

In Spark, that one function is defined not for Strings but for Columns:

def select(cols: Column*)

For Strings, the method is declared like this:

def select(col: String, cols: String*)
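The effect of these two signatures on a call site can be sketched without Spark. In the sketch below, Col, MiniFrame and OverloadDemo are hypothetical stand-ins that only mirror the shape of the two overloads; they are not Spark classes.

```scala
// Hypothetical stand-in for Spark's Column.
final case class Col(name: String)

object MiniFrame {
  // Same shape as select(cols: Column*)
  def select(cols: Col*): Seq[String] = cols.map(_.name)

  // Same shape as select(col: String, cols: String*)
  def select(col: String, cols: String*): Seq[String] = col +: cols
}

object OverloadDemo {
  val names: Array[String] = Array("a", "b", "c", "d").slice(0, 3)

  // MiniFrame.select(names: _*) would not compile: the String overload needs
  // one plain String before the varargs, and the other overload expects Col values.

  // Works: wrap each name so that the Col* overload applies.
  val viaCols: Seq[String] = MiniFrame.select(names.map(Col(_)): _*)

  // Also works: pass the first name separately, then expand the rest.
  val viaStrings: Seq[String] = MiniFrame.select(names.head, names.tail: _*)
}
```

The head/tail form matches the String signature exactly, which is why the same pattern should also work against the real select(col: String, cols: String*).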

I suggest you stick to Columns, as you do now, but with some syntactic sugar:

df.select(df.columns.slice(0,3).map(col):_*)

Or, if you need to pass column names as Strings, you can use selectExpr:

df.selectExpr(df.columns.slice(0,3):_*)
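For reference, the variants discussed here can be put side by side in one small program. This is a sketch under assumptions: the local SparkSession settings, the sample data, and the SelectFirstThree object name are invented for illustration, and Spark SQL is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SelectFirstThree {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("select-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with four columns.
    val df = Seq((1, "x", 2.0, true)).toDF("a", "b", "c", "d")
    val firstThree = df.columns.slice(0, 3) // Array("a", "b", "c")

    // 1. Column-based overload: select(cols: Column*)
    val byColumn = df.select(firstThree.map(col): _*)

    // 2. String-based overload: select(col: String, cols: String*)
    val byName = df.select(firstThree.head, firstThree.tail: _*)

    // 3. selectExpr(exprs: String*), which parses each string as a SQL expression
    val byExpr = df.selectExpr(firstThree: _*)

    // All three results should expose the same column names.
    Seq(byColumn, byName, byExpr).foreach { d =>
      assert(d.columns.sameElements(firstThree))
    }

    spark.stop()
  }
}
```

Because selectExpr treats each string as a SQL expression rather than a plain name, the Column-based form is usually the safer choice when names come from df.columns.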
