
Select specified columns from DataFrame (Java/Spark)

I have the following problem:

I have a schema (1) and a DataFrame with another schema (2). The DataFrame's schema differs from schema (1) in only one respect: it has one more column. I want to select from the DataFrame exactly the columns that are specified in schema (1).

Example:

StructType schema; //specified in constructor
DataFrame df_old; //given as parameter
DataFrame df_new = df_old.select(schema.fieldNames());

This won't work, because the String-based select() expects a first column name followed by a varargs list of further names, select(String col, String... cols), and a single String[] does not match that. So my idea was:

StructType schema; //specified in constructor
DataFrame df_old; //given as parameter

String[] columns = schema.fieldNames(); //get column names as string array
String   first_col = columns[0]; // get first element of string array
columns = Arrays.copyOfRange(columns, 1, columns.length); //remove first element

DataFrame df_new = df_old.select(first_col,columns);

I don't think this is the best way, because I suspect copyOfRange() could cost a lot of time, especially if there are many columns and I need to run this multiple times.
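On the cost concern: copyOfRange() copies one reference per remaining column, so the work grows linearly with the number of columns and is likely small next to the Spark job itself. Since the schema is fixed in the constructor, the split could also be done once and reused. Below is a minimal plain-Java sketch of that idea. HeadTail and describeSelect are hypothetical names introduced for illustration only; describeSelect merely mimics the parameter shape of select(String col, String... cols) so the sketch runs without Spark.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: splits a name array into the
// (first, rest) pair that a method shaped like select(String, String...) expects.
final class HeadTail {
    final String head;
    final String[] tail;

    private HeadTail(String head, String[] tail) {
        this.head = head;
        this.tail = tail;
    }

    static HeadTail of(String[] names) {
        if (names.length == 0) {
            throw new IllegalArgumentException("at least one column name is required");
        }
        // Copies names.length - 1 references; the strings themselves are not duplicated.
        return new HeadTail(names[0], Arrays.copyOfRange(names, 1, names.length));
    }

    // Stand-in (assumption) with the same parameter shape as select(String col, String... cols).
    static String describeSelect(String col, String... cols) {
        return col + (cols.length == 0 ? "" : "," + String.join(",", cols));
    }
}
```

With Spark, the stored pair would be passed as df_old.select(ht.head, ht.tail) on every call, so the array copy happens only once per schema.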

Does anybody have a better idea?

Thanks for your answers. :)

// Try this: build a Column[] and use the select(Column...) overload.
// Requires: import org.apache.spark.sql.Column;
//           import static org.apache.spark.sql.functions.col;
StructType schema; //specified in constructor
DataFrame df_old; //given as parameter

String[] columns = schema.fieldNames();
Column[] colList = new Column[columns.length];
for (int i = 0; i < columns.length; i++) {
    colList[i] = col(columns[i]); // wrap each column name in a Column
}

DataFrame df_new = df_old.select(colList);
