Spark DataFrame and renaming multiple columns (Java)
Is there a nicer way to prefix or rename all or several columns of a given SparkSQL DataFrame at the same time than calling dataFrame.withColumnRenamed() multiple times?
An example would be if I want to detect changes (using a full outer join). Then I'm left with two DataFrames with the same structure.
I suggest using the select() method for this. In fact, the withColumnRenamed() method itself uses select(). Here is an example of how to rename multiple columns:
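import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

val someDataframe: DataFrame = ...
val initialColumnNames = Seq("a", "b", "c")
// Build one aliased Column per original name, then select them all in a single call.
val renamedColumns = initialColumnNames.map(name => col(name).as(s"renamed_$name"))
someDataframe.select(renamedColumns: _*)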
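Since the question asks about Java, the name-building half of this idea can be sketched in plain Java, without Spark on the classpath. The class ColumnPrefixer and the method prefixAll below are hypothetical names introduced only for illustration. The sketch assumes that, given a Dataset<Row> called dataset, the resulting array could then be passed to dataset.toDF(...), which assigns new names to all columns by position in one call.

```java
import java.util.Arrays;

// Hypothetical helper (not part of Spark): builds prefixed column names.
final class ColumnPrefixer {
    private ColumnPrefixer() {}

    /** Returns a new array with every name prefixed; the input order is preserved. */
    static String[] prefixAll(String[] columns, String prefix) {
        return Arrays.stream(columns)
                .map(name -> prefix + name)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] original = {"a", "b", "c"};
        String[] renamed = prefixAll(original, "renamed_");
        System.out.println(Arrays.toString(renamed)); // [renamed_a, renamed_b, renamed_c]
        // With Spark available (assumption, not executed here):
        // Dataset<Row> result = dataset.toDF(renamed);
    }
}
```

Because the new names are matched to columns by position, the input array should come straight from dataset.columns() so that the order lines up.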
I think this method can help you.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Renames every column from underscore_case to camelCase, one column at a time.
// SystemUtils is presumably the poster's own class that holds the helper below.
public static Dataset<Row> renameDataFrame(Dataset<Row> dataset) {
    for (String column : dataset.columns()) {
        dataset = dataset.withColumnRenamed(column, SystemUtils.underscoreToCamelCase(column));
    }
    return dataset;
}

// Drops each underscore and upper-cases the character that follows it.
public static String underscoreToCamelCase(String underscoreName) {
    StringBuilder result = new StringBuilder();
    if (underscoreName != null && underscoreName.length() > 0) {
        boolean upperNext = false;
        for (int i = 0; i < underscoreName.length(); i++) {
            char ch = underscoreName.charAt(i);
            if (ch == '_') {
                upperNext = true;
            } else if (upperNext) {
                result.append(Character.toUpperCase(ch));
                upperNext = false;
            } else {
                result.append(ch);
            }
        }
    }
    return result.toString();
}
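One thing to watch with a per-column rename like this is that two source names can map to the same target (for example, user_id and userId would both become userId), which may leave the result with ambiguous column names. Below is a minimal sketch of a pre-check in plain Java. The class RenamePlan and its build method are hypothetical names, not part of Spark or of the answer above, and any renaming function can be passed in.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical helper: computes old -> new names and rejects clashes up front.
final class RenamePlan {
    private RenamePlan() {}

    /** Maps each column to its new name, failing if two columns share a target. */
    static Map<String, String> build(String[] columns, UnaryOperator<String> rename) {
        Map<String, String> oldToNew = new LinkedHashMap<>();
        Map<String, String> newToOld = new LinkedHashMap<>();
        for (String column : columns) {
            String target = rename.apply(column);
            // putIfAbsent returns the earlier source column if the target is already taken.
            String clash = newToOld.putIfAbsent(target, column);
            if (clash != null) {
                throw new IllegalArgumentException(
                        "Columns '" + clash + "' and '" + column + "' both map to '" + target + "'");
            }
            oldToNew.put(column, target);
        }
        return oldToNew;
    }
}
```

The returned map keeps the original column order, so it could drive the same withColumnRenamed loop as above once the check has passed.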
I have just found the answer on Stack Overflow (see the end of the accepted answer):
from pyspark.sql.functions import col
df1_r = df1.select(*(col(x).alias(x + '_df1') for x in df1.columns))
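// Start from the original DataFrame and accumulate the renames,
// so that each iteration builds on the previous one.
var newsales_var = newsales
for (name <- newsales.columns) {
  val new_c = name.replace('(', '_').replace(')', ' ').trim
  newsales_var = newsales_var.withColumnRenamed(name, new_c)
}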
Although this does not answer your question directly, I always update column names one by one. Since renaming only updates the DataFrame's metadata, there is no harm (no performance impact) in updating column names one by one, e.g.:
for c in DF.columns:
    new_c = c.strip().replace(' ', '_')
    DF = DF.withColumnRenamed(c, new_c)