Spark - remove special characters from rows Dataframe with different column types
Replace special characters of column names in Spark dataframe
My input spark-dataframe, named df, is:
+---------------+----------------+-----------------------+
|Main_CustomerID|126+ Concentrate|2.5 Ethylhexyl_Acrylate|
+---------------+----------------+-----------------------+
| 725153| 3.0| 2.0|
| 873008| 4.0| 1.0|
| 625109| 1.0| 0.0|
+---------------+----------------+-----------------------+
I need to remove the special characters from the column names of df as follows:
remove +
replace space with underscore
replace dot with underscore
So my df should look like:
+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
| 725153| 3.0| 2.0|
| 873008| 4.0| 1.0|
| 625109| 1.0| 0.0|
+---------------+---------------+-----------------------+
Using Scala, I have achieved this with:
var tableWithColumnsRenamed = df
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\.", "_"))
}
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\+", ""))
}
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll(" ", "_"))
}
df = tableWithColumnsRenamed
But when I use:
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\.", "_"))
    .withColumnRenamed(field, field.replaceAll("\\+", ""))
    .withColumnRenamed(field, field.replaceAll(" ", "_"))
}
I get 126 Concentrate as the column name instead of 126_Concentrate. But I don't like using three for loops for these replacements. Can I get a better solution?
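For reference (not part of the original post): the chained version fails because each `withColumnRenamed` call is given the original `field` as the current name, and after the first successful rename no column with that name exists any more, so the later renames are silent no-ops. Chaining the string replacements instead of the rename calls avoids this. A minimal sketch of just the name-cleaning function, runnable without a Spark session:

```scala
// Combine all three replacements into one cleaning function,
// so each column needs only a single withColumnRenamed call:
def clean(name: String): String =
  name
    .replaceAll("\\.", "_") // dot -> underscore
    .replaceAll("\\+", "")  // drop "+"
    .replaceAll(" ", "_")   // space -> underscore

// With a DataFrame this would be used as:
// for (field <- df.columns)
//   df = df.withColumnRenamed(field, clean(field))
```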
df
  .columns
  .foldLeft(df) { (newdf, colname) =>
    // also drop "+" so "126+ Concentrate" becomes "126_Concentrate"
    newdf.withColumnRenamed(
      colname,
      colname.replace("+", "").replace(" ", "_").replace(".", "_"))
  }
  .show
You can use withColumnRenamed with regex replaceAllIn and foldLeft, as follows:
val columns = df.columns
val regex = """[+._, ]+"""
val replacingColumns = columns.map(regex.r.replaceAllIn(_, "_"))
val resultDF = replacingColumns.zip(columns).foldLeft(df) {
  (tempdf, name) => tempdf.withColumnRenamed(name._2, name._1)
}
resultDF.show(false)
This should give you:
+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|725153 |3.0 |2.0 |
|873008 |4.0 |1.0 |
|625109 |1.0 |0.0 |
+---------------+---------------+-----------------------+
I hope the answer is helpful.
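As a quick sanity check (my addition, not part of the answer), the regex `[+._, ]+` can be exercised on the example headers as plain strings, without a Spark session. Note that the character class also matches `_`, which is simply replaced with `_` again, so existing underscores survive:

```scala
// Apply the answer's regex to the sample headers as plain strings:
val regex = """[+._, ]+"""
val headers = Seq("Main_CustomerID", "126+ Concentrate", "2.5 Ethylhexyl_Acrylate")
// Adjacent special characters ("+ ") collapse into a single "_"
// because of the trailing "+" quantifier in the pattern:
val renamed = headers.map(regex.r.replaceAllIn(_, "_"))
```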
In Java, you can iterate over the column names with df.columns(), fix each header with String.replaceAll(regexPattern, intendedReplacement), and then rename the df headers with withColumnRenamed(headerName, correctedHeaderName). For example:
for (String headerName : dataset.columns()) {
  // Note: "+" must be escaped as "\\+" -- a bare "+" is a regex
  // quantifier and throws PatternSyntaxException. Dropping it (and
  // replacing spaces with "_") matches the expected headers above.
  String correctedHeaderName = headerName.replaceAll("\\+", "").replaceAll(" ", "_");
  dataset = dataset.withColumnRenamed(headerName, correctedHeaderName);
}
dataset.show();
Piggybacking on Ramesh's answer, here is a reusable function using currying syntax and the .transform() method, which also lower-cases the column names:
// Format all column names with a regex, producing lower-case names
def formatAllColumns(regex_string: String)(df: DataFrame): DataFrame = {
  val replacingColumns = df.columns.map(regex_string.r.replaceAllIn(_, "_"))
  val resultDF: DataFrame = replacingColumns.zip(df.columns).foldLeft(df) {
    (tempdf, name) => tempdf.withColumnRenamed(name._2, name._1.toLowerCase())
  }
  resultDF
}

val resultDF = df.transform(formatAllColumns(regex_string = """[+._(), ]+"""))
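The string side of this transform (regex replacement plus `toLowerCase`) can be checked on the sample headers alone; a sketch assuming the same regex string as above, with no Spark session needed:

```scala
// Reproduce the rename logic of formatAllColumns on plain strings:
val regexString = """[+._(), ]+"""
val lowered = Seq("126+ Concentrate", "2.5 Ethylhexyl_Acrylate")
  .map(n => regexString.r.replaceAllIn(n, "_").toLowerCase)
```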
After replacing the special characters with replaceAll, we can map each column_name to its new name in a single select; this one-liner is tried and tested with Spark Scala:
import org.apache.spark.sql.functions.col

df.select(
  df.columns
    .map(colName => col(s"`${colName}`").as(
      colName.replaceAll("\\.", "_").replaceAll(" ", "_"))): _*
).show(false)
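One more option not shown in the answers: `DataFrame.toDF(names: _*)` replaces every column name in a single call, so only the cleaned name list has to be built. A sketch (the name-cleaning part runs as plain Scala; the `toDF` call assumes a DataFrame `df` like the one in the question):

```scala
// Build the cleaned header list once...
def toSnake(name: String): String =
  name.replaceAll("\\.", "_").replaceAll("\\+", "").replaceAll(" ", "_")

val cleanedNames =
  Seq("Main_CustomerID", "126+ Concentrate", "2.5 Ethylhexyl_Acrylate").map(toSnake)

// ...then rename all columns at once:
// val renamed = df.toDF(cleanedNames: _*)
```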