[英]Spark - remove special characters from rows Dataframe with different column types
假设我有一个包含许多列的数据框,其中有些是字符串类型,有些是int型,而有些是map型。
例如, 字段/列types: stringType|intType|mapType<string,int>|...
|--------------------------------------------------------------------------
| myString1 |myInt1| myMap1 |...
|--------------------------------------------------------------------------
|"this_is_#string"| 123 |{"str11_in#map":1,"str21_in#map":2, "str31_in#map": 31}|...
|"this_is_#string"| 456 |{"str12_in#map":1,"str22_in#map":2, "str32_in#map": 32}|...
|"this_is_#string"| 789 |{"str13_in#map":1,"str23_in#map":2, "str33_in#map": 33}|...
|--------------------------------------------------------------------------
我想从String和Map类型的所有列中删除一些字符,例如'_'和'#',因此结果 Dataframe / RDD将为:
|------------------------------------------------------------------------
|myString1 |myInt1| myMap1|... |
|------------------------------------------------------------------------
|"thisisstring"| 123 |{"str11inmap":1,"str21inmap":2, "str31inmap": 31}|...
|"thisisstring"| 456 |{"str12inmap":1,"str22inmap":2, "str32inmap": 32}|...
|"thisisstring"| 789 |{"str13inmap":1,"str23inmap":2, "str33inmap": 33}|...
|-------------------------------------------------------------------------
我不确定将Dataframe转换为RDD并使用它或在Dataframe中执行工作是否更好。
另外,不确定如何以最佳方式处理具有不同列类型的regexp(我正在唱scala )。 我想对这两种类型的所有列(字符串和映射)执行此操作,尝试避免使用类似以下的列名:
def cleanRows(mytabledata: DataFrame): RDD[String] = {
//this will do the work for a specific column (myString1) of type string
val oneColumn_clean = mytabledata.withColumn("myString1", regexp_replace(col("myString1"),"[_#]",""))
...
//return type can be RDD or Dataframe...
}
有没有简单的解决方案来执行此操作? 谢谢
一种选择是定义两个udf以分别处理字符串类型列和Map类型列:
import org.apache.spark.sql.functions.udf
val df = Seq(("this_is#string", 3, Map("str1_in#map" -> 3))).toDF("myString", "myInt", "myMap")
df.show
+--------------+-----+--------------------+
| myString|myInt| myMap|
+--------------+-----+--------------------+
|this_is#string| 3|Map(str1_in#map -...|
+--------------+-----+--------------------+
1)Udf处理字符串类型的列:
def remove_string: String => String = _.replaceAll("[_#]", "")
def remove_string_udf = udf(remove_string)
2)Udf处理Map类型的列:
def remove_map: Map[String, Int] => Map[String, Int] = _.map{ case (k, v) => k.replaceAll("[_#]", "") -> v }
def remove_map_udf = udf(remove_map)
3)将udfs应用于相应的列以进行清理:
df.withColumn("myString", remove_string_udf($"myString")).
withColumn("myMap", remove_map_udf($"myMap")).show
+------------+-----+-------------------+
| myString|myInt| myMap|
+------------+-----+-------------------+
|thisisstring| 3|Map(str1inmap -> 3)|
+------------+-----+-------------------+
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.