Spark - Remove special characters from rows of a DataFrame with different column types

Question

Suppose I have a DataFrame with many columns, some of type string, some of type int, and some of type map.

For example, the field/column types are: stringType|intType|mapType<string,int>|...

|--------------------------------------------------------------------------
|  myString1      |myInt1|  myMap1                                              |...
|--------------------------------------------------------------------------
|"this_is_#string"| 123 |{"str11_in#map":1,"str21_in#map":2, "str31_in#map": 31}|...
|"this_is_#string"| 456 |{"str12_in#map":1,"str22_in#map":2, "str32_in#map": 32}|...
|"this_is_#string"| 789 |{"str13_in#map":1,"str23_in#map":2, "str33_in#map": 33}|...
|--------------------------------------------------------------------------

I want to remove certain characters, such as '_' and '#', from all columns of String and Map type, so that the resulting DataFrame/RDD is:

|---------------------------------------------------------------------
|  myString1   |myInt1|  myMap1                                         |...
|---------------------------------------------------------------------
|"thisisstring"| 123 |{"str11inmap":1,"str21inmap":2, "str31inmap": 31}|...
|"thisisstring"| 456 |{"str12inmap":1,"str22inmap":2, "str32inmap": 32}|...
|"thisisstring"| 789 |{"str13inmap":1,"str23inmap":2, "str33inmap": 33}|...
|---------------------------------------------------------------------

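I'm not sure whether it is better to convert the DataFrame to an RDD and work with that, or to do the work on the DataFrame itself.

I'm also not sure how best to apply the regexp across different column types (I'm using Scala). I'd like to do this for every column of these two types (string and map) while avoiding hard-coded column names like the following: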

def cleanRows(mytabledata: DataFrame): RDD[String] = {

//this will do the work for a specific column (myString1) of type string
val oneColumn_clean = mytabledata.withColumn("myString1", regexp_replace(col("myString1"),"[_#]",""))

       ...
//return type can be RDD or Dataframe...
}

Is there a simple solution for this? Thanks

Answer 1

One option is to define two UDFs, one for string-typed columns and one for Map-typed columns:

import org.apache.spark.sql.functions.udf
// toDF and the $"..." syntax come from spark.implicits._ (already imported in spark-shell)
val df = Seq(("this_is#string", 3, Map("str1_in#map" -> 3))).toDF("myString", "myInt", "myMap")
df.show
+--------------+-----+--------------------+
|      myString|myInt|               myMap|
+--------------+-----+--------------------+
|this_is#string|    3|Map(str1_in#map -...|
+--------------+-----+--------------------+

1) A UDF to handle string-typed columns:

def remove_string: String => String = _.replaceAll("[_#]", "")
def remove_string_udf = udf(remove_string)

2) A UDF to handle Map-typed columns:

def remove_map: Map[String, Int] => Map[String, Int] = _.map{ case (k, v) => k.replaceAll("[_#]", "") -> v }
def remove_map_udf = udf(remove_map)

3) Apply the UDFs to the corresponding columns to clean them:

df.withColumn("myString", remove_string_udf($"myString")).
   withColumn("myMap", remove_map_udf($"myMap")).show

+------------+-----+-------------------+
|    myString|myInt|              myMap|
+------------+-----+-------------------+
|thisisstring|    3|Map(str1inmap -> 3)|
+------------+-----+-------------------+
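
The question also asks how to avoid naming each column. A minimal sketch of one way to do that, not part of the original answer, is to walk the DataFrame's schema and pick the cleaning step by column type. The names `CleanAllColumns`, `cleanAll`, `cleanMapKeys` and `pattern` are hypothetical, and the map branch assumes the map columns are `Map[String, Int]` with no null values, as in the example data. String columns use the built-in `regexp_replace` from the question instead of a UDF.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace, udf}
import org.apache.spark.sql.types.{IntegerType, MapType, StringType}

object CleanAllColumns {
  // Same character class as in the question and the answer.
  val pattern = "[_#]"

  // Strips the characters from every map key; a null map stays null.
  // Assumption: keys are strings and values are non-null ints.
  val cleanMapKeys = udf { (m: Map[String, Int]) =>
    if (m == null) null
    else m.map { case (k, v) => k.replaceAll(pattern, "") -> v }
  }

  // Folds over the schema and rewrites only the columns whose type matches.
  def cleanAll(df: DataFrame): DataFrame =
    df.schema.fields.foldLeft(df) { (acc, field) =>
      field.dataType match {
        case StringType =>
          acc.withColumn(field.name, regexp_replace(col(field.name), pattern, ""))
        case MapType(StringType, IntegerType, _) =>
          acc.withColumn(field.name, cleanMapKeys(col(field.name)))
        case _ =>
          acc // every other column type is left untouched
      }
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("clean-all-columns")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("this_is#string", 3, Map("str1_in#map" -> 3)))
      .toDF("myString", "myInt", "myMap")

    cleanAll(df).show(false)
    spark.stop()
  }
}
```

Two caveats with this sketch. Removing characters from map keys can make two distinct keys identical (for example "a_b" and "ab"), in which case only one entry survives. Maps with other value types would need their own `case` and their own UDF. On Spark 3.0 or later, the built-in `transform_keys` function may be an alternative to the map UDF.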

Solution 1
4 2017-03-16 16:55:13
