Scala: transform and split a String column into a MapType column in a DataFrame
I have a DataFrame with a column containing tracking-request URLs with fields embedded in them, which looks like this:
df.show(truncate = false)
+---------------------------------------------------------------------------+
| request_uri
+---------------------------------------------------------------------------+
| /i?aid=fptplay&ast=1582163970763&av=4.6.1&did=83295772a8fee349 ...
| /i?p=fplay-ottbox-2019&av=2.0.18&nt=wifi&ov=9&tv=1.0.0&tz=GMT%2B07%3A00 ...
| ...
I need to transform this column into something that looks like this:
df.show(truncate = false)
+---------------------------------------------------------------------------+
| request_uri
+---------------------------------------------------------------------------+
| (aid -> fptplay, ast -> 1582163970763, tz -> [timezone datatype], nt -> wifi, ...)
| (p -> fplay-ottbox-2019, av -> 2.0.18, ov -> 9, tv -> 1.0.0, ...)
| ...
Basically I have to split the field names (delimiter = "&") and their values into a MapType of some sort, and add that to the column.
Can someone give me pointers on how to write a custom function that splits the string column into a MapType column? I'm told to use withColumn() and mapPartition, but I don't know how to implement it in a way that splits the strings and casts them to MapType.
Any help, even minimal, would be heartily appreciated. I'm completely new to Scala and have been stuck on this for a week.
The solution is to use UserDefinedFunctions.
Let's take the problem one step at a time.
// We need a function which converts strings into maps
// based on the format of request uris
def requestUriToMap(s: String): Map[String, String] = {
  s.stripPrefix("/i?").split("&").map { elem =>
    val pair = elem.split("=")
    (pair(0), pair(1)) // evaluate each element to a (key, value) tuple
  }.toMap
}
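As a quick sanity check outside Spark, the parser can be exercised on a shortened, hypothetical URI (the function is repeated here so the snippet runs standalone, without a SparkSession):

```scala
// Same parsing helper as above, runnable in a plain Scala REPL
def requestUriToMap(s: String): Map[String, String] =
  s.stripPrefix("/i?").split("&").map { elem =>
    val pair = elem.split("=")
    (pair(0), pair(1))
  }.toMap

println(requestUriToMap("/i?aid=fptplay&av=4.6.1"))
// Map(aid -> fptplay, av -> 4.6.1)
```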
// Now we convert this function into a UserDefinedFunction.
import org.apache.spark.sql.functions.{col, udf}
// Given a request uri string, convert it to a map; the correct format is assumed.
val requestUriToMapUdf = udf((requestUri: String) => requestUriToMap(requestUri))
Now, we test.
// Test data (import spark.implicits._ is required for toDF)
val df = Seq(
  "/i?aid=fptplay&ast=1582163970763&av=4.6.1&did=83295772a8fee349",
  "/i?p=fplay-ottbox-2019&av=2.0.18&nt=wifi&ov=9&tv=1.0.0&tz=GMT%2B07%3A00"
).toDF("request_uri")
df.show(false)
//+-----------------------------------------------------------------------+
//|request_uri |
//+-----------------------------------------------------------------------+
//|/i?aid=fptplay&ast=1582163970763&av=4.6.1&did=83295772a8fee349 |
//|/i?p=fplay-ottbox-2019&av=2.0.18&nt=wifi&ov=9&tv=1.0.0&tz=GMT%2B07%3A00|
//+-----------------------------------------------------------------------+
// Now we execute our UDF to create a column; reusing the same name replaces the existing column
val mappedDf = df.withColumn("request_uri", requestUriToMapUdf(col("request_uri")))
mappedDf.show(false)
//+---------------------------------------------------------------------------------------------+
//|request_uri |
//+---------------------------------------------------------------------------------------------+
//|[aid -> fptplay, ast -> 1582163970763, av -> 4.6.1, did -> 83295772a8fee349] |
//|[av -> 2.0.18, ov -> 9, tz -> GMT%2B07%3A00, tv -> 1.0.0, p -> fplay-ottbox-2019, nt -> wifi]|
//+---------------------------------------------------------------------------------------------+
mappedDf.printSchema
//root
// |-- request_uri: map (nullable = true)
// | |-- key: string
// | |-- value: string (valueContainsNull = true)
mappedDf.schema
//org.apache.spark.sql.types.StructType = StructType(StructField(request_uri,MapType(StringType,StringType,true),true))
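Note that values such as tz are still percent-encoded (GMT%2B07%3A00). If you want them decoded as well, the standard library's java.net.URLDecoder can be applied to each value. A sketch (the name requestUriToDecodedMap is invented here, and UTF-8 is assumed):

```scala
import java.net.URLDecoder

// Variant of the parser that also percent-decodes each value
def requestUriToDecodedMap(s: String): Map[String, String] =
  s.stripPrefix("/i?").split("&").map { elem =>
    val pair = elem.split("=")
    (pair(0), URLDecoder.decode(pair(1), "UTF-8"))
  }.toMap

println(requestUriToDecodedMap("/i?tz=GMT%2B07%3A00&nt=wifi"))
// Map(tz -> GMT+07:00, nt -> wifi)
```

Be aware that URLDecoder also turns bare + signs into spaces, which is the right behavior for form-encoded query strings but worth keeping in mind.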
And that's what you wanted.
Alternative: If you're not sure your string conforms, you can try a different variation of the function which succeeds even if the string doesn't conform to the assumed format (doesn't contain = or the input is an empty string).
def requestUriToMapImproved(s: String): Map[String, String] = {
  s.stripPrefix("/i?").split("&").map { elem =>
    val pair = elem.split("=")
    pair.length match {
      case 0 => ("", "")           // string produces an empty array, e.g. "=".split("=") == Array()
      case 1 => (pair(0), "")      // string contains no '=', e.g. "potato".split("=") == Array("potato")
      case _ => (pair(0), pair(1)) // normal case, e.g. "potato=masher".split("=") == Array("potato", "masher")
    }
  }.toMap
}
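To see the difference, the hardened variant can be exercised on non-conforming inputs in plain Scala (again repeating the definition so the snippet is standalone):

```scala
// Same hardened parser as above, runnable without Spark
def requestUriToMapImproved(s: String): Map[String, String] =
  s.stripPrefix("/i?").split("&").map { elem =>
    val pair = elem.split("=")
    pair.length match {
      case 0 => ("", "")
      case 1 => (pair(0), "")
      case _ => (pair(0), pair(1))
    }
  }.toMap

println(requestUriToMapImproved("potato"))         // Map(potato -> )
println(requestUriToMapImproved("/i?aid=fptplay")) // Map(aid -> fptplay)
```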
The following code performs a two-phase split process. Because the URIs don't have a fixed structure, you can do it with a UDF like:
import scala.collection.mutable
import org.apache.spark.sql.functions.{split, udf}

val keys = List("aid", "p", "ast", "av", "did", "nt", "ov", "tv", "tz")

def convertToMap(keys: List[String]) = udf { (in: mutable.WrappedArray[String]) =>
  in.foldLeft[Map[String, String]](Map()) { (a, str) =>
    keys.flatMap { key =>
      val regex = s"""${key}="""
      val arr = str.split(regex)
      val value = if (arr.length == 2) arr(1) else ""
      if (value.nonEmpty) a + (key -> value) else a
    }.toMap
  }
}
df.withColumn("_tmp", split($"request_uri", """((&)|(\?))"""))
  .withColumn("map_result", convertToMap(keys)($"_tmp"))
  .select($"map_result")
  .show(false)
This gives a MapType column:
+------------------------------------------------------------------------------------------------+
|map_result |
+------------------------------------------------------------------------------------------------+
|Map(aid -> fptplay, ast -> 1582163970763, av -> 4.6.1, did -> 83295772a8fee349) |
|Map(av -> 2.0.18, ov -> 9, tz -> GMT%2B07%3A00, tv -> 1.0.0, p -> fplay-ottbox-2019, nt -> wifi)|
+------------------------------------------------------------------------------------------------+
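If you want to check the fold inside the UDF without spinning up Spark, the same logic can be written against a plain Seq[String] of the fragments the split produces (the helper name extractKeys is invented for this sketch):

```scala
val keys = List("aid", "p", "ast", "av", "did", "nt", "ov", "tv", "tz")

// Same fold as inside the UDF, but over a plain Seq so it runs without Spark
def extractKeys(keys: List[String], parts: Seq[String]): Map[String, String] =
  parts.foldLeft[Map[String, String]](Map()) { (a, str) =>
    keys.flatMap { key =>
      val arr = str.split(s"""${key}=""")
      val value = if (arr.length == 2) arr(1) else ""
      if (value.nonEmpty) a + (key -> value) else a
    }.toMap
  }

println(extractKeys(keys, Seq("/i", "aid=fptplay", "av=4.6.1")))
// Map(aid -> fptplay, av -> 4.6.1)
```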