
Scala transform and split String Column to MapType Column in Dataframe

I have a DataFrame with a column containing tracking request URLs with fields inside, which looks like this:

df.show(truncate = false)
+--------------------------------
| request_uri
+-----------------------------------
| /i?aid=fptplay&ast=1582163970763&av=4.6.1&did=83295772a8fee349 ...
| /i?p=fplay-ottbox-2019&av=2.0.18&nt=wifi&ov=9&tv=1.0.0&tz=GMT%2B07%3A00 ...
| ...

I need to transform this column into something that looks like this:

df.show(truncate = false)
+--------------------------------
| request_uri
+--------------------------------
| (aid -> fptplay, ast -> 1582163970763, tz -> [timezone datatype], nt -> wifi , ...) 
| (p -> fplay-ottbox-2019, av -> 2.0.18, ov -> 9, tv -> 1.0.0 , ...) 
| ...

Basically, I have to split the field names (delimiter = "&") and their values into a MapType of some sort, and add that to the column.

Can someone give me pointers on how to write a custom function to split the string column into a MapType column? I was told to use withColumn() and mapPartition, but I don't know how to implement it in a way that splits the strings and casts them to MapType.

Any help, however minimal, would be heartily appreciated. I'm completely new to Scala and have been stuck on this for a week.

The solution is to use UserDefinedFunctions (UDFs).

Let's take the problem one step at a time.

// We need a function which converts strings into maps
// based on the format of request uris
def requestUriToMap(s: String): Map[String, String] = {
  s.stripPrefix("/i?").split("&").map(elem => {
    val pair = elem.split("=")     
    (pair(0), pair(1)) // evaluate each element to a tuple
  }).toMap
}

// Now we convert this function into a UserDefinedFunction.
import org.apache.spark.sql.functions.{col, udf}
// Given a request URI string, convert it to a map; the correct format is assumed.
val requestUriToMapUdf = udf((requestUri: String) => requestUriToMap(requestUri))
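
Before involving Spark, the plain function can be checked on its own in a Scala REPL. The sketch below repeats requestUriToMap so that it runs standalone, and also shows what "the correct format is assumed" means in practice: a token without "=" makes the call throw. The input "/i?potato" is a made-up malformed example, not taken from the question.

```scala
import scala.util.Try

// Same function as above, repeated so the snippet runs on its own
def requestUriToMap(s: String): Map[String, String] =
  s.stripPrefix("/i?").split("&").map { elem =>
    val pair = elem.split("=")
    (pair(0), pair(1)) // throws if the token has no "="
  }.toMap

// Well-formed input: every token is key=value
val ok = requestUriToMap("/i?aid=fptplay&ast=1582163970763&av=4.6.1")
println(ok("aid")) // fptplay

// Hypothetical malformed input: "potato" has no "=", so pair(1) does not exist
val bad = Try(requestUriToMap("/i?potato"))
println(bad.isFailure) // true
```

Inside a UDF, such an exception would fail the whole Spark job, which is why a more tolerant variant may be preferable when the data is not fully trusted.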

Now, we test.

// Test data (toDF needs import spark.implicits._, which spark-shell already provides)
val df = Seq(
  ("/i?aid=fptplay&ast=1582163970763&av=4.6.1&did=83295772a8fee349"),
  ("/i?p=fplay-ottbox-2019&av=2.0.18&nt=wifi&ov=9&tv=1.0.0&tz=GMT%2B07%3A00")
).toDF("request_uri")

df.show(false)
//+-----------------------------------------------------------------------+
//|request_uri                                                            |
//+-----------------------------------------------------------------------+
//|/i?aid=fptplay&ast=1582163970763&av=4.6.1&did=83295772a8fee349         |
//|/i?p=fplay-ottbox-2019&av=2.0.18&nt=wifi&ov=9&tv=1.0.0&tz=GMT%2B07%3A00|
//+-----------------------------------------------------------------------+

// Now we apply our UDF to create a column; reusing the same name replaces the existing column
val mappedDf = df.withColumn("request_uri", requestUriToMapUdf(col("request_uri")))
mappedDf.show(false)
//+---------------------------------------------------------------------------------------------+
//|request_uri                                                                                  |
//+---------------------------------------------------------------------------------------------+
//|[aid -> fptplay, ast -> 1582163970763, av -> 4.6.1, did -> 83295772a8fee349]                 |
//|[av -> 2.0.18, ov -> 9, tz -> GMT%2B07%3A00, tv -> 1.0.0, p -> fplay-ottbox-2019, nt -> wifi]|
//+---------------------------------------------------------------------------------------------+

mappedDf.printSchema
//root
// |-- request_uri: map (nullable = true)
// |    |-- key: string
// |    |-- value: string (valueContainsNull = true)

mappedDf.schema
//org.apache.spark.sql.types.StructType = StructType(StructField(request_uri,MapType(StringType,StringType,true),true))

And that's what you wanted.


Alternative: if you're not sure whether your strings conform, you can use a variation of the function that succeeds even when a string doesn't match the assumed format (a token doesn't contain = or the input is an empty String).

def requestUriToMapImproved(s: String): Map[String, String] = {
  s.stripPrefix("/i?").split("&").map(elem => {
    val pair = elem.split("=")
    pair.length match {
      case 0 => ("", "") // in case the given string produces an array with no elements e.g. "=".split("=") == Array() 
      case 1 => (pair(0), "") // in case the given string contains no = and produces a single element e.g. "potato".split("=") == Array("potato")
      case _ => (pair(0), pair(1)) // normal case e.g. "potato=masher".split("=") == Array("potato", "masher")
    }
  }).toMap
}
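
The values in the sample data are percent-encoded (for example tz=GMT%2B07%3A00), and the question asks for tz in a more readable form. As a possible extension, here is a minimal sketch of a tolerant variant that also decodes values. The name requestUriToDecodedMap is hypothetical, and it assumes the values are UTF-8 percent-encoded; it is one way to do it, not part of the answer above.

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets

// Hypothetical variant: tolerant parsing plus percent-decoding of values
def requestUriToDecodedMap(s: String): Map[String, String] =
  Option(s).getOrElse("") // treat a null input as an empty string
    .stripPrefix("/i?")
    .split("&")
    .filter(_.nonEmpty) // skip empty tokens, e.g. from "" or "a=1&&b=2"
    .map { elem =>
      // limit = 2 splits on the first "=" only, so later "=" stay in the value
      elem.split("=", 2) match {
        case Array(k, v) => k -> URLDecoder.decode(v, StandardCharsets.UTF_8.name())
        case Array(k)    => k -> "" // token without "="
      }
    }
    .toMap

println(requestUriToDecodedMap("/i?tv=1.0.0&tz=GMT%2B07%3A00")("tz")) // GMT+07:00
```

It can be wrapped with udf(...) in the same way as requestUriToMap. Note that URLDecoder.decode throws an IllegalArgumentException on a malformed escape such as a lone "%", and turns a literal "+" into a space, so whether decoding is appropriate depends on how the URLs were produced.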

The following code performs a two-phase split. Because the URIs don't have a specific structure, you can do it with a UDF like:

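import org.apache.spark.sql.functions.{split, udf}

val keys = List("aid", "p", "ast", "av", "did", "nt", "ov", "tv", "tz")

// Builds a UDF that turns an array of "key=value" tokens into a map,
// keeping only the listed keys. Spark passes array columns to Scala UDFs as Seq.
def convertToMap(keys: List[String]) = udf {
  (in: Seq[String]) =>
    in.foldLeft(Map.empty[String, String]) { (acc, str) =>
      keys.foldLeft(acc) { (m, key) =>
        // Splitting on "key=" yields two parts when the token carries that key
        val arr = str.split(s"$key=")
        if (arr.length == 2 && arr(1).nonEmpty) m + (key -> arr(1))
        else m
      }
    }
}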

df.withColumn("_tmp",
  split($"request_uri","""((&)|(\?))"""))
  .withColumn("map_result", convertToMap(keys)($"_tmp"))
  .select($"map_result")
  .show(false)

It gives a MapType column:

+------------------------------------------------------------------------------------------------+
|map_result                                                                                      |
+------------------------------------------------------------------------------------------------+
|Map(aid -> fptplay, ast -> 1582163970763, av -> 4.6.1, did -> 83295772a8fee349)                 |
|Map(av -> 2.0.18, ov -> 9, tz -> GMT%2B07%3A00, tv -> 1.0.0, p -> fplay-ottbox-2019, nt -> wifi)|
+------------------------------------------------------------------------------------------------+ 
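
To make the two phases easier to follow, here is a plain-Scala sketch of the same steps on a single string, without Spark. The input string and the shortened key list are made up for illustration (the unknown=1 field is not in the question's data); the regex is the one passed to split in the code above.

```scala
// Phase 1: tokenize on "&" or "?"
val tokens = "/i?aid=fptplay&av=4.6.1&unknown=1".split("""((&)|(\?))""").toSeq
println(tokens.mkString(", ")) // /i, aid=fptplay, av=4.6.1, unknown=1

// Phase 2: keep only the tokens whose key is in the list
val keys = List("aid", "av")
val result = tokens.foldLeft(Map.empty[String, String]) { (acc, str) =>
  keys.foldLeft(acc) { (m, key) =>
    val arr = str.split(s"$key=")
    if (arr.length == 2 && arr(1).nonEmpty) m + (key -> arr(1)) else m
  }
}
println(result("aid")) // fptplay
println(result.contains("unknown")) // false
```

One caveat of matching by splitting on "key=": a token whose name merely ends with a listed key (say xav=1 with key av) would also be picked up, so comparing the text before the first "=" against the key list may be safer if field names can overlap.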
