
General processing on ArrayType, MapType, and StructType fields of a Spark DataFrame in a Scala UDF?

How do I perform general processing on a Spark StructType in Scala, such as choosing a field by name or iterating over a map/array field?

In a Spark DataFrame, I have a column "instances" of type ArrayType with the following schema:

instances[ArrayType]:
    0 [ StructType:
            name [StringType]
            address[StringType]
            experiences[MapType]:
                Company-1[StringType]:
                    StructType:
                        numYears[IntType]: 5
                        grade[IntType]
                Company-2[StringType]:
                    StructType:
                        numYears[IntType]:  12
                        grade[IntType]]
    1 [ StructType:
            name [StringType]
            address[StringType]
            experiences[MapType]:
                Company-1[StringType]:
                    StructType:
                        numYears[IntType]: 3
                        grade[IntType]
                Company-2[StringType]:
                    StructType:
                        numYears[IntType]:  9
                        grade[IntType]]

I need to convert this ArrayType column "instances" into a derived column "totalExperience" of type MapType[StringType -> IntType]:
company-1: 8
company-2: 21

Note: (5 + 3 = 8 and 12 + 9 = 21)

Equivalent pseudo-code for this:

totalExperience = Map<String, Int>();
for (instance in instances) {
    for ((currentExperience, numYears) in instance.getExperiences().entries()) {
         if (!totalExperience.contains(currentExperience)) {
              totalExperience.put(currentExperience, 0);
         }

         totalExperience.put(currentExperience, totalExperience.get(currentExperience) + numYears);
    }
}

return totalExperience
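
In plain Scala collections this aggregation is straightforward. A minimal sketch (the Instance case class and its experiences: Map[String, Int] field are illustrative stand-ins for the schema above):

// Sum numYears per company across all instances.
case class Instance(experiences: Map[String, Int])

def totalExperience(instances: Seq[Instance]): Map[String, Int] =
  instances
    .flatMap(_.experiences.toSeq)              // Seq[(company, numYears)]
    .groupBy { case (company, _) => company }  // Map[company, Seq[(company, numYears)]]
    .map { case (company, pairs) => company -> pairs.map(_._2).sum }

The difficulty is expressing the same thing inside a Spark UDF, where the nested fields arrive as Spark's runtime representations rather than case classes.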

I have written a UDF for this as follows, but I could not find a way to implement the above pseudo-code in Scala/Spark:

  private val computeTotalExperience = udf((instances: Seq[Row]) => {
    val totalExperienceByCompany = scala.collection.mutable.Map.empty[String, Long]

    // How to iterate over "instances" when it arrives as ArrayType?
    for (instance <- instances) {
      // How to access and iterate over the "experiences" MapType field on instance?
      // Populate totalExperienceByCompany with keys like "Company-1"
    }

    totalExperienceByCompany.toMap
  })
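
(For reference, Spark hands an ArrayType of structs to a Scala UDF as Seq[Row], and a MapType field can be read with Row.getAs. A minimal, untested sketch of the completed UDF against the schema above, with null handling omitted:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val computeTotalExperience = udf((instances: Seq[Row]) => {
  val total = scala.collection.mutable.Map.empty[String, Long]
  for (instance <- instances) {
    // The "experiences" MapType field arrives as a Map whose values are Rows.
    val experiences = instance.getAs[Map[String, Row]]("experiences")
    for ((company, exp) <- experiences) {
      total(company) = total.getOrElse(company, 0L) + exp.getAs[Int]("numYears")
    }
  }
  total.toMap // returned to Spark as MapType(StringType, LongType)
})

// Usage: df.withColumn("totalExperience", computeTotalExperience($"instances"))
)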

How can I perform this kind of general processing on ArrayType, MapType, and StructType fields of a Spark DataFrame in a Scala UDF?

Check the code below.

scala> df.printSchema
root
 |-- instances: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- address: string (nullable = true)
 |    |    |-- experiences: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: struct (valueContainsNull = true)
 |    |    |    |    |-- numYears: integer (nullable = true)
 |    |    |    |    |-- grade: string (nullable = true)
 |    |    |-- name: string (nullable = true)
scala> df.show(false)
+-----------------------------------------------------------------------------------------------------------------------------------+
|instances                                                                                                                          |
+-----------------------------------------------------------------------------------------------------------------------------------+
|[[address_0, [Company-1 -> [5, 1], Company-2 -> [12, 1]], name_0], [address_1, [Company-1 -> [3, 1], Company-2 -> [9, 1]], name_1]]|
+-----------------------------------------------------------------------------------------------------------------------------------+
scala> 
val expr = array(
    struct(lit("company-1").as("company"),$"instance.experiences.Company-1.numYears"),
    struct(lit("company-2").as("company"),$"instance.experiences.Company-2.numYears")
)
       
scala>  

df
.withColumn("instance",explode($"instances"))
.withColumn("company",explode(expr))
.select("company.*")
.groupBy($"company")
.agg(sum($"numYears").as("numYears"))
.select(map($"company",$"numYears").as("totalExperience"))
.show(false) 
                                                                                                                                                       
+-----------------+
|totalExperience  |
+-----------------+
|[company-1 -> 8] |
|[company-2 -> 21]|
+-----------------+
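
This yields one row per company, each holding a single-entry map. If a single MapType value containing both companies is needed instead, the per-company sums can be folded back into one map, for example with map_from_entries (Spark 2.4+); a sketch:

df
  .withColumn("instance", explode($"instances"))
  .withColumn("company", explode(expr))
  .select("company.*")
  .groupBy($"company")
  .agg(sum($"numYears").as("numYears"))
  .agg(map_from_entries(collect_list(struct($"company", $"numYears"))).as("totalExperience")) // global agg: one row, one map
  .show(false)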
                     
