
How to create a count of nested JSON objects in a DataFrame row using Spark/Scala

I have a column full of JSON object strings that look like this:

"steps":{
    "step_1":{
        "conditions":{
            "complete_by":"2022-05-17",
            "requirement":100
        },
        "status":"eligible",
        "type":"buy"
    },
    "step_2":{
        "conditions":{
            "complete_by":"2022-05-27",
            "requirement":100
        },
        "status":"eligible",
        "type":"buy"
    }
}

In the steps object, there can be any number of steps (within reason).

My question is: how would I create another DataFrame column that counts the number of steps in each row's JSON string?

I'm using Spark/Scala, so I created a UDF with the below:

def jsonCount (col):

val jsonCountUDF = udf(jsonCount)

val stepDF = stepData.withColumn("NumberOfSteps", jsonCountUDF(col("steps")))

This is where I get stuck. I'd like to loop through each row in the steps column and count the step objects inside the steps JSON string. Does anyone have experience with a similar task, or know of a function that simplifies this?

import json

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# make some data
json_str = "{\"steps\":{ \"step_1\":{\"conditions\":{ \"complete_by\":\"2022-05-17\", \"requirement\":100} }  , \"step_2\":{  \"status\":\"eligible\", \"type\":\"buy\"   }  }}"

# implement a function to return the count
def jsonCount(jsonString):
    json_obj = json.loads(jsonString)
    return len(json_obj["steps"])

# define the udf
JSONCount = udf(jsonCount, IntegerType())

# create sample dataframe (assumes an active SparkSession named `spark`)
df = spark.createDataFrame([[json_str]], ["json"])

# run udf on dataframe
df.select(df.json, JSONCount(df.json).alias("StepCount")).show()

+--------------------+---------+
|                json|StepCount|
+--------------------+---------+
|{"steps":{ "step_...|        2|
+--------------------+---------+
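
The counting function above raises an exception if a row is null, is not valid JSON, or has no "steps" key. Here is a minimal sketch of a more forgiving version, in plain Python with no Spark needed. The name count_steps and the choice to return None for anything that cannot be counted are my own assumptions, not part of the answer above:

```python
import json
from typing import Optional


def count_steps(json_string: Optional[str], key: str = "steps") -> Optional[int]:
    """Return the number of entries under `key`, or None if it cannot be determined."""
    if json_string is None:
        return None
    try:
        parsed = json.loads(json_string)
    except (TypeError, ValueError):
        # Malformed JSON: report "unknown" instead of raising.
        return None
    if not isinstance(parsed, dict):
        return None
    steps = parsed.get(key)
    # Only a JSON object has countable step entries.
    return len(steps) if isinstance(steps, dict) else None


sample = '{"steps": {"step_1": {"status": "eligible"}, "step_2": {"type": "buy"}}}'
print(count_steps(sample))  # 2
```

It should be possible to register this the same way as above, e.g. udf(count_steps, IntegerType()), in which case a None return value would show up as null in the result column.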

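You can try selecting that sub-struct and then taking the number of columns it expands to (this assumes steps is already a struct column, not a raw string):

  import spark.implicits._
  import org.apache.spark.sql.functions.lit

  val stepSize = df.select($"steps.*").columns.size

Then add it to your df:

  val df_steps = df.withColumn("NumberOfSteps", lit(stepSize))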

Edit: Don't use a UDF for that purpose ...
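
One thing to keep in mind with the sub-struct approach: columns.size is computed from the schema, so every row gets the same number. When Spark infers a struct from JSON it merges the fields seen across records, so that number would be the count of distinct step names in the whole column, not the count in each row. The plain-Python sketch below mimics that difference. It does not use Spark, and the sample rows and helper names are made up for illustration:

```python
import json

# Hypothetical column values: the second row has one more step than the first.
rows = [
    '{"steps": {"step_1": {"type": "buy"}, "step_2": {"type": "buy"}}}',
    '{"steps": {"step_1": {"type": "buy"}, "step_2": {"type": "buy"}, "step_3": {"type": "buy"}}}',
]


def per_row_counts(json_rows):
    """Count the step entries in each row separately."""
    return [len(json.loads(r)["steps"]) for r in json_rows]


def schema_level_count(json_rows):
    """Count distinct step names across all rows, as a merged struct schema would."""
    names = set()
    for r in json_rows:
        names.update(json.loads(r)["steps"].keys())
    return len(names)


print(per_row_counts(rows))      # [2, 3]
print(schema_level_count(rows))  # 3
```

If the rows can have different numbers of steps, the schema-level number only matches the largest case, so a per-row count is needed.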
