
Parse a JSON column in a Spark DataFrame

Input:

caseid  object_value
1       [{'dummyAcc':'12346','accountRequest':{'schemeCode':'ZEROQ1', 'CCZ':'SGD'}}]
2       [{'dummyAcc':'12347','accountRequest':{'schemeCode':'ZEROQ2', 'CCZ':'SGD'}}]
3       [{'dummyAcc':'12348','accountRequest':{'schemeCode':'ZEROQ5', 'CCZ':'SGD'}}]
4       [{'dummyAcc':'12349','accountRequest':{'schemeCode':'ZEROQ', 'CCZ':'SGD'}}]
5       [{'dummyAcc':'12350','accountRequest':{'schemeCode':'ZEROQ', 'CCZ':'SGD'}}]

Output:

caseid  schemeCode  CCZ
1       ZEROQ1      SGD
2       ZEROQ2      SGD
3       ZEROQ5      SGD
4       ZEROQ       SGD
5       ZEROQ       SGD

Kindly guide me in achieving this output in Spark. I am able to do this in Python using a small data sample, but I need to do it in Spark because of the data volume in production. Thanks in advance.

To extract JSON-like data, use the function from_json . It requires a schema as input. Also, your JSON is malformed (it uses single quotes instead of double quotes), so you need to add the option {"allowSingleQuotes": "true"} .

from pyspark.sql import functions as F, types as T

schm = T.StructType(
    [
        T.StructField("dummyAcc", T.StringType()),
        T.StructField(
            "accountRequest",
            T.StructType(
                [
                    T.StructField("schemeCode", T.StringType()),
                    T.StructField("CCZ", T.StringType()),
                ]
            ),
        ),
    ]
)

df.withColumn(
    "object_value",
    F.from_json("object_value", schm, options={"allowSingleQuotes": "true"}),
).select(
    "caseid",
    "object_value.accountRequest.schemeCode",
    "object_value.accountRequest.CCZ",
).show()

+------+----------+---+                                                         
|caseid|schemeCode|CCZ|
+------+----------+---+
|     1|    ZEROQ1|SGD|
|     2|    ZEROQ2|SGD|
|     3|    ZEROQ5|SGD|
|     4|     ZEROQ|SGD|
|     5|     ZEROQ|SGD|
+------+----------+---+

You might use get_json_object ; it's straightforward.

import pyspark.sql.functions as f

df = spark.createDataFrame([
  [1, """[{'dummyAcc':'12346','accountRequest':{'schemeCode':'ZEROQ1', 'CCZ':'SGD'}}]"""],
  [2, """[{'dummyAcc':'12347','accountRequest':{'schemeCode':'ZEROQ2', 'CCZ':'SGD'}}]"""],
  [3, """[{'dummyAcc':'12348','accountRequest':{'schemeCode':'ZEROQ5', 'CCZ':'SGD'}}]"""],
  [4, """[{'dummyAcc':'12349','accountRequest':{'schemeCode':'ZEROQ', 'CCZ':'SGD'}}]"""],
  [5, """[{'dummyAcc':'12350','accountRequest':{'schemeCode':'ZEROQ', 'CCZ':'SGD'}}]"""]
], schema='caseid int, object_value string')

final_df = (df
            .select('caseid', 
                    f.get_json_object('object_value', '$[*].accountRequest.schemeCode').alias('schemeCode'),
                    f.get_json_object('object_value', '$[*].accountRequest.CCZ').alias('CCZ')))

final_df.show(truncate=False)
# +------+----------+-----+
# |caseid|schemeCode|CCZ  |
# +------+----------+-----+
# |1     |"ZEROQ1"  |"SGD"|
# |2     |"ZEROQ2"  |"SGD"|
# |3     |"ZEROQ5"  |"SGD"|
# |4     |"ZEROQ"   |"SGD"|
# |5     |"ZEROQ"   |"SGD"|
# +------+----------+-----+

So a coworker once told me that regexp_extract is faster than parsing the JSONs, and I've always believed that... until today, when I decided to run some timing experiments comparing it to the two other solutions posted here, which use get_json_object and from_json .

The short answer is that all three perform comparably, even when we complicate the JSONs by adding thousands of extra K:V pairs. The regexp_extract method is actually consistently a bit slower in these tests.

Setup: proving each method works

import json
import time

import pandas as pd

import pyspark.sql.functions as fun
import pyspark.sql.types as t

case_ids = range(1,6)
data =  [
  '{"dummyAcc":"12346","accountRequest":{"schemeCode":"ZEROQ1", "CCZ":"SGD"}}',
  '{"dummyAcc":"12347","accountRequest":{"schemeCode":"ZEROQ2", "CCZ":"SGD"}}',
  '{"dummyAcc":"12348","accountRequest":{"schemeCode":"ZEROQ5", "CCZ":"SGD"}}',
  '{"dummyAcc":"12349","accountRequest":{"schemeCode":"ZEROQ", "CCZ":"SGD"}}',
  '{"dummyAcc":"12350","accountRequest":{"schemeCode":"ZEROQ", "CCZ":"SGD"}}'
]

df = spark.createDataFrame(pd.DataFrame({"caseid": case_ids, "object_value": data}))

##
# fun.from_json
##
schm = t.StructType(
    [
        t.StructField("dummyAcc", t.StringType()),
        t.StructField(
            "accountRequest",
            t.StructType(
                [
                    t.StructField("schemeCode", t.StringType()),
                    t.StructField("CCZ", t.StringType()),
                ]
            ),
        ),
    ]
)

def run_from_json(df):
  return df.withColumn("object_value", fun.from_json("object_value", schm, options={"allowSingleQuotes": "true"}))\
          .select(
            "caseid",
            "object_value.accountRequest.schemeCode",
            "object_value.accountRequest.CCZ",
        )

##
# get_json
##

def run_get_json(df):
  return df.select('caseid', 
                    fun.get_json_object('object_value', '$.accountRequest.schemeCode').alias('schemeCode'),
                    fun.get_json_object('object_value', '$.accountRequest.CCZ').alias('CCZ'))


##
# regexp_extract
##

def run_regexp_extract(df):
  # Raw strings keep \w from being treated as a (deprecated) string escape
  return df.withColumn("schemeCode", fun.regexp_extract(fun.col("object_value"), r'(.)("schemeCode":")(\w+)', 3))\
    .withColumn("CCZ", fun.regexp_extract(fun.col("object_value"), r'(.)("CCZ":")(\w+)', 3))\
    .select("caseid", "schemeCode", "CCZ")

##
# Test them out
##

print("from_json")
run_from_json(df).show(truncate=False)

print("get_json")
run_get_json(df).show(truncate=False)

print("regexp_extract")
run_regexp_extract(df).show(truncate=False)


from_json
+------+----------+---+
|caseid|schemeCode|CCZ|
+------+----------+---+
|1     |ZEROQ1    |SGD|
|2     |ZEROQ2    |SGD|
|3     |ZEROQ5    |SGD|
|4     |ZEROQ     |SGD|
|5     |ZEROQ     |SGD|
+------+----------+---+

get_json
+------+----------+---+
|caseid|schemeCode|CCZ|
+------+----------+---+
|1     |ZEROQ1    |SGD|
|2     |ZEROQ2    |SGD|
|3     |ZEROQ5    |SGD|
|4     |ZEROQ     |SGD|
|5     |ZEROQ     |SGD|
+------+----------+---+


regexp_extract
+------+----------+---+
|caseid|schemeCode|CCZ|
+------+----------+---+
|1     |ZEROQ1    |SGD|
|2     |ZEROQ2    |SGD|
|3     |ZEROQ5    |SGD|
|4     |ZEROQ     |SGD|
|5     |ZEROQ     |SGD|
+------+----------+---+

Timing Part 1 -- Using Short JSONs

I checked the wall-clock time of running multiple iterations using the default compact JSONs defined above.

def time_run_method(df, n_it, meth, meth_name):
  t0 = time.time()
  for i in range(n_it):
    meth(df).count()
  td = time.time() - t0
  print(meth_name)
  print("Time to count %d iterations: %s [sec]" % (n_it, "{:,}".format(td)))
  
for m, n in zip([run_from_json, run_get_json, run_regexp_extract], ["from_json", "get_json", "regexp_extract"]):
  time_run_method(df, 200, m, n)


from_json
Time to count 200 iterations: 15.918861389160156 [sec]

get_json
Time to count 200 iterations: 15.668830871582031 [sec]

regexp_extract
Time to count 200 iterations: 17.539576292037964 [sec]

Timing Part 2 -- Using Long JSONs

I added two thousand key-value pairs to the JSONs to see whether the extra overhead of deserializing them would change things. It did not. Perhaps this structure is too simple and the internal parsers can simply skip the extra keys, or perhaps they just don't add much overhead given how flat the structure is. I don't know.

cruft = json.dumps({k:v for k,v in enumerate(range(2000))})

data = [
  '{ "cruft": %s, "dummyAcc":"12346","accountRequest":{"schemeCode":"ZEROQ1", "CCZ":"SGD"}}' % cruft,
  '{ "cruft": %s, "dummyAcc":"12347","accountRequest":{"schemeCode":"ZEROQ2", "CCZ":"SGD"}}' % cruft,
  '{ "cruft": %s, "dummyAcc":"12348","accountRequest":{"schemeCode":"ZEROQ5", "CCZ":"SGD"}}' % cruft,
  '{ "cruft": %s, "dummyAcc":"12349","accountRequest":{"schemeCode":"ZEROQ", "CCZ":"SGD"}}' % cruft,
  '{ "cruft": %s, "dummyAcc":"12350","accountRequest":{"schemeCode":"ZEROQ", "CCZ":"SGD"}}' % cruft
]

df2 = spark.createDataFrame(pd.DataFrame({"caseid": case_ids, "object_value": data}))

for m, n in zip([run_from_json, run_get_json, run_regexp_extract], ["from_json", "get_json", "regexp_extract"]):
  time_run_method(df2, 200, m, n)


from_json
Time to count 200 iterations: 16.005220413208008 [sec]

get_json
Time to count 200 iterations: 15.788024187088013 [sec]

regexp_extract
Time to count 200 iterations: 16.81353187561035 [sec]
