
PySpark - rename key names in JSON stored as string column in CSV file

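I want to update the key names in JSON that is stored as a string column, and save the result back as a string-type column. I am reading these columns from a CSV file and writing them back out as CSV.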

This is what my input CSV looks like.

candidate_email,transactions
cust2@email.com,"[{'transaction_id':'12', 'transaction_amount':'$23.43'},{'transaction_id':'15', 'transaction_amount':'$723.41'}]"
cust1@email.com,"[{'transaction_id':'10', 'transaction_amount':'$55.99'},{'transaction_id':'11', 'transaction_amount':'$20.46'},{'transaction_id':'13', 'transaction_amount':'$5.89'},{'transaction_id':'14', 'transaction_amount':'$35.61'}]"

I want to replace the key transaction_id with id and the key transaction_amount with amount in my JSON, then save it back as CSV.

input_df = spark.read.csv('transactions/*.csv', header=True, inferSchema=True)
input_df.printSchema()
# root
#  |-- candidate_email: string (nullable = true)
#  |-- transactions: string (nullable = true)

input_df.show(10, False)
# +-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |candidate_email|transactions                                                                                                                                                                                                                |
# +-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |cust2@email.com        |[{'transaction_id':'12', 'transaction_amount':'$23.43'},{'transaction_id':'15', 'transaction_amount':'$723.41'}]                                                                                                            |
# |cust1@email.com        |[{'transaction_id':'10', 'transaction_amount':'$55.99'},{'transaction_id':'11', 'transaction_amount':'$20.46'},{'transaction_id':'13', 'transaction_amount':'$5.89'},{'transaction_id':'14', 'transaction_amount':'$35.61'}]|
# +-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

How can I rename my keys to get the following output?

output_df.show(10,False)
# +---------------+----------------------------------------------------------------------------------------------------------------------------+
# |candidate_email|transactions                                                                                                                |
# +---------------+----------------------------------------------------------------------------------------------------------------------------+
# |cust1@email.com|[{'id':'10', 'amount':'$55.99'},{'id':'11', 'amount':'$20.46'},{'id':'13', 'amount':'$5.89'},{'id':'14', 'amount':'$35.61'}]|
# |cust2@email.com|[{'id':'12', 'amount':'$23.43'},{'id':'15', 'amount':'$723.41'}]                                                            |
# +---------------+----------------------------------------------------------------------------------------------------------------------------+

Note: both columns are string-type columns.

output_df.printSchema()
# root
#  |-- candidate_email: string (nullable = true)
#  |-- transactions: string (nullable = true)

Use from_json to read the transactions column as array(struct...), then cast it to the required field names.

  • Then explode + to_json + groupBy + collect_list to get the required JSON.

Example:

df.show()
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|candidate_email|transactions                                                                                                    |
#+---------------+----------------------------------------------------------------------------------------------------------------+
#|cust2@email.com|[{'transaction_id':'12', 'transaction_amount':'$23.43'},{'transaction_id':'15', 'transaction_amount':'$723.41'}]|
#+---------------+----------------------------------------------------------------------------------------------------------------+

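from pyspark.sql.functions import col, collect_list, from_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Schema of the JSON as it is stored in the input column
st=ArrayType(StructType([StructField("transaction_id", StringType()),StructField("transaction_amount", StringType())]))

# Helper columns to discard after exploding (assumed: drop_cols was not defined in the original snippet)
drop_cols = ["transactions", "jsn", "col"]

df.withColumn("jsn",from_json(col("transactions"),st).cast("array<struct<id:string,amount:string>>")).\
selectExpr("*","explode(jsn)").\
select("*","col.*").\
drop(*drop_cols).\
selectExpr("candidate_email","to_json(struct(id,amount)) as trans").\
groupBy("candidate_email").\
agg(collect_list("trans").alias("transactions")).\
show(10,False)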

#+---------------+---------------------------------------------------------------+
#|candidate_email|transactions                                                   |
#+---------------+---------------------------------------------------------------+
#|cust2@email.com|[{"id":"12","amount":"$23.43"}, {"id":"15","amount":"$723.41"}]|
#+---------------+---------------------------------------------------------------+
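
Note that the result above uses double quotes, and collect_list yields an array column rather than a single string. If the original single-quoted string layout matters, the renaming of one cell value can be sketched in plain Python, for example to check the Spark result by hand. This is only a sketch under assumptions: rename_keys and KEY_MAP are hypothetical names (not part of PySpark), and every value is assumed to be a plain string without embedded quotes. It uses ast.literal_eval rather than json.loads because single-quoted text is not valid JSON.

```python
import ast

# Hypothetical mapping of old key names to new ones, taken from the question.
KEY_MAP = {"transaction_id": "id", "transaction_amount": "amount"}


def rename_keys(raw, key_map=KEY_MAP):
    """Rename dict keys inside a single-quoted list-of-dicts string.

    Assumes `raw` is a Python-literal list of flat dicts whose keys and
    values are plain strings without embedded quotes.
    """
    records = ast.literal_eval(raw)  # evaluates literals only, never code
    renamed = [
        {key_map.get(key, key): value for key, value in record.items()}
        for record in records
    ]
    # Re-serialize in the same layout as the input: {'k':'v', 'k':'v'},{...}
    return "[" + ",".join(
        "{" + ", ".join(f"'{k}':'{v}'" for k, v in record.items()) + "}"
        for record in renamed
    ) + "]"


cell = "[{'transaction_id':'12', 'transaction_amount':'$23.43'},{'transaction_id':'15', 'transaction_amount':'$723.41'}]"
print(rename_keys(cell))
```

If this layout is needed inside Spark, one option would be to wrap the function with pyspark.sql.functions.udf and a StringType() return type, at the cost of Python UDF overhead.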

