How to display row as dictionary from pyspark dataframe?
I'm new to pyspark.
I have 2 datasets, Events and Gadgets. They look like this:
Events
Gadgets
I can read and join the 2 dataframes like this, keeping only the required columns in my last line:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
from pyspark.sql.types import ArrayType, DoubleType, BooleanType
from pyspark.sql.functions import col,array_contains
spark = SparkSession.builder.appName('PySpark Read CSV').getOrCreate()
# Reading csv file
events = spark.read.option("header",True).csv("events.csv")
events.printSchema()
gadgets = spark.read.option("header",True).csv("gadgets.csv")
gadgets.printSchema()
enrich = events.join(gadgets, events.deviceId == gadgets.ID).select(events["*"],gadgets["User"])
My assignment asks me to present the data as a dictionary object like this:
Enrichment task:
{
sessionId: string
deviceId: string
timestamp: timestamp
type: enum(ADDED_TO_CART | APP_OPENED)
total_price: 50.00
user: string
}
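I can handle the dtype changes and column renaming that the assignment requires, but how do I deliver my result in the dictionary format above?
If I use this line, I'm not sure how to display my result: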
enrich.rdd.map(lambda row: row.asDict())
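Use the create_map() function to build a (key, value) pair from each column name and its value. create_map expects its input in the form (key1, value1, key2, value2, ...). To get that shape, use itertools.chain().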
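To see what itertools.chain() contributes, here is a plain-Python sketch of the flattening step on its own. Strings stand in for the F.lit(x) and F.col(x) column expressions, and the column names are only examples.

```python
import itertools

columns = ["sessionId", "deviceId", "user"]

# One (key, value) pair per column; strings stand in for F.lit(x) and F.col(x).
pairs = [(f"lit({c})", f"col({c})") for c in columns]

# chain(*pairs) flattens the pairs into key1, value1, key2, value2, ...
flat = list(itertools.chain(*pairs))
print(flat)
```

The flattened list alternates keys and values, which is the argument order create_map needs.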
import itertools
import pyspark.sql.functions as F
df = spark.createDataFrame(
    data=[
        ["sess1", "dev1", "2022-12-19", "emun(ADDED_TO_CART | APP_OPENED)", "50.00", "usr1"],
        ["sess2", "dev2", "2022-12-18", "emun(ADDED_TO_CART | APP_OPENED)", "100.00", "usr2"],
    ],
    schema=["sessionId", "deviceId", "timestamp", "type", "total_price", "user"],
)
# Build a map column whose entries alternate column name (key) and column value
df = df.withColumn(
    "map",
    F.create_map(list(itertools.chain(*((F.lit(x), F.col(x)) for x in df.columns)))),
)
df.show(truncate=False)
Output:
+---------+--------+----------+--------------------------------+-----------+----+----------------------------------------------------------------------------------------------------------------------------------------------+
|sessionId|deviceId|timestamp |type |total_price|user|map |
+---------+--------+----------+--------------------------------+-----------+----+----------------------------------------------------------------------------------------------------------------------------------------------+
|sess1 |dev1 |2022-12-19|emun(ADDED_TO_CART | APP_OPENED)|50.00 |usr1|{sessionId -> sess1, deviceId -> dev1, timestamp -> 2022-12-19, type -> emun(ADDED_TO_CART | APP_OPENED), total_price -> 50.00, user -> usr1} |
|sess2 |dev2 |2022-12-18|emun(ADDED_TO_CART | APP_OPENED)|100.00 |usr2|{sessionId -> sess2, deviceId -> dev2, timestamp -> 2022-12-18, type -> emun(ADDED_TO_CART | APP_OPENED), total_price -> 100.00, user -> usr2}|
+---------+--------+----------+--------------------------------+-----------+----+----------------------------------------------------------------------------------------------------------------------------------------------+
You can also convert it to JSON with the following:
df = df.withColumn("json", F.to_json("map"))
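Once collected, each value in the json column is a string, and json.loads turns it into a Python dictionary. A small sketch, assuming the string below resembles what to_json produces for the first sample row. It is written by hand and shortened (the type field is left out), not captured from a run.

```python
import json

# Hand-written, shortened stand-in for one collected value of the "json" column.
json_value = (
    '{"sessionId":"sess1","deviceId":"dev1",'
    '"timestamp":"2022-12-19","total_price":"50.00","user":"usr1"}'
)

record = json.loads(json_value)
print(record["user"])

# With the real dataframe this would look like:
#   records = [json.loads(r["json"]) for r in df.select("json").collect()]
```

The values stay strings here because the map column in the sample holds strings; any dtype conversion has to happen separately.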