In PySpark Structured Streaming, how do I convert each row (a JSON-formatted string) of the DataFrame into multiple columns?
My DataFrame structure looks like this:
+--------------------+
| values|
+--------------------+
|{"user_id":"00000...|
+--------------------+
The string in that column has the following structure:
{
    "user_id": "00000000002",
    "client_args": {
        "order_by": "id",
        "page": "4",
        "keyword": "Blue flowers",
        "reverse": "false"
    },
    "keyword_tokenizer": [
        "Blue",
        "flowers"
    ],
    "items": [
        "00000065678",
        "00000065707",
        "00000065713",
        "00000065741",
        "00000065753",
        "00000065816",
        "00000065875",
        "00000066172"
    ]
}
I would like the DataFrame to look like this:
+---------------+-------------------+------------------+----------------------------+
| user_id | client_args | keyword_tokenizer| items |
+---------------+-------------------+------------------+----------------------------+
|00000000000001 |{"order_by":"",...}|["Blue","flowers"]|["000006578","00002458",...]|
+---------------+-------------------+------------------+----------------------------+
My code looks like this:
import json
from pyspark.sql import functions as f

lines = spark_session\
    .readStream\
    .format("socket")\
    .option("host", "127.0.0.1")\
    .option("port", 9998)\
    .load()

# Array fields need an element type in the DDL string, and the "items"
# field was missing from the declared return type.
@f.udf("struct<user_id:string,client_args:string,keyword_tokenizer:array<string>,items:array<string>>")
def str_to_json(s):
    return json.loads(s)

lines.select(str_to_json(lines.values))
But this only parses the string into a single JSON/struct column; it does not split it into separate columns. What should I do?
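For reference, a struct column (whether produced by a UDF or by from_json) can be expanded into top-level columns with a star select. A minimal sketch, assuming the parsed column is aliased as "parsed":

# Alias the struct column, then expand every one of its fields
# into its own top-level column.
parsed = lines.select(str_to_json(lines.values).alias("parsed"))
split_df = parsed.select("parsed.*")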
Also: I later found the following way to solve the problem. Is it inefficient?
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType([StructField("user_id", StringType()),
                     StructField("client_args", StructType([
                         StructField("order_by", StringType()),
                         StructField("page", StringType()),
                         StructField("keyword", StringType()),
                         StructField("reverse", StringType()),
                     ])),
                     StructField("keyword_tokenizer", ArrayType(StringType())),
                     StructField("items", ArrayType(StringType()))])
new_df = lines.withColumn("tmp", f.from_json(lines.values, schema))\
    .withColumn("user_id", f.col("tmp").getItem("user_id"))\
    .withColumn("client_args", f.col("tmp").getItem("client_args"))\
    .withColumn("keyword_tokenizer", f.col("tmp").getItem("keyword_tokenizer"))\
    .withColumn("items", f.col("tmp").getItem("items"))\
    .drop("values", "tmp")  # the source column is named "values", not "value"
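On the efficiency question: each withColumn call adds another projection to the logical plan, though Spark's optimizer generally collapses adjacent projections, so the runtime difference is usually small. The same result can be written as a single select over the parsed struct, which is the more common idiom; a sketch reusing the schema above:

# One projection instead of five withColumn calls; same schema as above.
new_df = lines.select(f.from_json(lines.values, schema).alias("tmp"))\
              .select("tmp.*")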
Read it as a JSON file with PySpark:
df = spark.read.json("test.json")
df.show()
+--------------------+--------------------+-----------------+-----------+
| client_args| items|keyword_tokenizer| user_id|
+--------------------+--------------------+-----------------+-----------+
|[Blue flowers, id...|[00000065678, 000...| [Blue, flowers]|00000000002|
+--------------------+--------------------+-----------------+-----------+
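Since the streaming source here is a socket, spark.read.json cannot be applied to it directly, but it is handy for deriving the schema: infer it once from a static sample and reuse it in the streaming from_json call. A sketch, assuming test.json holds a sample record:

# Infer the schema from a static sample file, then apply it to the stream.
sample_schema = spark.read.json("test.json").schema
stream_df = lines.select(f.from_json(lines.values, sample_schema).alias("tmp"))\
                 .select("tmp.*")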