How to convert a JSON string to a JSON object in pyspark

Question

One column of my data frame has type string, but it actually contains JSON objects with 4 different schemas that share a few common fields. I need to convert that column into a JSON object.

Here is the schema of the data frame:

query.printSchema()

root
 |-- test: string (nullable = true)

The values of the DF look like this:

query.show(10)

+--------------------+
|                test|
+--------------------+
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
+--------------------+
only showing top 10 rows

The solution I applied:

Write into a text file:

query.write.format("text").mode('overwrite').save("s3://bucketname/temp/")

Read it back as JSON:

df = spark.read.json("s3a://bucketname/temp/")

Now print the schema. The JSON string in each row has been converted into a JSON object:

df.printSchema()

root
 |-- EventDate: string (nullable = true)
 |-- EventId: string (nullable = true)
 |-- EventNotificationType: long (nullable = true)
 |-- Interaction: struct (nullable = true)
 |    |-- ContextId: string (nullable = true)
 |    |-- Created: string (nullable = true)
 |    |-- Description: string (nullable = true)
 |    |-- Id: string (nullable = true)
 |    |-- ModelContextId: string (nullable = true)
 |-- PurchaseActivity: struct (nullable = true)
 |    |-- BillingCity: string (nullable = true)
 |    |-- BillingCountry: string (nullable = true)
 |    |-- ShippingAndHandlingAmount: double (nullable = true)
 |    |-- ShippingDiscountAmount: double (nullable = true)
 |    |-- SubscriberId: long (nullable = true)
 |    |-- SubscriptionOriginalEndDate: string (nullable = true)
 |-- SubscriptionChurn: struct (nullable = true)
 |    |-- PaymentTypeCode: long (nullable = true)
 |    |-- PaymentTypeName: string (nullable = true)
 |    |-- PreviousPaidAmount: double (nullable = true)
 |    |-- SubscriptionRemoved: string (nullable = true)
 |    |-- SubscriptionStartDate: string (nullable = true)
 |-- TransactionDetail: struct (nullable = true)
 |    |-- Amount: double (nullable = true)
 |    |-- OrderShipToCountry: string (nullable = true)
 |    |-- PayPalUserName: string (nullable = true)
 |    |-- PaymentSubTypeCode: long (nullable = true)
 |    |-- PaymentSubTypeName: string (nullable = true)

Is there a better way, where I don't need to write the dataframe as a text file and read it again as a JSON file to get the expected output?

Answer 1

You can use from_json() before you write into a text file, but you need to define the schema first.

The code looks like this:

from pyspark.sql.functions import from_json
data = query.select(from_json("test", schema=schema).alias("value")).selectExpr("value.*")

data.write.format("json").mode('overwrite').save("s3://bucketname/temp/")  # the text format accepts only a single string column


Solution 1
0 2018-12-31 05:00:31
