
Handle JSON objects in CSV File and save to PySpark DataFrame

I have a CSV file that contains JSON objects alongside other data such as strings and integers. If I read the file as CSV, the JSON objects spill over into the other columns.

Column1, Column2, Column3, Column4, Column5
100,ABC,{"abc": [{"xyz": 0, "mno": "h"}, {"apple": 0, "hello": 1, "temp": "cnot"}]},foo, pine
101,XYZ,{"xyz": [{"abc": 0, "mno": "h"}, {"apple": 0, "hello": 1, "temp": "cnot"}]},bar, apple

The output I get is:

Column1 | Column2 | Column3 | Column4 | Column5
100 | ABC | {"abc": [{"xyz": 0, "mno": "h"} | {"apple": 0, "hello": 1 | "temp": "cnot"}]}

101 | XYZ | {"xyz": [{"abc": 0, "mno": "h"} | {"xyz": [{"abc": 0, "mno": "h"} | "temp": "cnot"}]}

Test_File.py

from pyspark.sql import SparkSession

# Initializing SparkSession and setting up the file source
spark = SparkSession.builder.appName("Test_File").getOrCreate()
filepath = "s3a://file.csv"

# Read as a comma-delimited CSV (this is what splits the JSON column)
df = spark.read.format("csv").options(header="true", delimiter=",", inferSchema="true").load(filepath)
df.show(5)

I also tried handling this issue by reading the file as text, as discussed in this approach:

'100,ABC,"{\'abc\':["{\'xyz\':0,\'mno\':\'h\'}","{\'apple\':0,\'hello\':1,\'temp\':\'cnot\'}"]}", foo, pine'

'101,XYZ,"{\'xyz\':["{\'abc\':0,\'mno\':\'h\'}","{\'apple\':0,\'hello\':1,\'temp\':\'cnot\'}"]}", bar, apple'

However, instead of creating a new file, I want to load these quoted strings into a PySpark DataFrame so that I can run SQL queries on them. To create the DataFrame I need to split each line again to assign the columns, which ends up splitting the JSON object again.

The issue is the delimiter you are using. You are reading the CSV with a comma as the delimiter, and your JSON strings also contain commas, so Spark splits the JSON strings on those commas too, which produces the output above. To get around this, the CSV needs a delimiter that is unique, i.e. one that does not occur in any of the column values.
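To illustrate the point about delimiters, here is a minimal sketch in plain Python, assuming the JSON column is always a well-formed object or array and that the other columns contain no braces or brackets. It splits a line only on commas that sit outside the JSON, then rejoins the fields with a different delimiter. The helper names `split_outside_json` and `to_delimited`, and the choice of `|` as the new delimiter, are hypothetical and not part of the original post.

```python
import json


def split_outside_json(line, delimiter=","):
    """Split `line` on `delimiter`, skipping delimiters nested inside
    {...} / [...] or inside double-quoted JSON strings."""
    fields, current = [], []
    depth = 0          # current nesting level of { } and [ ]
    in_string = False  # True while inside a double-quoted JSON string
    escaped = False    # True if the previous char was a backslash in a string
    for ch in line:
        if in_string:
            current.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"' and depth > 0:
            in_string = True
            current.append(ch)
        elif ch in "{[":
            depth += 1
            current.append(ch)
        elif ch in "}]":
            depth -= 1
            current.append(ch)
        elif ch == delimiter and depth == 0:
            # A top-level delimiter ends the current field
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    fields.append("".join(current).strip())
    return fields


def to_delimited(line, new_delimiter="|"):
    """Rewrite one comma-separated line using a delimiter assumed to be
    absent from every column value (an assumption, not checked here)."""
    return new_delimiter.join(split_outside_json(line))


sample = ('100,ABC,{"abc": [{"xyz": 0, "mno": "h"}, '
          '{"apple": 0, "hello": 1, "temp": "cnot"}]},foo, pine')
fields = split_outside_json(sample)
print(len(fields))            # 5
print(json.loads(fields[2]))  # the JSON column parses as a single object
print(to_delimited(sample))
```

In Spark, one option would be to rewrite the file this way and then load it with `delimiter="|"`; another would be to read the file with `spark.read.text(filepath)` and apply the helper to each line before building the DataFrame. Neither variant was run against the original data, so treat both as starting points.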
