I have a JSON file saved in S3 that I am trying to read into a dictionary (or struct) in PySpark. It looks something like this:
{
  "filename": "some_file.csv",
  "md5": "md5 hash",
  "client_id": "some uuid",
  "mappings": {
    "shipping_city": "City",
    "shipping_country": "Country",
    "shipping_zipcode": "Zip",
    "shipping_address1": "Street Line 1",
    "shipping_address2": "Street Line 2",
    "shipping_state_abbreviation": "State"
  }
}
And I would like to read it from S3 and store it as a dictionary or struct. When I read it like so:
inputJSON = "s3://bucket/file.json"
dfJSON = sqlContext.read.json(inputJSON, multiLine=True)
I get a dataframe where the mappings column appears to drop its keys and looks like this:
+---------+-------------+---------------------------------------------------------+-------+
|client_id|filename     |mappings                                                 |md5    |
+---------+-------------+---------------------------------------------------------+-------+
|some uuid|some_file.csv|[City, Country, Zip, Street Line 1, Street Line 2, State]|md5hash|
+---------+-------------+---------------------------------------------------------+-------+
Is it possible to open the file and read it into a dictionary, so that I could access the mappings like this?
jsonDict = inputFile
mappingDict = jsonDict['mappings']
You can try something like this:
inputJSON = "/tmp/some_file.json"
dfJSON = spark.read.json(inputJSON, multiLine=True)
dfJSON.printSchema()
root
|-- client_id: string (nullable = true)
|-- filename: string (nullable = true)
|-- mappings: struct (nullable = true)
| |-- shipping_address1: string (nullable = true)
| |-- shipping_address2: string (nullable = true)
| |-- shipping_city: string (nullable = true)
| |-- shipping_country: string (nullable = true)
| |-- shipping_state_abbreviation: string (nullable = true)
| |-- shipping_zipcode: string (nullable = true)
|-- md5: string (nullable = true)
dict_mappings = dfJSON.select("mappings").toPandas().set_index('mappings').T.to_dict('list')
dict_mappings
{Row(shipping_address1='Street Line 1', shipping_address2='Street Line 2', shipping_city='City', shipping_country='Country', shipping_state_abbreviation='State', shipping_zipcode='Zip'): []}
OR (without Pandas)
# Each collected Row holds the nested mappings struct as another Row;
# asDict() converts it into a plain Python dictionary.
dict_mappings2 = dfJSON.select("mappings").first()["mappings"].asDict()
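The collect-then-`asDict()` step can be sketched without a Spark session; here plain dicts stand in for the collected `Row` objects (an assumption for illustration only, since `Row.asDict()` returns exactly such a dict):

```python
# Stand-in for the result of dfJSON.select("mappings").collect() after
# converting each Row with asDict(): a one-element list of plain dicts.
collected = [
    {"mappings": {"shipping_city": "City",
                  "shipping_country": "Country",
                  "shipping_zipcode": "Zip"}}
]

# Equivalent of dfJSON.select("mappings").first()["mappings"].asDict():
# take the single row's "mappings" value, which is the desired dictionary.
mapping_dict = collected[0]["mappings"]
print(mapping_dict["shipping_city"])  # City
```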
I was able to solve this by adding boto3 to the EMR cluster and using the following code:
import boto3
import json

s3 = boto3.resource('s3')
obj = s3.Object('slm-transaction-incoming', 'All_Starbucks_Locations_in_the_US.json')
string = obj.get()['Body'].read().decode('utf-8')
# Name the result json_dict rather than json, to avoid shadowing the module
json_dict = json.loads(string)
Adding boto3 can be done by typing the following into the EMR Terminal:
sudo pip-3.6 install boto3
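Once `json.loads` has run, the mappings are an ordinary nested dict. As a sketch of the access pattern (using the sample document from the question as a literal string so it runs without boto3 or S3 access):

```python
import json

# Sample document from the question, standing in for the S3 object body.
raw = '''{
  "filename": "some_file.csv",
  "md5": "md5 hash",
  "client_id": "some uuid",
  "mappings": {
    "shipping_city": "City",
    "shipping_country": "Country",
    "shipping_zipcode": "Zip",
    "shipping_address1": "Street Line 1",
    "shipping_address2": "Street Line 2",
    "shipping_state_abbreviation": "State"
  }
}'''

json_dict = json.loads(raw)
mapping_dict = json_dict["mappings"]
print(mapping_dict["shipping_city"])     # City
print(mapping_dict["shipping_zipcode"])  # Zip
```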