I have data in the following format, loaded from Hive into a DataFrame:
date, stock, price
1388534400, GOOG, 50
1388534400, FB, 60
1388534400, MSFT, 55
1388620800, GOOG, 52
1388620800, FB, 61
1388620800, MSFT, 55
Here date is the epoch timestamp for midnight on that day, and we have data going back about 10 years (800 million+ rows). My aim is to end up with a bunch of JSON files, one per stock, that would look like:
GOOG.json:
{
  "1388534400": 50,
  "1388620800": 52
}
FB.json:
{
  "1388534400": 60,
  "1388620800": 61
}
A naive way would be to get a list of unique stocks and then, for each stock, filter the dataframe down to just that stock's rows, but this seems horribly inefficient. Can this be done easily in Spark? I currently have it working in native Python using PyHive, but given the sheer volume of data, I'd rather run this on a cluster with Spark.
Yes, this is quite straightforward: use a DataFrameWriter with partitionBy, specifying the column(s) to partition on (in your case, stock).
From the PySpark documentation:
df.write.partitionBy('year', 'month').parquet(os.path.join(tempfile.mkdtemp(), 'data'))
For you this would be:
df.write.partitionBy('stock').json('/path/to/output')
Note a few things:
- partitionBy writes a directory per value (e.g. stock=GOOG/) containing one or more part files, rather than a single GOOG.json file per stock.
- The partition column (stock) is dropped from the written records; it is encoded in the directory name instead.
- Spark's JSON writer emits line-delimited JSON, one object per row (e.g. {"date":1388534400,"price":50}), not the single {date: price} map in your example, so producing that exact shape needs an extra aggregation step.