I have data in the following format, loaded from Hive into a DataFrame:
date, stock, price
1388534400, GOOG, 50
1388534400, FB, 60
1388534400, MSFT, 55
1388620800, GOOG, 52
1388620800, FB, 61
1388620800, MSFT, 55
Here date is the epoch timestamp for midnight on that day, and we have data going back about 10 years (800 million+ rows). My aim is to end up with a bunch of JSON files, one per stock, that would look like:
GOOG.json:
{
  "1388534400": 50,
  "1388620800": 52
}
FB.json:
{
  "1388534400": 60,
  "1388620800": 61
}
A naive way would be to get a list of unique stocks and then, for each stock, filter the dataframe down to just that stock's rows, but this seems horribly inefficient. Can this be done easily in Spark? I currently have it working in native Python using PyHive, but given the sheer volume of data, I'd rather run this on a cluster with Spark.
Yes, this is quite straightforward: use a DataFrameWriter with partitionBy, specifying the column(s) to partition on (in your case, stock).
From the PySpark documentation:
df.write.partitionBy('year', 'month').parquet(os.path.join(tempfile.mkdtemp(), 'data'))
For you this would be:
df.write.partitionBy('stock').json('/path/to/output')
Note a few things:
- partitionBy writes a directory per value (e.g. stock=GOOG/) containing one or more part files, rather than a single GOOG.json file per stock.
- The partition column (stock) is dropped from the written records; it is encoded in the directory name instead.
- Spark's JSON writer emits line-delimited JSON, one object per row (e.g. {"date":1388534400,"price":50}), not the single {date: price} map in your example, so producing that exact shape needs an extra aggregation step.