
PySpark: create dict of dicts from dataframe?

I have data in the following format, which is obtained from Hive into a dataframe:

date, stock, price
1388534400, GOOG, 50
1388534400, FB, 60
1388534400, MSFT, 55
1388620800, GOOG, 52
1388620800, FB, 61
1388620800, MSFT, 55

Where date is the epoch for midnight on that day, and we have data going back 10 years or so (800 million+ rows). I'd like to get a dictionary as follows:

{
    'GOOG': {
        '1388534400': 50,
        '1388620800': 52
    },
    'FB': {
        '1388534400': 60,
        '1388620800': 61
    }
}

A naive way would be to get a list of unique stocks and then, for each stock, filter the dataframe down to just that stock's rows (sketched below), but this seems horribly inefficient. Can this be done easily in Spark? I currently have it working in native Python using PyHive, but due to the sheer volume of data I'd rather have this done on a cluster with Spark.
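For reference, the naive approach I have in mind looks roughly like the following sketch (assuming df is the dataframe loaded from Hive) - one filter plus collect per stock, i.e. one Spark job per ticker:

# Naive approach: collect the distinct tickers, then filter and collect once per ticker.
# This is the pattern I'd like to avoid for 800 million+ rows.
stocks = [row["stock"] for row in df.select("stock").distinct().collect()]

result = {}
for s in stocks:
    rows = df.filter(df["stock"] == s).select("date", "price").collect()
    result[s] = {str(r["date"]): r["price"] for r in rows}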

In Spark 2.4 you can use map_from_arrays to build the date-to-price maps when aggregating the values for each stock. Then it's just a matter of using create_map to key each of those maps by the ticker symbol. This example uses ChainMap from Python's collections module (available since Python 3.3) to assemble the final dict structure you described.

import json
from collections import ChainMap
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession \
    .builder \
    .appName("example") \
    .getOrCreate()

df = spark.createDataFrame([
    (1388534400, "GOOG", 50),
    (1388534400, "FB", 60),
    (1388534400, "MSFT", 55),
    (1388620800, "GOOG", 52),
    (1388620800, "FB", 61),
    (1388620800, "MSFT", 55)]
).toDF("date", "stock", "price")

out = df.groupBy("stock") \
        .agg(
            map_from_arrays(
                collect_list("date"), collect_list("price")).alias("values")) \
        .select(create_map("stock", "values").alias("values")) \
        .rdd.flatMap(lambda x: x) \
        .collect()

print(json.dumps(dict(ChainMap(*out)), indent=4, separators=(',', ': '), sort_keys=True))

Which gives:

{                                                                               
    "FB": {
        "1388534400": 60,
        "1388620800": 61
    },
    "GOOG": {
        "1388534400": 50,
        "1388620800": 52
    },
    "MSFT": {
        "1388534400": 55,
        "1388620800": 55
    }
}
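For context, each collected row ends up as a single-key dict such as {'GOOG': {...}}, and ChainMap simply merges those into one mapping. A minimal pure-Python illustration of that last step:

from collections import ChainMap

# Each element mimics one collected row's map column.
parts = [{"GOOG": {"1388534400": 50}}, {"FB": {"1388534400": 60}}]
merged = dict(ChainMap(*parts))
# {'GOOG': {'1388534400': 50}, 'FB': {'1388534400': 60}}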

However, as you say you have a lot of data, you probably don't actually want to build this dictionary in memory on the driver, so you would likely be better off splitting it up and writing the same dictionary structure into files for different partitions.

Let's do that by truncating the dates to the month and writing a separate file for each month and each stock:

out = df.groupBy(trunc(expr("CAST(date as TIMESTAMP)"), "month").alias("month"), df["stock"]) \
        .agg(
            map_from_arrays(
                collect_list("date"), collect_list("price")).alias("values")) \
        .select("month", "stock", create_map("stock", "values").alias("values"))

out.write.partitionBy("month", "stock").format("json").save("out/prices")

This gives you a structure like the following:

out
└── prices
    ├── _SUCCESS
    └── month=2014-01-01
        ├── stock=FB
        │   └── part-00093-3741bdc2-345a-488e-82da-53bb586cd23b.c000.json
        ├── stock=GOOG
        │   └── part-00014-3741bdc2-345a-488e-82da-53bb586cd23b.c000.json
        └── stock=MSFT
            └── part-00152-3741bdc2-345a-488e-82da-53bb586cd23b.c000.json

And the MSFT file looks like this:

{"values":{"MSFT":{"1388534400":55,"1388620800":55}}}

While the "values" column name may not be in your dictionary structure, I hope this illustrates what you can do.

I am using Spark 2.3.1. Here is a PySpark version:

from pyspark.sql.functions import udf,collect_list,create_map
from pyspark.sql.types import MapType,IntegerType,StringType

myValues = [('1388534400', 'GOOG', 50), ('1388534400', 'FB', 60), ('1388534400', 'MSFT', 55),
            ('1388620800', 'GOOG', 52), ('1388620800', 'FB', 61), ('1388620800', 'MSFT', 55)]
# sqlContext is available in the PySpark shell; otherwise use spark.createDataFrame
df = sqlContext.createDataFrame(myValues, ['date', 'stock', 'price'])
df.show()
+----------+-----+-----+
|      date|stock|price|
+----------+-----+-----+
|1388534400| GOOG|   50|
|1388534400|   FB|   60|
|1388534400| MSFT|   55|
|1388620800| GOOG|   52|
|1388620800|   FB|   61|
|1388620800| MSFT|   55|
+----------+-----+-----+

# Merge a list of single-entry {date: price} maps into one {date: price} map
combineMap = udf(lambda maps: {key: f[key] for f in maps for key in f},
                 MapType(StringType(), IntegerType()))

# Same merge, but one level deeper: {stock: {date: price}} maps
combineDeepMap = udf(lambda maps: {key: f[key] for f in maps for key in f},
                     MapType(StringType(), MapType(StringType(), IntegerType())))

mapdf = df.groupBy('stock') \
    .agg(collect_list(create_map('date', 'price')).alias('maps')) \
    .agg(combineDeepMap(collect_list(create_map('stock', combineMap('maps')))))

new_dict= mapdf.collect()[0][0]
print(new_dict)
{u'GOOG': {u'1388620800': 52, u'1388534400': 50}, u'FB': {u'1388620800': 61, u'1388534400': 60}, u'MSFT': {u'1388620800': 55, u'1388534400': 55}}
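Since the result is an ordinary nested dict (keyed by the string dates used above), lookups work as you would expect, for example:

print(new_dict['GOOG']['1388534400'])  # 50
print(new_dict['MSFT'])                # {u'1388620800': 55, u'1388534400': 55}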
