How can I store the intermediate result in a PySpark reduceByKey function?

This is a case of calculating the average holding cost. We only consider trades that increase the account balance and ignore trades that decrease it.

# data example: ((1, '000001'), ('A', 0, 5000, 5000))
# (1, '000001') is the group-by key, 'A' is the order-by key (serialno), 0 is the
# account balance before the trade, 5000 is the trade amount, and 5000 is the
# account balance after the trade. We aim to calculate the average cost per unit
# after the trades in each group with a Spark RDD.

confirm = [
    ((1, '000001'), ('A', 0, 5000, 5000)),
    ((1, '000001'), ('C', 9000, 1000, 10000)),
    ((1, '000001'), ('B', 5000, 5000, 9000)),
    ((2, '000001'), ('D', 0, 3300, 3000)),
    ((2, '000001'), ('F', 4000, 5000, 10000)),
    ((2, '000001'), ('E', 3000, 4200, 6000)),
    ((3, '000001'), ('G', 0, 3300, 3000)),
    ((3, '000001'), ('H', 3000, 3300, 6300))
]
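
To make the arithmetic concrete, here is a plain-Python trace (my own, not in the original post) of group (1, '000001'): the first trade seeds the average as trade / balance_after, and each later trade updates it as (avg * balance_before + trade) / balance_after, which is what udf_func below computes.

# Running average cost for key (1, '000001'); trades sorted by serialno: A, B, C.
avg = 5000 / 5000                   # A: seed with trade / after = 1.0
avg = (avg * 5000 + 5000) / 9000    # B: (1.0 * 5000 + 5000) / 9000 ≈ 1.1111
avg = (avg * 9000 + 1000) / 10000   # C: (1.1111 * 9000 + 1000) / 10000 ≈ 1.1
print(avg)                          # ≈ 1.1, matching the collected result below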


def my_partition(x):
    # Route each key to a partition by the first element of the group-by key.
    return x[0] % 3


def partSort(x):
    # Sort the records of a partition by serialno (the order-by key).
    return iter(sorted(x, key=lambda rec: rec[1][0]))


import pandas as pd


def udf_func(x, y):
    # reduceByKey folds the values of each key pairwise: on the first call x is
    # the first trade tuple; on later calls x is the running average (a float).
    if y is None:
        result = x[2] / x[3]
        df = pd.DataFrame([{'serialno': x[0], 'result': result}])
    else:
        result = ((x if isinstance(x, float) else x[2] / x[3]) * y[1] + y[2]) / y[3]
        df = pd.DataFrame([{'serialno': y[0], 'result': result}])
    # this is where I want to store the intermediate result, but it does not work:
    df.to_csv("/home/zo_om/result.csv", 'a')
    return result


rdd = (sc.parallelize(confirm)
       .partitionBy(3, my_partition)
       .mapPartitions(partSort)
       .reduceByKey(udf_func))
rdd.collect()
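
For reference, reduceByKey keeps only the final accumulator per key, which is why collect() returns a single number per group; conceptually, for key (1, '000001') Spark evaluates something like the following (my own illustration, assuming left-to-right folding over the sorted partition).

udf_func(udf_func(('A', 0, 5000, 5000), ('B', 5000, 5000, 9000)),
         ('C', 9000, 1000, 10000))  # ≈ 1.1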

After I run the code, the result is:

[
 ((3, '000001'), 1.0476190476190477),
 ((2, '000001'), 1.0),
 ((1, '000001'), 1.1)
]

which is the last result of each group.

I can see only 1 row in "/home/zo_om/result.csv" (and only on one worker node of the Spark cluster; zo_om is the Kerberos user). What I expect to see is 8 rows, one for each serialno ('A' through 'H').

I guess you only see one row because pd.DataFrame.to_csv overwrites the existing data and you keep writing to the same path.
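
In pd.DataFrame.to_csv, arguments after the path should be passed by keyword: in df.to_csv("/home/zo_om/result.csv", 'a') the 'a' is taken as the sep argument (and recent pandas versions reject extra positional arguments outright), so the file is opened in the default write mode and truncated on every call. A minimal sketch of the append fix, assuming the path is writable on whichever node runs the task:

import pandas as pd

df = pd.DataFrame([{'serialno': 'A', 'result': 1.0}])  # example row

# mode='a' opens the file for appending; header=False keeps the header line
# from being repeated on every call.
df.to_csv("/home/zo_om/result.csv", mode='a', header=False, index=False)

Even then, udf_func runs on the executors, so each row lands on the local disk of whichever worker node happened to run the task, which matches seeing the file on only one node. A sketch (my own, not from the original post) that instead returns every intermediate average through Spark, replacing reduceByKey with a second mapPartitions pass that yields one record per serialno:

def running_avg(partition):
    # The partition is already sorted by serialno within each key (see partSort).
    state = {}  # key -> running average cost per unit
    for key, (serialno, before, trade, after) in partition:
        prev = state.get(key)
        avg = trade / after if prev is None else (prev * before + trade) / after
        state[key] = avg
        yield (key, serialno, avg)


intermediate = (sc.parallelize(confirm)
                .partitionBy(3, my_partition)
                .mapPartitions(partSort)
                .mapPartitions(running_avg))
intermediate.collect()  # 8 rows, one per serialno 'A' through 'H'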
