
Convert an RDD of Tuples of Varying Sizes to a DataFrame in Spark

I am having difficulty converting an RDD of the following structure to a DataFrame in Spark using Python.

df1=[['usr1',('itm1',2),('itm3',3)], ['usr2',('itm2',3), ('itm3',5),('itm22',6)]]

After converting, my DataFrame should look like the following:

       usr1  usr2
itm1    2.0   NaN
itm2    NaN   3.0
itm22   NaN   6.0
itm3    3.0   5.0

I was initially thinking of converting the above RDD structure to the following:

df1={'usr1': {'itm1': 2, 'itm3': 3}, 'usr2': {'itm2': 3, 'itm3': 5, 'itm22':6}}

Then use Python's pandas module, pand=pd.DataFrame(dat2), and then convert the pandas DataFrame back to a Spark DataFrame using spark_df = context.createDataFrame(pand). However, I believe that by doing this I am converting an RDD to a non-RDD object and then converting back to an RDD, which is not correct. Can someone please help me out with this problem?
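For reference, a minimal sketch of the round trip described above (assuming the nested dict shown earlier and an existing SQLContext or SparkSession named context; the variable names are only illustrative):

import pandas as pd

dat2 = {'usr1': {'itm1': 2, 'itm3': 3},
        'usr2': {'itm2': 3, 'itm3': 5, 'itm22': 6}}

pand = pd.DataFrame(dat2)       # items become the index, users become columns
pand = pand.reset_index()       # keep the item labels as a regular column
spark_df = context.createDataFrame(pand)

This works for small data, but it pulls everything onto the driver first.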

With data like this:

rdd = sc.parallelize([
    ['usr1',('itm1',2),('itm3',3)], ['usr2',('itm2',3), ('itm3',5),('itm22',6)]
])

flatten the records:

def to_record(kvs):
    user, *vs = kvs  # For Python 2.x use standard indexing / slicing
    for item, value in vs:
        yield user, item, value

records = rdd.flatMap(to_record)

convert to DataFrame:

df = records.toDF(["user", "item", "value"])
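At this point df is a long, three-column DataFrame; a quick check of the inferred schema (shown under the same assumptions as the sample data above):

df.printSchema()
## root
##  |-- user: string (nullable = true)
##  |-- item: string (nullable = true)
##  |-- value: long (nullable = true)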

pivot:

result = df.groupBy("item").pivot("user").sum()

result.show()
## +-----+----+----+
## | item|usr1|usr2|
## +-----+----+----+
## | itm1|   2|null|
## | itm2|null|   3|
## | itm3|   3|   5|
## |itm22|null|   6|
## +-----+----+----+
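If the pivoted result is small enough to collect, one way (not part of the original answer) to reproduce the item-indexed layout from the question is to bring it back to pandas:

pdf = result.toPandas().set_index("item").sort_index()
# the null cells typically surface as NaN once pandas converts the columns to floats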

Note: Spark DataFrames are designed to handle long and relatively thin data. If you want to generate a wide contingency table, DataFrames won't be useful, especially if the data is dense and you want to keep a separate column per feature.
