
Convert an RDD of Tuples of Varying Sizes to a DataFrame in Spark

I am having difficulty converting an RDD of the following structure to a DataFrame in Spark using Python.

df1=[['usr1',('itm1',2),('itm3',3)], ['usr2',('itm2',3), ('itm3',5),('itm22',6)]]

After converting, my DataFrame should look like the following:

       usr1  usr2
itm1    2.0   NaN
itm2    NaN   3.0
itm22   NaN   6.0
itm3    3.0   5.0

I was initially thinking of converting the above RDD structure to the following:

df1={'usr1': {'itm1': 2, 'itm3': 3}, 'usr2': {'itm2': 3, 'itm3': 5, 'itm22':6}}

Then use Python's pandas module, pand=pd.DataFrame(dat2), and then convert the pandas DataFrame back to a Spark DataFrame using spark_df = context.createDataFrame(pand). However, I believe that by doing this I am converting an RDD to a non-RDD object and then converting back to an RDD, which is not correct. Can someone please help me out with this problem?
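For reference, a minimal sketch of the round trip described above (assuming the nested dict shown earlier and an existing SQLContext or SparkSession named context; the variable names are only illustrative):

import pandas as pd

dat2 = {'usr1': {'itm1': 2, 'itm3': 3},
        'usr2': {'itm2': 3, 'itm3': 5, 'itm22': 6}}

pand = pd.DataFrame(dat2)       # items become the index, users become columns
pand = pand.reset_index()       # keep the item labels as a regular column
spark_df = context.createDataFrame(pand)

This works for small data, but it pulls everything onto the driver first.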

With data like this:

rdd = sc.parallelize([
    ['usr1',('itm1',2),('itm3',3)], ['usr2',('itm2',3), ('itm3',5),('itm22',6)]
])

flatten the records:

def to_record(kvs):
    user, *vs = kvs  # For Python 2.x use standard indexing / slicing
    for item, value in vs:
        yield user, item, value

records = rdd.flatMap(to_record)

convert to DataFrame:

df = records.toDF(["user", "item", "value"])
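At this point df is a long, three-column DataFrame; a quick check of the inferred schema (shown under the same assumptions as the sample data above):

df.printSchema()
## root
##  |-- user: string (nullable = true)
##  |-- item: string (nullable = true)
##  |-- value: long (nullable = true)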

pivot:

result = df.groupBy("item").pivot("user").sum()

result.show()
## +-----+----+----+
## | item|usr1|usr2|
## +-----+----+----+
## | itm1|   2|null|
## | itm2|null|   3|
## | itm3|   3|   5|
## |itm22|null|   6|
## +-----+----+----+
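If the pivoted result is small enough to collect, one way (not part of the original answer) to reproduce the item-indexed layout from the question is to bring it back to pandas:

pdf = result.toPandas().set_index("item").sort_index()
# the null cells typically surface as NaN once pandas converts the columns to floats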

Note: Spark DataFrames are designed to handle long and relatively thin data. If you want to generate a wide contingency table, DataFrames won't be useful, especially if the data is dense and you want to keep a separate column per feature.
