Convert an RDD of Tuples of Varying Sizes to a DataFrame in Spark
I am having difficulty converting an RDD of the following structure to a DataFrame in Spark using Python.
df1=[['usr1',('itm1',2),('itm3',3)], ['usr2',('itm2',3), ('itm3',5),('itm22',6)]]
After converting, my dataframe should look like the following:
usr1 usr2
itm1 2.0 NaN
itm2 NaN 3.0
itm22 NaN 6.0
itm3 3.0 5.0
I was initially thinking of converting the above RDD structure to the following:
df1={'usr1': {'itm1': 2, 'itm3': 3}, 'usr2': {'itm2': 3, 'itm3': 5, 'itm22':6}}
Then use Python's pandas module, pand=pd.DataFrame(dat2), and convert the pandas dataframe back to a Spark dataframe using spark_df = context.createDataFrame(pand). However, I believe that by doing this I am converting an RDD to a non-RDD object and then converting back to an RDD, which is not correct. Can someone please help me out with this problem?
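As a sanity check of that idea, the nested dict can be built from the original list structure in plain Python (a sketch only, no Spark involved; the name dat2 loosely follows the pd.DataFrame(dat2) call in the question):

```python
# A plain-Python sketch (no Spark) of the nested-dict idea from the question.
rows = [['usr1', ('itm1', 2), ('itm3', 3)],
        ['usr2', ('itm2', 3), ('itm3', 5), ('itm22', 6)]]

# The first element of each row is the user; the rest are (item, value) pairs.
dat2 = {row[0]: dict(row[1:]) for row in rows}
# dat2 == {'usr1': {'itm1': 2, 'itm3': 3},
#          'usr2': {'itm2': 3, 'itm3': 5, 'itm22': 6}}
```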
With data like this:
rdd = sc.parallelize([
    ['usr1', ('itm1', 2), ('itm3', 3)], ['usr2', ('itm2', 3), ('itm3', 5), ('itm22', 6)]
])
flatten the records:
def to_record(kvs):
    user, *vs = kvs  # For Python 2.x use standard indexing / slicing
    for item, value in vs:
        yield user, item, value
records = rdd.flatMap(to_record)
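The flattening can be checked locally without a SparkContext; this sketch just mirrors the sample data and applies the same generator the way flatMap would:

```python
# Plain-Python check of what to_record produces (no Spark needed).
def to_record(kvs):
    user, *vs = kvs  # Python 3 unpacking; on 2.x use kvs[0] and kvs[1:]
    for item, value in vs:
        yield user, item, value

rows = [['usr1', ('itm1', 2), ('itm3', 3)],
        ['usr2', ('itm2', 3), ('itm3', 5), ('itm22', 6)]]

# flatMap is equivalent to chaining the generator over every row:
flat_records = [rec for row in rows for rec in to_record(row)]
# flat_records == [('usr1', 'itm1', 2), ('usr1', 'itm3', 3),
#                  ('usr2', 'itm2', 3), ('usr2', 'itm3', 5), ('usr2', 'itm22', 6)]
```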
convert to DataFrame:
df = records.toDF(["user", "item", "value"])
pivot:
result = df.groupBy("item").pivot("user").sum()
result.show()
## +-----+----+----+
## | item|usr1|usr2|
## +-----+----+----+
## | itm1| 2|null|
## | itm2|null| 3|
## | itm3| 3| 5|
## |itm22|null| 6|
## +-----+----+----+
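For intuition, what groupBy("item").pivot("user").sum() computes can be sketched in plain Python: one row per item, one column per user, and null (None here) where a (user, item) pair never occurred. This is an illustration only, not how Spark implements it:

```python
# Plain-Python sketch of the pivot step above.
flat_records = [('usr1', 'itm1', 2), ('usr1', 'itm3', 3),
                ('usr2', 'itm2', 3), ('usr2', 'itm3', 5), ('usr2', 'itm22', 6)]

users = sorted({user for user, _, _ in flat_records})  # pivoted column order
wide = {}
for user, item, value in flat_records:
    row = wide.setdefault(item, {u: None for u in users})
    # sum() semantics: accumulate if the (item, user) cell already has a value
    row[user] = value if row[user] is None else row[user] + value
# wide['itm1'] == {'usr1': 2, 'usr2': None}
# wide['itm22'] == {'usr1': None, 'usr2': 6}
```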
Note: Spark DataFrames are designed to handle long and relatively thin data. If you want to generate a wide contingency table, DataFrames won't be useful, especially if the data is dense and you want to keep a separate column per feature.