
How to process a pyspark dataframe as groups by column value

I have a huge dataframe of different item_id values and their related data, and I need to process each item_id group separately, in parallel. I tried to repartition the dataframe by item_id using the code below, but it seems it is still being processed as a whole rather than in per-group chunks:

import pandas as pd

data = sqlContext.read.csv(path='/user/data', header=True)
columns = data.columns
# Repartition by ITEM_ID, then try to turn each partition into a pandas
# DataFrame and score it with scan_item_best_model.
result = data.repartition('ITEM_ID') \
        .rdd \
        .mapPartitions(lambda iter: pd.DataFrame(list(iter), columns=columns)) \
        .mapPartitions(scan_item_best_model) \
        .collect()

Also, is repartition the correct approach, or is there something I am doing wrong?

After looking around I found this, which addresses a similar problem. In the end I had to solve it as follows:

import pandas as pd
from pyspark.sql import functions as F

data = sqlContext.read.csv(path='/user/data', header=True)
columns = data.columns

# Pack every row into a single struct column, then collect all rows that
# share an ITEM_ID into one list per group.
df = data.select("ITEM_ID", F.struct(columns).alias("df"))
df = df.groupBy('ITEM_ID').agg(F.collect_list('df').alias('data'))

# Rebuild a pandas DataFrame per ITEM_ID and feed each (id, frame) pair
# to scan_item_best_model; an action such as collect() triggers execution.
df = df.rdd \
    .map(lambda row: (row['ITEM_ID'],
                      pd.DataFrame.from_records(row['data'], columns=columns))) \
    .map(scan_item_best_model)
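
For reference, recent Spark versions expose per-group pandas processing directly, which avoids the collect_list round-trip. Below is a minimal sketch using GroupedData.applyInPandas (available in Spark 3.0+, requires pyarrow); it assumes scan_item_best_model can be wrapped as a pandas-in/pandas-out function whose output columns match the declared schema, which is an assumption, not part of the original answer:

import pandas as pd

def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical wrapper: assumes scan_item_best_model can accept and
    # return a pandas DataFrame matching the schema declared below.
    return scan_item_best_model(pdf)

# Each distinct ITEM_ID group is handed to process_group as one pandas
# DataFrame; the result is a lazy Spark DataFrame until an action runs.
result = data.groupBy('ITEM_ID').applyInPandas(process_group, schema=data.schema)

If the model output has a different shape, the schema argument would need to describe that output instead of reusing data.schema.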
