How to process a PySpark dataframe in groups by column value
I have a huge dataframe of different item_id values and their related data, and I need to process each item_id group separately, in parallel. I tried to repartition the dataframe by item_id using the code below, but it seems the data is still being processed as a whole rather than in per-item chunks:
import pandas as pd

data = sqlContext.read.csv(path='/user/data', header=True)
columns = data.columns

# Repartition by ITEM_ID, then turn each partition into a pandas
# DataFrame and run the per-item processing on it
result = data.repartition('ITEM_ID') \
    .rdd \
    .mapPartitions(lambda rows: [pd.DataFrame(list(rows), columns=columns)]) \
    .mapPartitions(scan_item_best_model) \
    .collect()
Also, is repartition the correct approach here, or is there something I'm doing wrong?
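My understanding (which may be off) is that repartition('ITEM_ID') only hash-partitions the rows, so several distinct item_ids can land in the same partition, and mapPartitions then sees all of them together rather than one group at a time. A quick way to check this, assuming the column is named ITEM_ID:

# Inspect which ITEM_IDs actually end up in each partition;
# with hash partitioning, one partition usually holds several ids
parts = data.repartition('ITEM_ID') \
    .rdd \
    .map(lambda row: row['ITEM_ID']) \
    .glom() \
    .collect()
for i, ids in enumerate(parts):
    print(i, sorted(set(ids)))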
After looking around I found this, which addresses a similar problem; in the end I had to solve it like this:
import pandas as pd
from pyspark.sql import functions as F

data = sqlContext.read.csv(path='/user/data', header=True)
columns = data.columns

# Pack each row into a struct, then collect all rows of an ITEM_ID
# into a single list, so every group becomes one record
df = data.select("ITEM_ID", F.struct(columns).alias("df"))
df = df.groupBy('ITEM_ID').agg(F.collect_list('df').alias('data'))

# Rebuild a pandas DataFrame per ITEM_ID and run the per-item processing;
# an action such as collect() is still needed to trigger execution
df = df.rdd \
    .map(lambda big_df: (big_df['ITEM_ID'],
                         pd.DataFrame.from_records(big_df['data'], columns=columns))) \
    .map(scan_item_best_model)
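For what it's worth, on Spark 3.0+ the same per-group processing can also be expressed with groupBy().applyInPandas, which hands each ITEM_ID group to a function as a pandas DataFrame. This is a rough sketch, not the exact solution above: it assumes scan_item_best_model can be adapted to accept and return a pandas DataFrame, and result_schema (a name I'm making up here) describes the returned columns:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical output schema; adjust to whatever your function returns
result_schema = StructType([
    StructField("ITEM_ID", StringType()),
    StructField("best_model", StringType()),
    StructField("score", DoubleType()),
])

def scan_group(pdf):
    # pdf is a pandas DataFrame holding all rows of one ITEM_ID;
    # it must return a pandas DataFrame matching result_schema
    return scan_item_best_model(pdf)

result = data.groupBy("ITEM_ID").applyInPandas(scan_group, schema=result_schema)

This avoids the collect_list/from_records round trip, since Spark does the group-to-pandas conversion via Arrow.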