
Convert Pyspark dataframe column to dict without RDD conversion

I have a Spark dataframe with a column of integers:

MYCOLUMN:
1
1
2
5
5
5
6

The goal is to get output equivalent to collections.Counter([1,1,2,5,5,5,6]). I can achieve the desired result by converting the column to an RDD, calling collect, and then applying Counter, but this is rather slow for large dataframes.
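
For concreteness, the slow baseline described above might look something like this (a sketch only; the question doesn't show the exact code, and the dataframe is assumed to be named df):

from collections import Counter

# Baseline: drop to the RDD API, pull every value to the driver,
# then count locally. collect() on a large column is the bottleneck.
values = df.select('MYCOLUMN').rdd.map(lambda row: row[0]).collect()
counts = Counter(values)  # Counter({5: 3, 1: 2, 2: 1, 6: 1})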

Is there a better approach that uses dataframes and achieves the same result?

Maybe groupby and count is similar to what you need. Here is my solution to count each number using dataframes. I'm not sure whether this will be faster than using the RDD or not.

# toy example (assumes an active SparkSession named `spark`)
import pandas as pd

df = spark.createDataFrame(pd.DataFrame([1, 1, 2, 5, 5, 5, 6], columns=['MYCOLUMN']))

df_count = df.groupby('MYCOLUMN').count().sort('MYCOLUMN')

Output from df_count.show():

+--------+-----+
|MYCOLUMN|count|
+--------+-----+
|       1|    2|
|       2|    1|
|       5|    3|
|       6|    1|
+--------+-----+

Now you can turn this into a Counter-like dictionary using rdd:

dict(df_count.rdd.map(lambda x: (x['MYCOLUMN'], x['count'])).collect())

This will give the output {1: 2, 2: 1, 5: 3, 6: 1}.
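
Since the question asked to avoid RDD conversion entirely, here is a small variant (my own addition, not part of the original answer) that builds the same dict by collecting the aggregated dataframe's Row objects directly; after the groupby the result is tiny, so collecting it is cheap:

# Skip the rdd step: collect() on the aggregated DataFrame returns
# Row objects that can be unpacked into a dict with a comprehension.
counts = {row['MYCOLUMN']: row['count'] for row in df_count.collect()}
# counts == {1: 2, 2: 1, 5: 3, 6: 1}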
