
Convert Pyspark dataframe column to dict without RDD conversion

I have a Spark dataframe with a column of integers:

MYCOLUMN:
1
1
2
5
5
5
6

The goal is to get output equivalent to collections.Counter([1,1,2,5,5,5,6]). I can achieve the desired result by converting the column to an RDD, calling collect, and then applying Counter, but this is rather slow for large dataframes.
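
For concreteness, the slow baseline described above might look something like this (a sketch only; the question doesn't show the exact code, and the dataframe is assumed to be named df):

from collections import Counter

# Baseline: drop to the RDD API, pull every value to the driver,
# then count locally. collect() on a large column is the bottleneck.
values = df.select('MYCOLUMN').rdd.map(lambda row: row[0]).collect()
counts = Counter(values)  # Counter({5: 3, 1: 2, 2: 1, 6: 1})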

Is there a better approach that uses dataframes and achieves the same result?

Maybe groupby and count is similar to what you need. Here is my solution to count each number using dataframes. I'm not sure whether this will be faster than using the RDD or not.

# toy example (assumes an active SparkSession named `spark`)
import pandas as pd

df = spark.createDataFrame(pd.DataFrame([1, 1, 2, 5, 5, 5, 6], columns=['MYCOLUMN']))

df_count = df.groupby('MYCOLUMN').count().sort('MYCOLUMN')

Output from df_count.show():

+--------+-----+
|MYCOLUMN|count|
+--------+-----+
|       1|    2|
|       2|    1|
|       5|    3|
|       6|    1|
+--------+-----+

Now you can turn this into a Counter-like dictionary using rdd:

dict(df_count.rdd.map(lambda x: (x['MYCOLUMN'], x['count'])).collect())

This will give the output {1: 2, 2: 1, 5: 3, 6: 1}.
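
Since the question asked to avoid RDD conversion entirely, here is a small variant (my own addition, not part of the original answer) that builds the same dict by collecting the aggregated dataframe's Row objects directly; after the groupby the result is tiny, so collecting it is cheap:

# Skip the rdd step: collect() on the aggregated DataFrame returns
# Row objects that can be unpacked into a dict with a comprehension.
counts = {row['MYCOLUMN']: row['count'] for row in df_count.collect()}
# counts == {1: 2, 2: 1, 5: 3, 6: 1}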
