
Convert Pyspark dataframe column to dict without RDD conversion

I have a Spark dataframe with a column of integers:

MYCOLUMN:
1
1
2
5
5
5
6

The goal is to get output equivalent to collections.Counter([1, 1, 2, 5, 5, 5, 6]). I can achieve the desired result by converting the column to an RDD, calling collect, and passing the result to Counter, but this is rather slow for large dataframes.
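For reference, a minimal sketch of the RDD-based baseline described above (the question does not show the exact code; it assumes the dataframe is named df):

from collections import Counter

# Pull the column values to the driver over the RDD API, then count them.
counts = Counter(df.select('MYCOLUMN').rdd.map(lambda row: row[0]).collect())
# counts == Counter({5: 3, 1: 2, 2: 1, 6: 1})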

Is there a better approach that uses dataframes that can achieve the same result?

Maybe groupby followed by count is similar to what you need. Here is my solution for counting each number using dataframes. I'm not sure whether it will be faster than using an RDD.

import pandas as pd

# toy example; assumes an existing SparkSession named `spark`
df = spark.createDataFrame(pd.DataFrame([1, 1, 2, 5, 5, 5, 6], columns=['MYCOLUMN']))

df_count = df.groupby('MYCOLUMN').count().sort('MYCOLUMN')

Output from df_count.show():

+--------+-----+
|MYCOLUMN|count|
+--------+-----+
|       1|    2|
|       2|    1|
|       5|    3|
|       6|    1|
+--------+-----+

Now, you can turn this into a Counter-like dictionary using the rdd:

dict(df_count.rdd.map(lambda x: (x['MYCOLUMN'], x['count'])).collect())

This gives the output {1: 2, 2: 1, 5: 3, 6: 1}.
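Since the question asks to avoid the RDD conversion entirely, the same dictionary can also be built straight from the aggregated dataframe's collect() (a sketch; Row objects can be indexed by column name):

# Build the dict directly from the small aggregated result, no explicit RDD step.
result = {row['MYCOLUMN']: row['count'] for row in df_count.collect()}
# result == {1: 2, 2: 1, 5: 3, 6: 1}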

