I have a Spark DataFrame with an integer column:
MYCOLUMN:
1
1
2
5
5
5
6
The goal is to get output equivalent to collections.Counter([1,1,2,5,5,5,6]). I can achieve the desired result by converting the column to an RDD, calling collect, and passing the result to Counter, but this is rather slow for large DataFrames.
Is there a better approach that uses dataframes that can achieve the same result?
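For reference, this is the target result the question describes, sketched in pure Python with the column values inlined as a list:

```python
from collections import Counter

# The column values from the example above, collected into a list.
values = [1, 1, 2, 5, 5, 5, 6]

counts = Counter(values)
print(dict(counts))  # {1: 2, 2: 1, 5: 3, 6: 1}
```

This is exactly what becomes slow at scale: every value has to be shipped to the driver before counting.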
Maybe groupby followed by count is what you need. Here is a solution that counts each value using the DataFrame API; I'm not sure whether it will be faster than going through an RDD.
# toy example
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pd.DataFrame([1, 1, 2, 5, 5, 5, 6], columns=['MYCOLUMN']))
df_count = df.groupby('MYCOLUMN').count().sort('MYCOLUMN')
Output from df_count.show()
+--------+-----+
|MYCOLUMN|count|
+--------+-----+
| 1| 2|
| 2| 1|
| 5| 3|
| 6| 1|
+--------+-----+
Now you can convert the result into a Counter-like dictionary via the RDD API:
dict(df_count.rdd.map(lambda x: (x['MYCOLUMN'], x['count'])).collect())
This gives {1: 2, 2: 1, 5: 3, 6: 1}.
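Since the aggregated result is small, the same dictionary can also be built without the RDD hop, by iterating over df_count.collect() directly. A sketch below uses plain dicts to stand in for the pyspark.sql.Row objects that collect() returns (Row supports the same x['column'] indexing):

```python
# Stand-ins for the Row objects returned by df_count.collect();
# on a real DataFrame the same comprehension applies unchanged:
#   {x['MYCOLUMN']: x['count'] for x in df_count.collect()}
collected = [
    {'MYCOLUMN': 1, 'count': 2},
    {'MYCOLUMN': 2, 'count': 1},
    {'MYCOLUMN': 5, 'count': 3},
    {'MYCOLUMN': 6, 'count': 1},
]

counts = {x['MYCOLUMN']: x['count'] for x in collected}
print(counts)  # {1: 2, 2: 1, 5: 3, 6: 1}
```

Either way, only the already-aggregated (MYCOLUMN, count) pairs are moved to the driver, not the full column.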