
Pyspark: count on pyspark.sql.dataframe.DataFrame takes long time

I have a pyspark.sql.dataframe.DataFrame like the following

df.show()
+--------------------+----+----+---------+----------+---------+----------+---------+
|                  ID|Code|bool|      lat|       lon|       v1|        v2|       v3|
+--------------------+----+----+---------+----------+---------+----------+---------+
|5ac52674ffff34c98...|IDFA|   1|42.377167| -71.06994|17.422535|1525319638|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37747|-71.069824|17.683573|1525319639|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37757| -71.06942|22.287935|1525319640|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37761| -71.06943|19.110023|1525319641|36.853622|
|5ac52674ffff34c98...|IDFA|   1|42.377243| -71.06952|18.904774|1525319642|36.853622|
|5ac52674ffff34c98...|IDFA|   1|42.378254| -71.06948|20.772903|1525319643|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37801| -71.06983|18.084948|1525319644|36.853622|
|5ac52674ffff34c98...|IDFA|   1|42.378693| -71.07033| 15.64326|1525319645|36.853622|
|5ac52674ffff34c98...|IDFA|   1|42.378723|-71.070335|21.093477|1525319646|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37868| -71.07034|21.851894|1525319647|36.853622|
|5ac52674ffff34c98...|IDFA|   1|42.378716| -71.07029|20.583202|1525319648|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37872| -71.07067|19.738768|1525319649|36.853622|
|5ac52674ffff34c98...|IDFA|   1|42.379112| -71.07097|20.480911|1525319650|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37952|  -71.0708|20.526752|1525319651| 44.93808|
|5ac52674ffff34c98...|IDFA|   1| 42.37902| -71.07056|20.534052|1525319652| 44.93808|
|5ac52674ffff34c98...|IDFA|   1|42.380203|  -71.0709|19.921381|1525319653| 44.93808|
|5ac52674ffff34c98...|IDFA|   1| 42.37968|-71.071144| 20.12599|1525319654| 44.93808|
|5ac52674ffff34c98...|IDFA|   1|42.379696| -71.07114|18.760069|1525319655| 36.77853|
|5ac52674ffff34c98...|IDFA|   1| 42.38011| -71.07123|19.155525|1525319656| 36.77853|
|5ac52674ffff34c98...|IDFA|   1| 42.38022|  -71.0712|16.978994|1525319657| 36.77853|
+--------------------+----+----+---------+----------+---------+----------+---------+
only showing top 20 rows

If I try to count:

%%time
df.count()

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 28.1 s

30241272

Now if I take a subset of df, the time to count is even longer.

id0 = df.first().ID  ## First ID
tmp = df.filter( (df['ID'] == id0) )

%%time
tmp.count()

CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 1min 33s
Out[6]:
3299

Your question is very interesting and tricky.

I tested with a large dataset in order to reproduce your behavior.

Problem Description

I tested the following two cases in a large dataset:

# Case 1
df.count()                            # Execution time: 37 secs

# Case 2
df.filter((df['ID'] == id0)).count()  # Execution time: 1.39 min

Explanation

Let's see the physical plan with only .count():

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#38L])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#41L])
      +- *(1) FileScan csv [] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:...], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>

Let's see the physical plan with .filter() followed by .count():

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#61L])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#64L])
      +- *(1) Project
         +- *(1) Filter (isnotnull(ID#11) && (ID#11 = Muhammed MacIntyre))
            +- *(1) FileScan csv [ID#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:...], PartitionFilters: [], PushedFilters: [IsNotNull(ID), EqualTo(ID,Muhammed MacIntyre)], ReadSchema: struct<_c1:string>
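
To reproduce plans like these on your own data, you can call .explain() on an equivalent aggregation, since .count() itself is an action that returns a plain int (a sketch, assuming the same df and id0 as in the question):

# Sketch: .count() returns an int, so inspect the plan of an equivalent
# aggregation instead (assumes `df` and `id0` from the question).
df.groupBy().count().explain()                          # plan behind the plain count
df.filter(df['ID'] == id0).groupBy().count().explain()  # plan behind filter + count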

Generally, when Spark counts the number of rows, it maps each row to count=1 and then reduces all the mappers' partial counts to produce the final number of rows.

In Case 2, Spark first has to filter the rows, then create the partial counts for every partition, and then run another stage to sum them up. So, for the same rows, in the second case Spark is also doing the filtering, which affects the computation time on large datasets. Spark is a framework for distributed processing and doesn't have indexes like Pandas, which can filter extremely fast without scanning all the rows.
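
A rough RDD-level sketch of what the two cases boil down to (illustration only, assuming the same df and id0 as above):

from operator import add

# Case 1: map every row to 1, then sum the partial counts
n_all = df.rdd.map(lambda row: 1).reduce(add)

# Case 2: the filter runs over every row before the same map/sum
n_filtered = (df.rdd
                .filter(lambda row: row['ID'] == id0)
                .map(lambda row: 1)
                .reduce(add))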

Summary

In such a simple case there isn't much you can do to improve the execution time. You can try running your application with different configuration settings (e.g. spark.sql.shuffle.partitions, spark.default.parallelism, the number of executors, the executor memory, etc.).
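
For example, a sketch of setting such options when building the session (the app name and all the values below are placeholders, not tuning advice):

from pyspark.sql import SparkSession

# Sketch only: the numbers are placeholders, not recommendations.
spark = (SparkSession.builder
         .appName("count-benchmark")
         .config("spark.sql.shuffle.partitions", "200")   # shuffle parallelism
         .config("spark.default.parallelism", "200")      # default RDD parallelism
         .config("spark.executor.instances", "4")         # number of executors
         .config("spark.executor.memory", "4g")           # memory per executor
         .getOrCreate())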

This is because Spark is lazily evaluated. When you call tmp.count(), that is an action step. In other words, your timing of tmp.count() also includes the time to run the filter. If you want to truly compare the two counts, try the following:

%%time
df.count()

id0 = df.first().ID  ## First ID
tmp = df.filter( (df['ID'] == id0) )
tmp.persist().show()

%%time
tmp.count()

The important component here is the tmp.persist().show() BEFORE performing the count. This performs the filter and caches the result. That way, the tmp.count() only includes the actual count time.
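
Note that show() only needs the first 20 rows, so it may not compute every partition and the cache can end up only partially populated. If you want tmp fully cached before timing, one option (a sketch, using the same tmp as above) is:

tmp.persist()
tmp.count()   # full pass: computes every partition and populates the cache

%%time
tmp.count()   # timed run now reads from the cached data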
