I have a pyspark.sql.dataframe.DataFrame
like the following
df.show()
+--------------------+----+----+---------+----------+---------+----------+---------+
| ID|Code|bool| lat| lon| v1| v2| v3|
+--------------------+----+----+---------+----------+---------+----------+---------+
|5ac52674ffff34c98...|IDFA| 1|42.377167| -71.06994|17.422535|1525319638|36.853622|
|5ac52674ffff34c98...|IDFA| 1| 42.37747|-71.069824|17.683573|1525319639|36.853622|
|5ac52674ffff34c98...|IDFA| 1| 42.37757| -71.06942|22.287935|1525319640|36.853622|
|5ac52674ffff34c98...|IDFA| 1| 42.37761| -71.06943|19.110023|1525319641|36.853622|
|5ac52674ffff34c98...|IDFA| 1|42.377243| -71.06952|18.904774|1525319642|36.853622|
|5ac52674ffff34c98...|IDFA| 1|42.378254| -71.06948|20.772903|1525319643|36.853622|
|5ac52674ffff34c98...|IDFA| 1| 42.37801| -71.06983|18.084948|1525319644|36.853622|
|5ac52674ffff34c98...|IDFA| 1|42.378693| -71.07033| 15.64326|1525319645|36.853622|
|5ac52674ffff34c98...|IDFA| 1|42.378723|-71.070335|21.093477|1525319646|36.853622|
|5ac52674ffff34c98...|IDFA| 1| 42.37868| -71.07034|21.851894|1525319647|36.853622|
|5ac52674ffff34c98...|IDFA| 1|42.378716| -71.07029|20.583202|1525319648|36.853622|
|5ac52674ffff34c98...|IDFA| 1| 42.37872| -71.07067|19.738768|1525319649|36.853622|
|5ac52674ffff34c98...|IDFA| 1|42.379112| -71.07097|20.480911|1525319650|36.853622|
|5ac52674ffff34c98...|IDFA| 1| 42.37952| -71.0708|20.526752|1525319651| 44.93808|
|5ac52674ffff34c98...|IDFA| 1| 42.37902| -71.07056|20.534052|1525319652| 44.93808|
|5ac52674ffff34c98...|IDFA| 1|42.380203| -71.0709|19.921381|1525319653| 44.93808|
|5ac52674ffff34c98...|IDFA| 1| 42.37968|-71.071144| 20.12599|1525319654| 44.93808|
|5ac52674ffff34c98...|IDFA| 1|42.379696| -71.07114|18.760069|1525319655| 36.77853|
|5ac52674ffff34c98...|IDFA| 1| 42.38011| -71.07123|19.155525|1525319656| 36.77853|
|5ac52674ffff34c98...|IDFA| 1| 42.38022| -71.0712|16.978994|1525319657| 36.77853|
+--------------------+----+----+---------+----------+---------+----------+---------+
only showing top 20 rows
If I try to count:
%%time
df.count()
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 28.1 s
30241272
Now if I take a subset of df,
the time to count is even longer:
id0 = df.first().ID ## First ID
tmp = df.filter( (df['ID'] == id0) )
%%time
tmp.count()
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 1min 33s
Out[6]:
3299
Your question is very interesting and tricky.
I tested the following two cases on a large dataset in order to reproduce your behavior:
# Case 1
df.count() # Execution time: 37secs
# Case 2
df.filter((df['ID'] == id0)).count() # Execution time: 1.39 min
Let's see the physical plan with only .count():
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#38L])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#41L])
+- *(1) FileScan csv [] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:...], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
Let's see the physical plan with .filter() and then .count():
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#61L])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#64L])
+- *(1) Project
+- *(1) Filter (isnotnull(ID#11) && (ID#11 = Muhammed MacIntyre))
+- *(1) FileScan csv [ID#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:...], PartitionFilters: [], PushedFilters: [IsNotNull(ID), EqualTo(ID,Muhammed MacIntyre)], ReadSchema: struct<_c1:string>
Generally, when Spark counts the number of rows, it maps every row to count=1 and then reduces all the mappers to produce the final number of rows.
In Case 2, Spark first has to filter, then create the partial counts for every partition, and then run another stage to sum them up. So, for the same rows, in the second case Spark also does the filtering, which affects the computation time on large datasets. Spark is a framework for distributed processing and doesn't have indexes like Pandas, which can filter extremely fast without scanning all the rows.
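The two plans above can be sketched in plain Python (an analogy only, not Spark's actual machinery): each partition produces a partial count, and a final stage sums the partials; in the filtered case every row must additionally be tested against the predicate first.

```python
# Toy "dataset" split across two partitions.
partitions = [
    [{"ID": "a"}, {"ID": "a"}, {"ID": "b"}],
    [{"ID": "a"}, {"ID": "c"}],
]

# Case 1: plain count -- partial count per partition, then a final sum.
partial_counts = [sum(1 for _ in part) for part in partitions]
total = sum(partial_counts)  # 5

# Case 2: filter then count -- every row is also checked against the
# predicate before it contributes to a partial count (the extra work).
id0 = "a"
filtered_partials = [sum(1 for row in part if row["ID"] == id0)
                     for part in partitions]
filtered_total = sum(filtered_partials)  # 3
```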
In that simple case you can't do much to improve the execution time. You can try your application with different configuration settings, e.g. spark.sql.shuffle.partitions, spark.default.parallelism, the number of executors, the executor memory, etc.
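For example, such settings can be supplied when building the session (a sketch only; the values below are illustrative placeholders, not tuned recommendations):

```python
from pyspark.sql import SparkSession

# Sketch: placeholder values, to be tuned for your cluster.
spark = (
    SparkSession.builder
    .appName("count-benchmark")
    .config("spark.sql.shuffle.partitions", "200")  # shuffle parallelism
    .config("spark.default.parallelism", "200")     # default RDD parallelism
    .config("spark.executor.instances", "4")        # number of executors
    .config("spark.executor.memory", "4g")          # memory per executor
    .getOrCreate()
)
```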
This is because Spark is lazily evaluated. When you call tmp.count(), that is an action step. In other words, your timing of tmp.count() also includes the filter time. If you want to truly compare the two counts, try the following:
%%time
df.count()
id0 = df.first().ID ## First ID
tmp = df.filter( (df['ID'] == id0) )
tmp.persist().show()
%%time
tmp.count()
The important part here is the tmp.persist().show() BEFORE performing the count. This performs the filter and caches the result, so tmp.count() includes only the actual count time.
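The lazy-evaluation effect can be illustrated without Spark at all, using plain Python generators (an analogy only): building a filtered view does no work until an action consumes it.

```python
work_done = []  # records which rows were actually read

def rows():
    for i in range(5):
        work_done.append(i)  # side effect marks a row being scanned
        yield {"ID": "a" if i < 3 else "b"}

# "Transformation": the generator expression is built but nothing runs yet,
# just like df.filter() in Spark.
tmp = (row for row in rows() if row["ID"] == "a")
assert work_done == []  # no rows scanned so far

# "Action" (like persist().show()): the filter actually runs here.
cached = list(tmp)

# Counting the cached result re-reads nothing.
count = len(cached)  # 3
```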