
show() subset of big dataframe pyspark

I have a big pyspark dataframe which I am performing a number of transformations on and joining with other dataframes. I would like to investigate whether the transformations and joins succeed and whether the dataframe looks as intended, but how can I show a small subset of the dataframe?

I have tried numerous things, e.g.

df.show(5)

and

df.limit(5).show()

but everything I try launches a large number of jobs, resulting in slow performance. I could spin up a very large cluster, but is there a way of getting only a small subset of the dataframe, fast?

Try the RDD equivalent of the dataframe:

 rdd_df = df.rdd        # get the DataFrame's underlying RDD
 rdd_df.take(5)         # take(5) returns the first 5 records

Or, try printing the dataframe schema:

 df.printSchema()

First, to show a certain number of rows, you can use the limit() method after calling a select() method, like this:

df.select('*').limit(5).show()

Also, the df.show() action only prints the first 20 rows by default; it does not print the whole dataframe.

Second, and more importantly:

Spark Actions:

A Spark dataframe does not contain data; it contains instructions and an operation graph. Since Spark works with big data, it does not execute each operation as it is called, to prevent slow performance. Instead, methods are separated into two kinds, actions and transformations; transformations are collected and accumulated into an operation graph.

An action is a method that causes the dataframe to execute all accumulated operations in the graph. This is what causes the slow performance, since it executes everything (note: UDFs are especially slow).

show() is an action: when you call show(), Spark has to compute every preceding transformation to show you the actual data.

Keep that in mind.
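As a quick illustration (a minimal sketch; the input path and column names are made up), you can inspect the accumulated operation graph without triggering any computation by calling explain(), which is not an action:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical input; substitute your own data source
df = spark.read.parquet("/data/events.parquet")

# transformations only: these return immediately, nothing is computed
result = df.filter(F.col("amount") > 0).groupBy("user_id").count()

# explain() prints the accumulated query plan without running any job
result.explain()

# show() is an action: only now is the whole graph executed
result.show(5)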

To iterate faster, you have to understand the difference between actions and transformations.

A transformation is any operation that results in another RDD/Spark dataframe, for example df.filter(), df.join(), or df.groupBy(). An action is any operation that results in a non-RDD, for example df.write(), df.count(), or df.show().

Transformations are lazy. That is, unlike in plain Python, after df1 = df.filter(...) and df2 = df1.groupBy(...), none of df, df1, or df2 is materialized in memory. No data flows into memory until you call an action, like .show() in your case.
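For instance (a sketch with made-up column names), both assignments below return instantly because they only extend the operation graph:

# each of these returns immediately; no data is read or moved yet
df1 = df.filter(df["status"] == "active")  # transformation
df2 = df1.groupBy("country").count()       # transformation

# only this action triggers reading the input and executing both steps
df2.show()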

And calling df.limit(5).show() will not speed up your iteration, because the limit only restricts the final dataframe that gets printed out, not the original data that flows through memory.
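To make that concrete (again a sketch with made-up columns), the limit below trims only the aggregated output, so Spark still has to scan and shuffle all of the input to produce it:

# the groupBy must still process every input row;
# limit(5) merely trims the aggregated result before printing
df.groupBy("country").count().limit(5).show()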

As others have suggested, you should limit your input data size to test more quickly whether your transformations work. To further improve your iteration, you can cache the dataframes produced by transformations you have already validated, instead of recomputing them over and over again.
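A sketch of both ideas (the sample fraction and column name are assumptions to adjust for your data):

from pyspark.sql import functions as F

# develop against a small random sample instead of the full input
small_df = df.sample(fraction=0.01, seed=42)

# cache the result of transformations you already trust, so later
# actions reuse it instead of recomputing the whole graph
trusted = small_df.filter(F.col("amount") > 0).cache()

trusted.count()  # first action: computes the result and fills the cache
trusted.show(5)  # reuses the cached data instead of recomputing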
