
show() subset of big dataframe pyspark

I have a big pyspark dataframe on which I am performing a number of transformations and joining with other dataframes. I would like to check whether the transformations and joins succeed and whether the dataframe looks as intended, but how can I show a small subset of the dataframe?

I have tried numerous things, e.g.

df.show(5)

and

df.limit(5).show()

but everything I try kicks off a large number of jobs, resulting in slow performance. I could spin up a very large cluster, but is there a way of getting only a small subset of the dataframe, fast?

Try the rdd equivalent of the dataframe:

 rdd_df = df.rdd
 rdd_df.take(5)

Or, try printing the dataframe schema:

 df.printSchema()

First, to show a certain number of rows you can use the limit() method after calling a select() method, like this:

df.select('*').limit(5).show()

Also, the df.show() action will only print the first 20 rows; it will not print the whole dataframe.

Second, and more importantly:

Spark Actions:

A Spark dataframe does not contain data; it contains instructions in the form of an operation graph. Since Spark works with big data, it does not execute each operation as it is called, in order to prevent slow performance. Instead, methods are split into two kinds, Actions and Transformations, and transformations are collected and recorded in the operation graph.
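A quick way to see that accumulated operation graph is explain(), which prints the query plan without triggering a job; a minimal sketch, assuming a dataframe named df:

 df.explain()  # prints the plan of the accumulated transformations, no job is run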

An action is a method that causes the dataframe to execute all accumulated operations in the graph; this is what causes the slow performance, since it executes everything (note: UDFs are extremely slow).

show() is an action: when you call show(), Spark has to compute every preceding transformation to show you the actual data.

Keep that in mind.

To iterate faster, you have to understand the difference between actions and transformations.

A transformation is any operation that results in another RDD/Spark dataframe, for example df.filter, df.join, or df.groupBy. An action is any operation that results in something that is not an RDD/dataframe, for example df.write, df.count(), or df.show().

Transformations are lazy: unlike plain Python, writing df1 = df.filter(...) and df2 = df1.groupBy(...) does not mean that df, df1, and df2 are materialized in memory. Instead, no data flows through memory until you call an action, such as .show() in your case (see the sketch below).
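A minimal sketch of this lazy behaviour, assuming a SparkSession named spark and a hypothetical input path and column names:

 from pyspark.sql import functions as F

 df = spark.read.parquet("/data/events")     # transformation: nothing is read yet
 df1 = df.filter(F.col("country") == "DE")   # transformation: only added to the plan
 df2 = df1.groupBy("user_id").count()        # transformation: still no job has run
 df2.show(5)                                 # action: the whole plan above executes now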

And calling df.limit(5).show() will not necessarily speed up your iteration, because the limit only restricts the final dataframe that gets printed, not the original data that flows through memory while the earlier transformations are computed.

As others have suggested, you should limit the size of your input data in order to test more quickly whether your transformations work. To further improve your iteration, you can cache the dataframes produced by transformations that are already mature, instead of running them over and over again; a sketch of both ideas follows.
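A minimal sketch, assuming a SparkSession named spark; the input paths, sample fraction, and join key are hypothetical placeholders:

 small_df = spark.read.parquet("/data/big_table").sample(fraction=0.01, seed=42)
 other_df = spark.read.parquet("/data/other_table")

 joined = small_df.join(other_df, on="id", how="left")
 joined.cache()    # keep the result of the already-mature transformations in memory
 joined.count()    # an action that materializes the cache
 joined.show(5)    # later actions reuse the cached data instead of recomputing the join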
