
Passing Spark dataframe between Scala methods - Performance

Recently, I developed a Spark Streaming application using Scala and Spark. In this application, I made extensive use of implicit classes (the "Pimp my Library" pattern) to implement general-purpose utilities, such as writing a DataFrame to HBase, by creating an implicit class that extends Spark's DataFrame. For example:

implicit class DataFrameExtension(private val dataFrame: DataFrame) extends Serializable {
  ..... // Custom methods to perform some computations
}
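The mechanics of the pattern can be shown without Spark at all. A minimal sketch, using a plain String in place of a DataFrame; the `shout` extension method is purely illustrative:

```scala
object ImplicitClassDemo {
  // Wrapping String in an implicit class makes `shout` look like a method
  // defined on String itself; the compiler rewrites call sites to go
  // through a StringExtension wrapper.
  implicit class StringExtension(private val s: String) {
    def shout: String = s.toUpperCase + "!"
  }

  def main(args: Array[String]): Unit = {
    println("hello".shout) // prints "HELLO!"
  }
}
```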

However, a senior architect on my team refactored the code (citing style mismatch and performance as reasons) and moved these methods to a new class. These methods now accept a DataFrame as an argument.
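On the same toy example as above, the refactored style turns the extension method into an ordinary method on a utility object that takes the value as an argument. The behaviour is identical; only the call-site syntax changes (names here are made up for illustration):

```scala
object StringUtils {
  // Plain static-style method: the value is passed explicitly.
  def shout(s: String): String = s.toUpperCase + "!"
}

// Call site: StringUtils.shout("hello") instead of "hello".shout
```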

Can anyone help me with the following:

  1. Does Scala's implicit class create any overhead at run time?
  2. Does passing a DataFrame object between methods create any overhead, either in terms of method calls or serialization?
  3. I have searched a bit, but couldn't find any style guide that offers guidelines on using implicit classes or methods over traditional methods.

Thanks in advance.

Does Scala's implicit class create any overhead at run time?

Not in your case. There is some overhead when the wrapped type is an AnyVal (and thus needs to be boxed). Implicits are resolved at compile time, and apart from perhaps a few virtual method calls there should be no overhead.
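The AnyVal point can be sketched in plain Scala. A regular implicit class allocates a wrapper object on every call, while declaring the wrapper itself a value class (`extends AnyVal`) usually lets the compiler elide that allocation; the names below are illustrative only:

```scala
object ImplicitOverheadDemo {
  // Regular implicit class: a RichIntBoxed wrapper is allocated per call.
  implicit class RichIntBoxed(val n: Int) {
    def doubledBoxed: Int = n * 2
  }

  // Implicit value class: typically compiles down to a static method call,
  // with no wrapper allocation at most call sites.
  implicit class RichIntValue(val n: Int) extends AnyVal {
    def doubledValue: Int = n * 2
  }
}
```

For a heavyweight reference type like DataFrame, no boxing is involved either way, which is why this overhead does not apply in the asker's case.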

Does passing a DataFrame object between methods create any overhead, either in terms of method calls or serialization?

No, no more than for any other type. Obviously there will be no serialization.

... if I pass dataframes between methods in Spark code, it might create a closure and, as a result, pull in the parent class that holds the dataframe object.

Only if you use scoped variables inside your dataframe operations, for example filter($"col" === myVar) where myVar is declared in the scope of the method. In this case, Spark might serialize the wrapping class, but it's easy to avoid. Please remember that dataframes are passed quite often and quite deep inside Spark's own code, and probably in every other library that you might be using (data sources, for example).
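The capture problem, and the usual way to avoid it, can be sketched without Spark (class and method names below are made up): referencing a field inside a lambda captures the enclosing instance, so serializing the lambda drags the whole class along, whereas copying the field to a local val first keeps the closure self-contained.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

class Job { // deliberately NOT Serializable, like a typical driver-side class
  val threshold = 10

  // Referencing the field `threshold` captures `this` (the whole Job).
  def badFilter: Int => Boolean = x => x > threshold

  // Copying the field to a local val first captures only a plain Int.
  def goodFilter: Int => Boolean = {
    val t = threshold
    x => x > t
  }
}

object ClosureDemo {
  // Returns true if `f` survives Java serialization, as Spark requires of
  // the closures it ships to executors.
  def isSerializable(f: AnyRef): Boolean =
    try { new ObjectOutputStream(new ByteArrayOutputStream).writeObject(f); true }
    catch { case _: NotSerializableException => false }
}
```

In Spark code the same trick applies: assign myVar to a local val before using it inside filter, map, or similar operations.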

It is very common (and handy) to use extension implicit classes the way you did.

