
PySpark - Broadcast Spark DataFrame

I am trying to broadcast a Spark DataFrame. I have tried a couple of approaches, but neither of them works. I want to loop over the columns of another DataFrame for some processing, restricted to the columns whose Result is 1 in SchemaWithHeader. For example, the loop is required for the columns Name, Age and Salary.

  • Approach 1
SchemaDFWithoutHeader = [('Name',1),('Age',1),('gender',0),('dept',0),("salary",1)]
rdd = spark.sparkContext.broadcast(SchemaDFWithoutHeader)
SchemaWithHeader = rdd.map(lambda x: Row(ColName=x[0], Result=bool(x[1])))

This gives the following error:

 SchemaWithHeader = rdd.map(lambda x: Row(ColName=x[0], Result=bool(x[1])))
AttributeError: 'Broadcast' object has no attribute 'map'

A DataFrame doesn't seem to have a broadcast method. I am not joining the two DataFrames with a SQL query; I am using a loop to access the SchemaWithHeader DataFrame.

  • Approach 2
SchemaDFWithoutHeader = [('Name',1),('Age',1),('gender',0),('dept',0),("salary",1)]
rdd = spark.sparkContext.parallelize(SchemaDFWithoutHeader)
SchemaWithHeader = rdd.map(lambda x: Row(ColName=x[0], Result=bool(x[1])))
SchemaDF = spark.createDataFrame(SchemaWithHeader)
spark.sparkContext.broadcast(SchemaDF)
SchemaDF.registerTempTable("DFSchema")

This gives the following error:

py4j.Py4JException: Method __getstate__([]) does not exist

The error says it all. In your code below:

rdd = spark.sparkContext.broadcast(SchemaDFWithoutHeader)

rdd is a broadcast variable, not an RDD. To work with the data it wraps, go through rdd.value. The following is the way to use it.

SchemaWithHeader = rdd.value.map(lambda x: Row(ColName=x[0], Result=bool(x[1])))

Hope this helps... Keep sharing with the community :)

Edit 1: Since you are broadcasting a list, rdd.value gives you a list back. A Python list does not have a map method, so you get the error mentioned in the comments. Moreover, if you try to broadcast an RDD you will get the following error: "It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations;"
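To make the point about rdd.value concrete, here is a minimal sketch of Approach 1 reworked with a list comprehension. It assumes `spark` is an existing SparkSession, as in the question; the function names `selected_columns` and `broadcast_and_select` and the small `names` RDD are hypothetical, introduced only for illustration.

```python
def selected_columns(schema_pairs):
    """Return the names whose flag is truthy, in their original order."""
    return [name for name, flag in schema_pairs if bool(flag)]


def broadcast_and_select(spark):
    """Broadcast the list from Approach 1 and read it back through .value.

    `spark` is assumed to be an existing SparkSession.
    """
    schema_pairs = [("Name", 1), ("Age", 1), ("gender", 0), ("dept", 0), ("salary", 1)]

    # broadcast() returns a Broadcast wrapper, not an RDD, so it has no .map
    schema_bc = spark.sparkContext.broadcast(schema_pairs)

    # .value hands back the original Python list; use a comprehension on it
    on_driver = selected_columns(schema_bc.value)

    # Inside a task, capture the Broadcast object and read .value there
    names = spark.sparkContext.parallelize(["Name", "dept", "salary"])  # hypothetical data
    on_executors = names.filter(
        lambda n: n in {name for name, flag in schema_bc.value if flag}
    ).collect()

    return on_driver, on_executors
```

Calling `broadcast_and_select(spark)` from a PySpark shell or job should return the flagged column names on the driver and the filtered names from the executors. If the loop over columns only ever runs on the driver, the plain list is enough and the broadcast can be skipped.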

Basically, you cannot broadcast an RDD because it is already a distributed data structure: it is split into partitions, and those partitions already sit on multiple machines.
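The same reasoning applies to the DataFrame in Approach 2. One possible workaround for a small lookup table, sketched here under assumptions, is to collect it to the driver and broadcast the resulting plain Python data. The helper names below are hypothetical; `spark` is assumed to be an existing SparkSession and `schema_df` a small DataFrame with the columns ColName and Result from the question.

```python
def rows_to_pairs(rows):
    """Turn collected rows into plain (name, flag) tuples.

    Works with anything indexable by 'ColName' and 'Result',
    such as pyspark Row objects or dicts.
    """
    return [(row["ColName"], bool(row["Result"])) for row in rows]


def flagged_columns(pairs):
    """Return the column names whose flag is True, in order."""
    return [name for name, flag in pairs if flag]


def broadcast_small_dataframe(spark, schema_df):
    """Collect a small DataFrame and broadcast its contents, not the DataFrame.

    collect() pulls every row to the driver, so this is only
    sensible when schema_df is small.
    """
    pairs = rows_to_pairs(schema_df.collect())
    return spark.sparkContext.broadcast(pairs)
```

With this, `flagged_columns(broadcast_small_dataframe(spark, SchemaDF).value)` would give the list of columns to loop over. For the separate case of joining a small DataFrame to a large one, `pyspark.sql.functions.broadcast(df)` is the join hint to use; it is unrelated to `sparkContext.broadcast`.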

Note: I hope the code you wrote was just to demonstrate the issue, as I could not follow the thought process behind it. The answer is still valid, however. I recommend getting familiar with the concept of broadcast variables before using them in your project.

Cheers!

Harjeet
