I am trying to broadcast a Spark DataFrame. I have tried a couple of approaches, but none of them works. I want to loop over the columns of one DataFrame and process only those columns whose Result is 1 in another DataFrame, SchemaWithHeader (columns ColName and Result). For example, the loop should cover the columns Name, Age and Salary.
SchemaDFWithoutHeader = [('Name',1),('Age',1),('gender',0),('dept',0),("salary",1)]
rdd = spark.sparkContext.broadcast(SchemaDFWithoutHeader)
SchemaWithHeader = rdd.map(lambda x: Row(ColName=x[0], Result=bool(x[1])))
I get the following error:
SchemaWithHeader = rdd.map(lambda x: Row(ColName=x[0], Result=bool(x[1])))
AttributeError: 'Broadcast' object has no attribute 'map'
A DataFrame doesn't have a broadcast method. I am not joining the two DataFrames with a SQL query; instead I use a loop to access the SchemaWithHeader DataFrame.
SchemaDFWithoutHeader = [('Name',1),('Age',1),('gender',0),('dept',0),("salary",1)]
rdd = spark.sparkContext.parallelize(SchemaDFWithoutHeader)
SchemaWithHeader = rdd.map(lambda x: Row(ColName=x[0], Result=bool(x[1])))
SchemaDF = spark.createDataFrame(SchemaWithHeader)
spark.sparkContext.broadcast(SchemaDF)
SchemaDF.registerTempTable("DFSchema")
I get the following error:
py4j.Py4JException: Method __getstate__([]) does not exist
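The error message says it all. In this line of your code:
rdd = spark.sparkContext.broadcast(SchemaDFWithoutHeader)
rdd is a Broadcast variable, not an RDD, so it has no map method. Read the broadcast data through rdd.value. Because the value here is a plain Python list, build the rows with a list comprehension rather than calling .map on it:
SchemaWithHeader = [Row(ColName=x[0], Result=bool(x[1])) for x in rdd.value]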
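To make the point about rdd.value concrete, here is a minimal sketch, not a definitive implementation. The helper names columns_to_process and schema_dataframe_from_broadcast are made up for illustration, and the Spark part assumes pyspark is installed and that an existing SparkSession is passed in as spark.

```python
# Sketch only: the helper names below are illustrative, not part of the question.
schema_flags = [("Name", 1), ("Age", 1), ("gender", 0), ("dept", 0), ("salary", 1)]


def columns_to_process(flags):
    """Return the column names whose flag is truthy, in the original order."""
    return [name for name, flag in flags if bool(flag)]


def schema_dataframe_from_broadcast(spark, flags):
    """Broadcast a plain list, then build a DataFrame from its .value on the driver.

    Assumes pyspark is installed and `spark` is an existing SparkSession.
    """
    from pyspark.sql import Row  # imported lazily so the helper above runs without Spark

    bc = spark.sparkContext.broadcast(flags)  # bc is a Broadcast object, not an RDD
    # bc.value is the original Python list, so use a comprehension, not .map
    rows = [Row(ColName=name, Result=bool(flag)) for name, flag in bc.value]
    return spark.createDataFrame(rows)


# Driver-side loop over the flagged columns: no broadcast is needed for this part.
for col_name in columns_to_process(schema_flags):
    print("processing column:", col_name)
```

If the loop over the flagged columns runs only on the driver, as in the last lines of the sketch, the plain list is enough. A broadcast variable only pays off when the same data is read inside tasks running on the executors.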
Hope this helps. Keep sharing with the community :)
Edit 1: Since you are broadcasting a list, rdd.value returns a list. A Python list has no map method, so calling rdd.value.map(...) raises the error mentioned in the comments. Moreover, if you try to broadcast an RDD, you will get the following error: "It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations;"
Basically, you cannot broadcast an RDD because it is already a distributed data structure: it is split into partitions, and those partitions already sit on multiple machines.
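If the small schema data already lives in a DataFrame, one common workaround is to collect it to the driver first and broadcast the resulting plain Python object. The sketch below is written under assumptions: the function names are hypothetical, schema_df is assumed to have the columns ColName and Result as in the question, and data_rdd is assumed to be an RDD of dicts keyed by column name.

```python
def rows_to_lookup(pairs):
    """Turn (ColName, Result) pairs into a {column name: bool} dictionary."""
    return {name: bool(result) for name, result in pairs}


def keep_flagged(record, lookup):
    """Keep only the fields of a dict record whose column is flagged in the lookup."""
    return {key: value for key, value in record.items() if lookup.get(key, False)}


def broadcast_schema(spark, schema_df):
    """Collect a small DataFrame to the driver and broadcast it as a dict.

    Assumes schema_df is small enough to fit in driver memory.
    """
    pairs = [(row["ColName"], row["Result"]) for row in schema_df.collect()]
    return spark.sparkContext.broadcast(rows_to_lookup(pairs))


def filter_records(data_rdd, bc_lookup):
    """Read the broadcast value inside a transformation (runs on the executors).

    Assumes data_rdd is an RDD of dicts keyed by column name.
    """
    return data_rdd.map(lambda record: keep_flagged(record, bc_lookup.value))
```

What gets broadcast here is an ordinary dictionary, which can be pickled and shipped to every executor, unlike the DataFrame or RDD itself. Note that the dictionary lookup is case-sensitive, so "salary" and "Salary" are different keys.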
Note: I hope the code you posted was only meant to demonstrate the issue, as I could not follow the reasoning behind it. The answer still holds, though. I recommend getting familiar with the concept of broadcast variables before using them in your project.
Cheers!
Harjeet