
PySpark - loop through each row of a dataframe and run a Hive query

I have a dataframe with 100 rows and columns [name, age, date, hour]. I need to partition this dataframe by the distinct values of date. Say there are 20 distinct date values among these 100 rows; then I need to spawn 20 parallel Hive queries, where each HiveQL query joins one of these partitions with a Hive table. The Hive table [dept, course, date] is partitioned by the date field.

The Hive table is huge, so I need to break this work into multiple smaller joins and then aggregate the results. Any recommendations on how I can achieve this?
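The per-date loop I have in mind would look roughly like the sketch below (assuming a SparkSession named spark with Hive support, the 100-row dataframe bound to df, and a Hive table named dept_course; the table name is just a placeholder):

from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# collect the ~20 distinct dates from the small dataframe
dates = [r["date"] for r in df.select("date").distinct().collect()]

# one join per date; filtering on the partition column lets Hive prune partitions
parts = []
for d in dates:
    small_part = df.filter(F.col("date") == d)
    hive_part = spark.table("dept_course").filter(F.col("date") == d)
    parts.append(small_part.join(hive_part, "date"))

# union the per-date results and aggregate afterwards (transformations are lazy,
# so nothing actually runs until an action is called)
joined = reduce(lambda a, b: a.unionByName(b), parts)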

You can do this in a single query: repartition both dataframes on date and join them, broadcasting the smaller table (~10MB of data) during the join. Here is an example:

from pyspark.sql import functions as F

# df2 is the smaller dataframe; in your case it is the one with [name, age, date, hour]
df3 = df1.repartition("date").join(
    F.broadcast(df2.repartition("date")),
    "date"
)
# Now perform any operation on df3
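A fuller sketch along the same lines, assuming the Hive table is registered as dept_course, the 100-row dataframe is df_small, and the goal is, say, a row count per dept and date (the table name, dataframe name, and aggregation are placeholders, not from the original post):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

hive_df = spark.table("dept_course")

# restrict the Hive side to the dates that actually occur in the small dataframe,
# so partition pruning reads only the ~20 relevant partitions
dates = [r["date"] for r in df_small.select("date").distinct().collect()]
hive_pruned = hive_df.filter(F.col("date").isin(dates))

# broadcast the 100-row dataframe so the join does not shuffle the large table
joined = hive_pruned.join(F.broadcast(df_small), "date")

# aggregate the joined result
result = joined.groupBy("dept", "date").count()
result.show()

Filtering on the partition column with isin keeps the scan limited to the partitions you need, and broadcasting the tiny dataframe avoids shuffling the large Hive table at all.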
