
PySpark - loop through each row of a dataframe and run a Hive query

I have a dataframe with 100 rows and columns [name, age, date, hour]. I need to partition this dataframe by the distinct values of date. Say there are 20 distinct date values among these 100 rows; then I need to spawn 20 parallel Hive queries, where each HiveQL query joins one of these partitions with a Hive table. The Hive table [dept, course, date] is partitioned by the date field.

The Hive table is huge, so I need to break this work into multiple smaller joins and then aggregate the results. Any recommendations on how I can achieve this?
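The per-date loop I have in mind would look roughly like the sketch below (assuming a SparkSession named spark with Hive support, the 100-row dataframe bound to df, and a Hive table named dept_course; the table name is just a placeholder):

from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# collect the ~20 distinct dates from the small dataframe
dates = [r["date"] for r in df.select("date").distinct().collect()]

# one join per date; filtering on the partition column lets Hive prune partitions
parts = []
for d in dates:
    small_part = df.filter(F.col("date") == d)
    hive_part = spark.table("dept_course").filter(F.col("date") == d)
    parts.append(small_part.join(hive_part, "date"))

# union the per-date results and aggregate afterwards (transformations are lazy,
# so nothing actually runs until an action is called)
joined = reduce(lambda a, b: a.unionByName(b), parts)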

You can do this in a single query: repartition both dataframes on date and join them, broadcasting the smaller table (~10MB of data) during the join. Here is an example:

from pyspark.sql import functions as F

# df2 is the smaller dataframe; in your case it is the one with [name, age, date, hour]
df3 = df1.repartition("date").join(
    F.broadcast(df2.repartition("date")),
    "date"
)
# Now perform any operation on df3
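A fuller sketch along the same lines, assuming the Hive table is registered as dept_course, the 100-row dataframe is df_small, and the goal is, say, a row count per dept and date (the table name, dataframe name, and aggregation are placeholders, not from the original post):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

hive_df = spark.table("dept_course")

# restrict the Hive side to the dates that actually occur in the small dataframe,
# so partition pruning reads only the ~20 relevant partitions
dates = [r["date"] for r in df_small.select("date").distinct().collect()]
hive_pruned = hive_df.filter(F.col("date").isin(dates))

# broadcast the 100-row dataframe so the join does not shuffle the large table
joined = hive_pruned.join(F.broadcast(df_small), "date")

# aggregate the joined result
result = joined.groupBy("dept", "date").count()
result.show()

Filtering on the partition column with isin keeps the scan limited to the partitions you need, and broadcasting the tiny dataframe avoids shuffling the large Hive table at all.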
