PySpark: Iterate over each row of a DataFrame and run a Hive query

Question

I have a DataFrame with 100 rows and the columns [name, age, date, hour]. I need to partition this DataFrame by the distinct values of date. Say there are 20 distinct date values in these 100 rows; I then need to spawn 20 parallel Hive queries, where each query joins one of these partitions with a Hive table. The Hive table [dept, couse, date] is partitioned by the date field.

The Hive table is huge, so I need to break the join into multiple smaller joins and then aggregate the results. Any recommendations on how I can achieve this?

Answer 1

You can do this in a single query. Repartition both DataFrames on date and join them. During the join, broadcast your first table, the one that holds the small data (~10 MB). Here is an example:

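from pyspark.sql import functions as F

# df1 is the large DataFrame (the Hive table: dept, couse, date).
# df2 is the smaller DataFrame; in your case it is name, age, date, hour.
df3 = df1.repartition("date").join(
    F.broadcast(df2.repartition("date")),  # ship the small side to every executor
    "date"
)
# Now perform any operation on df3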
