Hive does not support this subquery; I need to get the max date from a self-joined table
SELECT
    a.FIRST_NAME, a.LAST_NAME, a.MANAGER, a.MANAGER_CODE, a.EMPLOYEE_CODE,
    b.FIRST_NAME AS MANAGER_FN, b.LAST_NAME AS MANAGER_LN,
    c.FIRST_NAME AS DIRECTOR_FN, c.LAST_NAME AS DIRECTOR_LN
FROM
    emp_table a
LEFT JOIN
    (SELECT DISTINCT manager_code, employee_code, first_name, last_name, end_date
     FROM emp_table) AS b ON a.manager_code = b.employee_code
    AND b.end_date = (SELECT MAX(end_date) FROM emp_table)
LEFT JOIN
    (SELECT DISTINCT manager_code, employee_code, first_NAME, LAST_NAME, END_DATE
     FROM emp_table) AS c ON b.manager_code = c.employee_code
    AND c.END_DATE = (SELECT MAX(end_date) FROM emp_table);
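Error:
Error while compiling statement: FAILED: SemanticException org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSubquerySemanticException: 0:0
Unsupported SubQuery Expression Currently SubQuery expressions are only allowed as Where and Having Clause predicates.
Error encountered near token 'end_date'
I think it is failing at the AND part of the LEFT JOIN condition in the query.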
You can use an analytic function to compute the max date over the same dataset without grouping, and then filter on it.
SELECT
    a.FIRST_NAME, a.LAST_NAME, a.MANAGER, a.MANAGER_CODE, a.EMPLOYEE_CODE,
    b.FIRST_NAME AS MANAGER_FN, b.LAST_NAME AS MANAGER_LN,
    c.FIRST_NAME AS DIRECTOR_FN, c.LAST_NAME AS DIRECTOR_LN
FROM
    emp_table a
LEFT JOIN
    (SELECT DISTINCT manager_code, employee_code, first_name, last_name, end_date
     FROM
        (SELECT manager_code, employee_code, first_name, last_name, end_date,
                MAX(end_date) OVER() AS max_end_date
         FROM emp_table) s
     WHERE s.end_date = s.max_end_date -- Filter: keep only rows at the overall max end_date
    ) b ON a.manager_code = b.employee_code
LEFT JOIN
    (SELECT DISTINCT manager_code, employee_code, first_name, last_name, end_date
     FROM
        (SELECT manager_code, employee_code, first_name, last_name, end_date,
                MAX(end_date) OVER() AS max_end_date
         FROM emp_table) s
     WHERE s.end_date = s.max_end_date -- Filter: keep only rows at the overall max end_date
    ) c ON b.manager_code = c.employee_code
;
In general, splitting the latest-record lookup out into its own subquery simplifies the query.
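The window-function filter can be tried without a Hive cluster. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption: SQLite 3.25 or later is needed for window functions, and Hive may differ in details). The rows are illustrative, and end_date is written as ISO yyyy-mm-dd strings so that string comparison matches date order, which is also an assumption.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp_table (employee_code TEXT, manager_code TEXT, "
    "first_name TEXT, last_name TEXT, end_date TEXT)"
)
# Illustrative rows; dates are ISO strings so MAX() on text follows date order.
rows = [
    ("1", "2", "alfred", "pennyworth", "2021-12-01"),
    ("1", "2", "alfred", "pennyworth", "2021-12-02"),
    ("2", "3", "dick", "grayson", "2021-12-02"),
    ("3", "4", "bruce", "wayne", "2021-12-02"),
]
conn.executemany("INSERT INTO emp_table VALUES (?, ?, ?, ?, ?)", rows)

# The window function attaches the overall max end_date to every row, so the
# filter becomes an ordinary WHERE predicate instead of a subquery in ON.
latest_sql = """
SELECT employee_code, manager_code, first_name, last_name, end_date
FROM (
    SELECT e.*, MAX(end_date) OVER () AS max_end_date
    FROM emp_table e
) s
WHERE s.end_date = s.max_end_date
ORDER BY employee_code
"""
latest = conn.execute(latest_sql).fetchall()
for row in latest:
    print(row)
```

Only the rows whose end_date equals the overall maximum survive, so the older duplicate for employee 1 is dropped before any join takes place.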
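If you are using pyspark, as the question is tagged, I would recommend using the DataFrame API to handle the joins and transformations. Overall, the DataFrame API has been found to perform better than spark.sql for complex queries.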
{"first_name":"alfred", "last_name":"pennyworth", "manager":"dick grayson", "manager_code":"2", "employeecode":"1", "end_date":"12/1/2021"}
{"first_name":"alfred", "last_name":"pennyworth", "manager":"dick grayson", "manager_code":"2", "employeecode":"1", "end_date":"12/2/2021"}
{"first_name":"dick", "last_name":"grayson", "manager":"bruce wayne", "manager_code":"3", "employeecode":"2", "end_date":"12/2/2021"}
{"first_name":"bruce", "last_name":"wayne", "manager":"bob kane", "manager_code":"4", "employeecode":"3", "end_date":"12/2/2021"}
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number


def get_current_records(df, partition_columns, order_column):
    """Removes duplicate records from a spark dataframe and returns the most recent record based on the given partition and order by columns
    Args:
        df (spark dataframe): The dataframe we want to get the most recent records for
        partition_columns (list): A list of columns to be used for getting the most recent record for. (The group by columns)
        order_column (string): The column to order by
    Returns:
        spark dataframe: Dataframe that has had the logic applied to it
    """
    # Drop exact duplicate rows first
    df = df.dropDuplicates()
    # Number the rows within each partition, newest first
    df = df.withColumn("rn", row_number().over(Window.partitionBy(partition_columns).orderBy(col(order_column).desc())))
    # Keep only the first (most recent) row of each partition
    df = df.filter(col("rn") == 1).drop("rn")
    return df
employee_df = spark.read.json('tests/data/loan_etl/emp.json')
employee_df = get_current_records(employee_df, ['employeecode'], 'end_date')
employee_df = employee_df.alias('emp')\
    .join(employee_df.alias('manager'), (col('emp.manager_code') == col('manager.employeecode')), how='left')\
    .join(employee_df.alias('director'), (col('manager.manager_code') == col('director.employeecode')), how='left')\
    .select(
        col('emp.first_name'),
        col('emp.last_name'),
        col('emp.manager'),
        col('emp.manager_code'),
        col('emp.employeecode'),
        col('manager.first_name').alias('manager_first_name'),
        col('manager.last_name').alias('manager_last_name'),
        col('director.first_name').alias('director_first_name'),
        col('director.last_name').alias('director_last_name')
    )
employee_df.orderBy(col('emp.employeecode')).show(10, False)
This produces the following output:
+----------+----------+------------+------------+------------+------------------+-----------------+-------------------+------------------+
|first_name|last_name |manager     |manager_code|employeecode|manager_first_name|manager_last_name|director_first_name|director_last_name|
+----------+----------+------------+------------+------------+------------------+-----------------+-------------------+------------------+
|alfred    |pennyworth|dick grayson|2           |1           |dick              |grayson          |bruce              |wayne             |
|dick      |grayson   |bruce wayne |3           |2           |bruce             |wayne            |null               |null              |
|bruce     |wayne     |bob kane    |4           |3           |null              |null             |null               |null              |
+----------+----------+------------+------------+------------+------------------+-----------------+-------------------+------------------+
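One caveat about the sample data above: end_date is stored as an M/D/YYYY string, so ordering by it (or taking its MAX) compares text rather than dates. That happens to work for the four sample rows, but it can pick the wrong record once days have different digit counts. The sketch below shows the effect in plain Python with two hypothetical values that are not in the sample file; in pyspark, converting the column first, for example with to_date and an 'M/d/yyyy' pattern, should avoid this, although that is a suggestion and not part of the original answer.

```python
from datetime import datetime

# Hypothetical values in the same M/D/YYYY string format as the sample data.
end_dates = ["12/2/2021", "12/10/2021"]

# Plain string comparison goes character by character, so "12/2/2021"
# sorts after "12/10/2021" even though it is the earlier date.
latest_as_string = max(end_dates)

# Parsing the strings first compares real dates.
latest_as_date = max(end_dates, key=lambda s: datetime.strptime(s, "%m/%d/%Y"))

print(latest_as_string)
print(latest_as_date)
```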
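A similar process can be handled with Hive alone, either by using a Hive view that selects the latest records or by persisting the latest records.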
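As a rough illustration of the view approach, the sketch below defines a view that keeps one row per employee (the one with the latest end_date) and then self-joins that view. It again uses Python's sqlite3 module as a stand-in for Hive, which is an assumption: the view name emp_latest is hypothetical, the dates are rewritten as ISO strings, and the equivalent Hive DDL may need adjusting.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp_table (employee_code TEXT, manager_code TEXT, "
    "first_name TEXT, last_name TEXT, end_date TEXT)"
)
conn.executemany(
    "INSERT INTO emp_table VALUES (?, ?, ?, ?, ?)",
    [
        ("1", "2", "alfred", "pennyworth", "2021-12-01"),
        ("1", "2", "alfred", "pennyworth", "2021-12-02"),
        ("2", "3", "dick", "grayson", "2021-12-02"),
        ("3", "4", "bruce", "wayne", "2021-12-02"),
    ],
)

# Hypothetical view: one row per employee_code, the one with the latest end_date.
conn.execute("""
CREATE VIEW emp_latest AS
SELECT employee_code, manager_code, first_name, last_name, end_date
FROM (
    SELECT e.*,
           ROW_NUMBER() OVER (PARTITION BY employee_code
                              ORDER BY end_date DESC) AS rn
    FROM emp_table e
) ranked
WHERE rn = 1
""")

# The self-join now needs no subquery in the ON clause.
result = conn.execute("""
SELECT a.first_name, b.first_name AS manager_fn, c.first_name AS director_fn
FROM emp_latest a
LEFT JOIN emp_latest b ON a.manager_code = b.employee_code
LEFT JOIN emp_latest c ON b.manager_code = c.employee_code
ORDER BY a.employee_code
""").fetchall()
for row in result:
    print(row)
```

Unlike MAX(end_date) OVER(), which keeps only rows at the single overall maximum date, ROW_NUMBER partitioned by employee keeps the newest row for each employee even when employees have different latest dates.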
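The logic in the example also has a problem when it tries to label the hierarchy, because a manager would become the director, but that is a separate issue.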