
Hive is not supporting a subquery; I need to get the max date from a self-join table

SELECT
    a.FIRST_NAME, a.LAST_NAME, a.MANAGER, a.MANAGER_CODE, a.EMPLOYEE_CODE, 
    b.FIRST_NAME AS MANAGER_FN, b.LAST_NAME AS MANAGER_LN, 
    c.FIRST_NAME AS DIRECTOR_FN, c.LAST_NAME AS DIRECTOR_LN
FROM
    emp_table a 
LEFT JOIN
    (SELECT DISTINCT manager_code, employee_code, first_name, last_name, end_date 
     FROM emp_table) AS b ON a.manager_code = b.employee_code
                          AND b.end_date = (SELECT MAX(end_date) FROM emp_table)
LEFT JOIN 
    (SELECT DISTINCT manager_code,employee_code, first_NAME, LAST_NAME, END_DATE 
     FROM emp_table) AS c ON b.manager_code = c.employee_code 
                          AND c.END_DATE = (SELECT MAX(end_date) FROM emp_table);

Error

Error while compiling statement: FAILED: SemanticException org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSubquerySemanticException: 0:0
Unsupported SubQuery Expression: currently SubQuery expressions are only allowed as Where and Having Clause predicates.
Error encountered near token 'end_date'

I think it is failing on the AND part of the LEFT JOIN conditions in the query.

You can calculate the max date using an analytic function over the same data set, without grouping, and then filter on it:

SELECT
    a.FIRST_NAME, a.LAST_NAME, a.MANAGER, a.MANAGER_CODE, a.EMPLOYEE_CODE, 
    b.FIRST_NAME AS MANAGER_FN, b.LAST_NAME AS MANAGER_LN, 
    c.FIRST_NAME AS DIRECTOR_FN, c.LAST_NAME AS DIRECTOR_LN
FROM
    emp_table a 
LEFT JOIN
    (SELECT DISTINCT manager_code, employee_code, first_name, last_name, end_date
       FROM
           (SELECT manager_code, employee_code, first_name, last_name, end_date,
                   max(end_date) over() as max_end_date
              FROM emp_table) b 
      WHERE b.end_date = b.max_end_date  --Filter
     ) b ON a.manager_code = b.employee_code
LEFT JOIN 
    (SELECT DISTINCT manager_code,employee_code, first_NAME, LAST_NAME, END_DATE
       FROM
           (SELECT manager_code, employee_code, first_NAME, LAST_NAME, END_DATE,
                   MAX(end_date) over() as max_end_date 
            FROM emp_table) c
      WHERE c.end_date = c.max_end_date  --Filter
    ) c ON b.manager_code = c.employee_code;

In general, it simplifies the query to split getting the latest records out into its own subquery.
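
For example, a minimal sketch of that idea (keeping the same global MAX(end_date) filter; the CTE name latest_emp is just illustrative, and it assumes your Hive version supports WITH clauses):

WITH latest_emp AS (
    SELECT DISTINCT manager_code, employee_code, first_name, last_name, end_date
    FROM
        (SELECT manager_code, employee_code, first_name, last_name, end_date,
                MAX(end_date) OVER () AS max_end_date
           FROM emp_table) t
    WHERE end_date = max_end_date
)
SELECT
    a.first_name, a.last_name, a.manager, a.manager_code, a.employee_code,
    b.first_name AS manager_fn, b.last_name AS manager_ln,
    c.first_name AS director_fn, c.last_name AS director_ln
FROM emp_table a
LEFT JOIN latest_emp b ON a.manager_code = b.employee_code
LEFT JOIN latest_emp c ON b.manager_code = c.employee_code;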

If you are using PySpark, since it is tagged here, I would suggest using the DataFrame API to handle your joins and transformations. Overall, it has been found that the DataFrame API performs better than using spark.sql with complex queries.

Sample source data

{"first_name":"alfred", "last_name":"pennyworth", "manager":"dick grayson", "manager_code":"2", "employeecode":"1", "end_date":"12/1/2021"}
{"first_name":"alfred", "last_name":"pennyworth", "manager":"dick grayson", "manager_code":"2", "employeecode":"1", "end_date":"12/2/2021"}
{"first_name":"dick", "last_name":"grayson", "manager":"bruce wayne", "manager_code":"3", "employeecode":"2", "end_date":"12/2/2021"}
{"first_name":"bruce", "last_name":"wayne", "manager":"bob kane", "manager_code":"4", "employeecode":"3", "end_date":"12/2/2021"}

Sample code

from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

def get_current_records(df, partition_columns, order_column):
    """Removes duplicate records from a spark dataframe and returns the most recent record based on the given partition and order by columns

    Args:
        df (spark dataframe): The dataframe we want to get the most recent records for 
        partition_columns (list): A list of columns to be used for getting the most recent record for. ( The group by columns)
        order_column (string): The column to order by

    Returns:
        spark dataframe: Dataframe that has had the logic applied to it
    """
    df = df.dropDuplicates() 
    df = df.withColumn("rn", row_number().over(Window.partitionBy(partition_columns).orderBy(col(order_column).desc())))
    df = df.filter(col("rn") == 1).drop("rn")
    return df 

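# Assumption: spark is an existing SparkSession (e.g. the one provided by a pyspark shell or notebook).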
employee_df = spark.read.json('tests/data/loan_etl/emp.json')
        
employee_df = get_current_records(employee_df, ['employeecode'], 'end_date')

employee_df = employee_df.alias('emp')\
            .join(employee_df.alias('manager'), (col('emp.manager_code') == col('manager.employeecode')), how='left')\
            .join(employee_df.alias('director'), (col('manager.manager_code') == col('director.employeecode')), how='left')\
            .select(\
                col('emp.first_name'),\
                col('emp.last_name'),\
                col('emp.manager'),\
                col('emp.manager_code'),\
                col('emp.employeecode'),\
                col('manager.first_name').alias('manager_first_name'),\
                col('manager.last_name').alias('manager_last_name'),\
                col('director.first_name').alias('director_first_name'),\
                col('director.last_name').alias('director_last_name')\
            )

employee_df.orderBy(col('emp.employeecode')).show(10,False)



This produces the following output:

+----------+----------+------------+------------+------------+------------------+-----------------+-------------------+------------------+
|first_name|last_name |manager     |manager_code|employeecode|manager_first_name|manager_last_name|director_first_name|director_last_name|
+----------+----------+------------+------------+------------+------------------+-----------------+-------------------+------------------+
|alfred    |pennyworth|dick grayson|2           |1           |dick              |grayson          |bruce              |wayne             |
|dick      |grayson   |bruce wayne |3           |2           |bruce             |wayne            |null               |null              |
|bruce     |wayne     |bob kane    |4           |3           |null              |null             |null               |null              |
+----------+----------+------------+------------+------------+------------------+-----------------+-------------------+------------------+

A similar process can be handled using only Hive, by leveraging a Hive view that gets the latest records, or by persisting the latest records.
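
For instance, a rough sketch of that Hive-only option (the view name current_emp is purely illustrative and assumes the same emp_table layout; it keeps the most recent record per employee, mirroring get_current_records above):

CREATE VIEW current_emp AS
SELECT manager_code, employee_code, first_name, last_name, end_date
FROM
    (SELECT manager_code, employee_code, first_name, last_name, end_date,
            ROW_NUMBER() OVER (PARTITION BY employee_code ORDER BY end_date DESC) AS rn
       FROM emp_table) t
WHERE rn = 1;

-- the self joins can then read from the view instead of repeating the window logic
SELECT e.first_name, e.last_name,
       m.first_name AS manager_first_name, m.last_name AS manager_last_name,
       d.first_name AS director_first_name, d.last_name AS director_last_name
FROM current_emp e
LEFT JOIN current_emp m ON e.manager_code = m.employee_code
LEFT JOIN current_emp d ON m.manager_code = d.employee_code;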

The logic in the example also has an issue when trying to label the hierarchy, since managers will also appear as directors, but that is a separate problem.
