简体   繁体   中英

Aliases and Group By Statements in SAS Proc SQL

I am working with proc SQL in SAS and one of my proc sql queries is behaving very oddly:

I have a large data set (about 1 Million rows), which looks something like this:

apple_key    profit    price    cost    months    date      
golden_d     0.03      12       4       3         01/12
golden_d     0.03      8        0       2         01/12
granny_s     0.05      15       5       5         02/12
red_d        0.04      13       0       1         01/12
golden_d     0.02      1        2       12         03/14

On this data set I am running the following query:

%let picking_date = 01/12; /* I simplify here - this part of my code definitely works */

proc sql; 
    CREATE TABLE output AS 
    SELECT 
        (CASE apple_key
              WHEN "golden_d" THEN 1
              WHEN "granny_s" THEN 2
              WHEN "red_d"    THEN 3
        END) AS apple_id,
        apple_key AS apple_name,
        (CASE WHEN cost= 0 THEN 0 
            ELSE 1 
        END) AS cost_flag,
        (CASE 
            WHEN CEIL(months / 2) < 5 THEN CEIL(months / 2) 
            ELSE 5
        END) AS age, 
        "McDonalds" as farm, 
        sum(profit*price)/sum(price) as price_weighted_profit
    FROM input_table
    WHERE date = "&picking_date."d
        AND price > cost
        AND cost >= 0
        AND cost >= 0
    GROUP BY apple_id, apple_name, cost_flag, age, farm
    ; 
run; 

When I run this my GROUP BY statement does not work. I get a bunch of entries for a single group (where apple_id, apple_name, cost_flag, age and farm are all the same, but my aggregation is not working).

However, when I run the GROUP BY separately (as follows) everything works perfectly fine. I get one entry for each group with a "price weighted profit":

proc sql; 
    CREATE TABLE output_tmp AS 
    SELECT 
        (CASE apple_key
              WHEN "golden_d" THEN 1
              WHEN "granny_s" THEN 2
              WHEN "red_d"    THEN 3
        END) AS apple_id,
        apple_key AS apple_name,
        (CASE WHEN cost= 0 THEN 0 
            ELSE 1 
        END) AS cost_flag,
        (CASE 
            WHEN CEIL(months / 2) < 5 THEN CEIL(months / 2) 
            ELSE 5
        END) AS age, 
        "McDonalds" as farm
    FROM input_table
    WHERE date = "&picking_date."d
        AND price > cost
        AND cost >= 0
        AND cost >= 0
   ;

    CREATE TABLE output AS
    SELECT 
        apple_id, 
        apple_name, 
        cost_flag, 
        age, 
        farm,
        sum(profit*price)/sum(price) as price_weighted_profit
    FROM output_tmp
    GROUP BY apple_id, apple_name, cost_flag, age, farm
    ;
quit;

Why is this happening? How can I fix it? This is driving me a little bit crazy... Thanks up front for the help

It doesn't work because group by is not taking sum(profit*price)/sum(price) statement as aggregated function. it does not do that because of aliases like age, cost_flag etc.

Anyways below is the right query:-

 Proc sql;
    CREATE TABLE output AS 
     SELECT 
            apple_id, 
            apple_name, 
            cost_flag, 
            age, 
            farm, 
            sum(profit*price)/sum(price) as price_weighted_profit
        FROM
       (
        SELECT 
            (CASE apple_key
                  WHEN "golden_d" THEN 1
                  WHEN "granny_s" THEN 2
                  WHEN "red_d"    THEN 3
            END) AS apple_id,
            apple_key AS apple_name,
            (CASE WHEN cost= 0 THEN 0 
                ELSE 1 
            END) AS cost_flag,
            (CASE 
                WHEN CEIL(months / 2) < 5 THEN CEIL(months / 2) 
                ELSE 5
            END) AS age, 
            "McDonalds" as farm
        FROM input_table
        WHERE date = "&picking_date."d
            AND price > cost
            AND cost >= 0
            AND cost >= 0

        ) a
        GROUP BY apple_id, apple_name, cost_flag, age, farm;
        quit;

Let me know if you have any questions

Thumb rule :- Whenever you use any aggregate functions in your select clause the remaining columns should be part of the group by. In your question you posted you are apply sum(profit*price)/sum(price) but there is no group by which is causing the problem.

Proc sql;
    CREATE TABLE output AS 
        SELECT 
            (CASE apple_key
                  WHEN "golden_d" THEN 1
                  WHEN "granny_s" THEN 2
                  WHEN "red_d"    THEN 3
            END) AS apple_id,
            apple_key AS apple_name,
            (CASE WHEN cost= 0 THEN 0 
                ELSE 1 
            END) AS cost_flag,
            (CASE 
                WHEN CEIL(months / 2) < 5 THEN CEIL(months / 2) 
                ELSE 5
            END) AS age, 
            "McDonalds" as farm, 
            sum(profit*price)/sum(price) as price_weighted_profit
        FROM input_table
        WHERE date = "&picking_date."d
            AND price > cost
            AND cost >= 0
            AND cost >= 0    
        GROUP BY apple_id, apple_name, cost_flag, age, farm;
        quit;

I suspect that what is happening is remerging . SAS proc sql accepts code like this:

proc sql;
    select a.*, count(*)
    from a;

This does not summarize the data. Instead it puts the overall count on each row. In other words, if the keys in the select don't exactly match the group by , then the aggregation functions are calculated based on the group by keys, but the results are put onto the individual rows. Other databases do this with a subset of the window functions.

In your case, remerging is not obvious. I think there is key confusion because you are using the same names in the select as in the original data. My advice is to change the aliases so they are unambiguous and be sure that the keys in the group by are unambiguous.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM