How do I calculate the mode of a string variable within a group in SAS?

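I can calculate the mode using a subquery in proc sql, but is this the simplest way to do it? The query keeps every row whose category count equals the maximum count for that cust_id, so tied categories all survive the having clause. The data step that follows then keeps only the first row per cust_id, which means ties are broken by whichever row proc sql happens to return first.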

ods html file = "sas_output.html";

data raw_data;
 input cust_id $
       category $
       amount;
datalines;
A red 72.83
A red 80.22
A blue 0.2
A blue 33.62
A blue 30.63
A green 89.04
B blue 10.29
B red 97.07
B red 68.71
B red 98.2
B red 1.79
C green 92.94
C green 0.96
C red 15.23
D red 49.94
D blue 75.82
E blue 20.97
E blue 78.49
F green 87.92
F green 32.29
;
run;

proc sql;
  create table modes as 
    select cust_id, mean(amount) as mean_amount, category as mode_category
    from (
            select *, count(1) as count from raw_data group by cust_id, category
         )
    group by cust_id
    having count=max(count)
    order by cust_id;
quit;

data modes;
    set modes;
    by cust_id;
    if first.cust_id then output;
run;

data final_data;
    merge raw_data modes;
    by cust_id;
run;

proc print data=final_data noobs;
    title "final data";
run;

ods html close;

I tried to use proc means like this:

proc means data=raw_data;
    class cust_id;
    var category;
    output out=modes mode=mode_category;
run;

but I get the error "Variable category in list does not match type prescribed for this list", because the var statement in proc means only accepts numeric variables.

SQL is certainly a fine way to do it. Here's a data step solution with a double DOW loop. You can also calculate the mean in the same step with this method if you want.

Sort the data first so that it can be processed in by groups.

proc sort data=raw_data;
  by cust_id category;
run;

The first loop reads one cust_id group, counting the occurrences of each category. The variable maxfreq stores the highest count seen so far, and the variable mode keeps the category with that count. Because the data is sorted by category and a later category overwrites mode whenever its count equals maxfreq, a tie returns the alphabetically last category. For example, customer D has one blue row and one red row, so mode is red.

The second loop reads the same group again and outputs every row along with the mode found by the first loop.

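data want(drop=freq maxfreq);
  /* First pass: count rows per category and track the most frequent one. */
  do until (last.cust_id);
    set raw_data;
    by cust_id category;
    if first.category then freq=0;
    freq+1;
    maxfreq=max(freq,maxfreq);  /* max() ignores the missing initial value */
    if freq=maxfreq then mode=category;
  end;
  /* Second pass: re-read the same group and write each row with its mode. */
  do until (last.cust_id);
    set raw_data;
    by cust_id;
    output;
  end;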
run;
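
To illustrate the remark that the mean can be calculated in the same step, here is a minimal sketch of one way to extend the first loop. The names want_with_mean, total and n are hypothetical, and mean_amount simply echoes the column name from the question. It assumes raw_data is already sorted by cust_id and category, as above.

```sas
/* Sketch only: assumes raw_data is sorted by cust_id and category. */
data want_with_mean(drop=freq maxfreq total n);
  /* First pass: find the mode and accumulate the amount per customer. */
  do until (last.cust_id);
    set raw_data;
    by cust_id category;
    if first.category then freq=0;
    freq+1;
    maxfreq=max(freq,maxfreq);
    if freq=maxfreq then mode=category;
    /* total and n are reset to missing on each data step iteration, */
    /* and sum() ignores missing values, so each customer starts fresh. */
    total=sum(total,amount);
    n=sum(n,not missing(amount));
  end;
  /* Leave mean_amount missing if the customer has no non-missing amounts. */
  if n>0 then mean_amount=total/n;
  /* Second pass: write every row with the customer's mode and mean. */
  do until (last.cust_id);
    set raw_data;
    by cust_id;
    output;
  end;
run;
```

Because mode and mean_amount are not read from raw_data, the second set statement does not overwrite them, so each output row carries the values computed for its customer.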
