How do I calculate the mode of a string variable within a group in SAS?

I can calculate the mode using a subquery in proc sql, but is this the simplest way to do it? The code below handles ties in the mode by relying on however the max function in proc sql breaks them.

ods html file = "sas_output.html";

data raw_data;
 input cust_id $
       category $
       amount;
datalines;
A red 72.83
A red 80.22
A blue 0.2
A blue 33.62
A blue 30.63
A green 89.04
B blue 10.29
B red 97.07
B red 68.71
B red 98.2
B red 1.79
C green 92.94
C green 0.96
C red 15.23
D red 49.94
D blue 75.82
E blue 20.97
E blue 78.49
F green 87.92
F green 32.29
;
run;

proc sql;
  create table modes as 
    select cust_id, mean(amount) as mean_amount, category as mode_category
    from (
            select *, count(1) as count from raw_data group by cust_id, category
         )
    group by cust_id
    having count=max(count)
    order by cust_id;
quit;

data modes;
    set modes;
    by cust_id;
    if first.cust_id then output;
run;

data final_data;
    merge raw_data modes;
    by cust_id;
run;

proc print data=final_data noobs;
    title "final data";
run;

ods html close;
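
Whether ties matter for a given data set can be checked directly. The following is a minimal sketch, not part of the original post; the table and column names (cust_cat_counts, tie_check, n, n_modes) are made up for illustration. It counts, per customer, how many categories share the highest frequency and keeps only customers where that number exceeds one. It relies on the same proc sql remerging behaviour as the query above, where `having count=max(count)` keeps every tied category and the later `first.cust_id` step is what picks one of them.

```sas
proc sql;
  /* One row per customer and category, with its frequency. */
  create table cust_cat_counts as
    select cust_id, category, count(*) as n
    from raw_data
    group by cust_id, category;

  /* Inner query: keep the categories that reach the maximum frequency
     within each customer (proc sql remerges max(n) onto each row).
     Outer query: keep customers with more than one such category. */
  create table tie_check as
    select cust_id, count(*) as n_modes
    from (
            select cust_id, category, n
            from cust_cat_counts
            group by cust_id
            having n = max(n)
         )
    group by cust_id
    having count(*) > 1;
quit;
```

With the sample data above, customer D (one red row, one blue row) is the case to look at, since its two categories occur equally often.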

I tried to use proc means like this:

proc means data=raw_data;
    class cust_id;
    var category;
    output out=modes mode=mode_category;
run;

but I get the error "Variable category in list does not match type prescribed for this list", because proc means doesn't support character variables as analysis variables.
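
One possible way around this limitation, sketched here as an assumption rather than as part of the original post, is to let proc freq do the counting, since it accepts character variables, and then keep the most frequent category per customer. The data set names cat_counts and modes_freq are hypothetical. Note that this sketch breaks ties in favour of the alphabetically lowest category, which is a choice made here for illustration.

```sas
/* Count each category within each customer.
   The OUT= data set contains cust_id, category, COUNT and PERCENT. */
proc freq data=raw_data noprint;
  tables cust_id*category / out=cat_counts(drop=percent);
run;

/* Put the most frequent category first within each customer;
   equal counts fall back to alphabetical order of category. */
proc sort data=cat_counts;
  by cust_id descending count category;
run;

/* The first row of each BY group now holds the mode. */
data modes_freq(keep=cust_id category rename=(category=mode_category));
  set cat_counts;
  by cust_id;
  if first.cust_id;
run;
```

The resulting modes_freq data set has one row per cust_id and could be merged back onto raw_data by cust_id in the same way as the modes table above.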

SQL is certainly a fine way to do it. Here's a data step solution with a double DOW loop. You can also calculate the mean in the same step with this method if you want.

Sort the data so that it can be processed in BY groups.

proc sort data=raw_data;
  by cust_id category;
run;

The first loop reads each cust_id group once, counting the occurrences of each category within it. The variable maxfreq stores the highest count seen so far, and the variable mode keeps the category that has that count. Because the data is sorted by category, a tie is resolved in favour of the alphabetically highest category: a later category with the same count overwrites the earlier one. For customer D, who has one blue row and one red row, the mode is therefore red.

The second loop reads the same group again and outputs every row along with the mode found in the first loop.

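data want(drop=freq maxfreq);
  /* First pass over one cust_id group: find the modal category. */
  do until (last.cust_id);
    set raw_data;
    by cust_id category;
    if first.category then freq=0;   /* restart the count for each category */
    freq+1;
    maxfreq=max(freq,maxfreq);       /* max() ignores the initial missing value */
    if freq=maxfreq then mode=category;
  end;
  /* Second pass over the same group: write each row with its mode. */
  do until (last.cust_id);
    set raw_data;
    by cust_id;
    output;
  end;
run;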
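
The remark above about computing the mean in the same step could look roughly like the following. This is a sketch under assumptions, not the answerer's code: the names want_mean, total, n and mean_amount are made up for illustration, and raw_data is assumed to be sorted by cust_id and category as shown earlier.

```sas
proc sort data=raw_data;
  by cust_id category;
run;

data want_mean(drop=freq maxfreq total n);
  /* First pass: find the mode and accumulate the sum and the count
     of non-missing amounts for the current cust_id group. */
  do until (last.cust_id);
    set raw_data;
    by cust_id category;
    if first.category then freq=0;
    freq+1;
    maxfreq=max(freq,maxfreq);
    if freq=maxfreq then mode=category;
    total=sum(total,amount);          /* sum() ignores missing values */
    n=sum(n,not missing(amount));     /* adds 1 only for non-missing amounts */
  end;
  /* Leave mean_amount missing if the group has no non-missing amounts. */
  if n>0 then mean_amount=total/n;
  /* Second pass: write every row with the group-level results attached. */
  do until (last.cust_id);
    set raw_data;
    by cust_id;
    output;
  end;
run;
```

Because total, n and maxfreq are assigned rather than retained, they are reset to missing at the start of each data step iteration, which here means once per cust_id group.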
