
How to do conditional count based on row value in SAS/SQL?

Re-uploading since there were some problems with my last post, and I did not know that we were supposed to post sample data. I'm fairly new to SAS, and I have a problem that I know how to solve in Excel but not in SAS. However, the dataset is too large to reasonably use in Excel.

I have four variables: id, year_start, group_name, test_score.

Sample data:

id     year_start     group_name     test_score
1       19931231          Red            90
1       19941230          Red            89
1       19951231          Red            91
1       19961231          Red            92
2       19930630          Red            85
2       19940629          Red            87
2       19950630          Red            95
3       19950931          Blue           90
3       19960931          Blue           90
4       19930331          Red            95
4       19940331          Red            97
4       19950330          Red            98
4       19960331          Red            95
5       19931231          Red            96
5       19941231          Red            97

My goal is to achieve a ranked list (fractional) by test_score for each year. I hoped that I would be able to achieve this using PROC RANK with the FRACTION option. This procedure would compute the order by test_score (highest is 1, 2nd highest is 2, and so on) and then divide by the total number of observations to provide a fractional rank.

Unfortunately, year_start differs widely from row to row. For each id/year combo, I want to perform a one-year look-back from year_start, and rank that observation against all other ids that have a year_start in that one-year range. I'm not interested in comparing by calendar year; the rank of each id should be relative to its own year_start. Adding another level of complication, I would like this ranking to be performed by group_name.

PROC SQL is totally fine if someone has a SQL solution.

Using the above data, the ranks would be like this:

id     year_start     group_name     test_score     rank
1       19931231          Red            90         0.75
1       19941230          Red            89          0.8
1       19951231          Red            91           1
1       19961231          Red            92           1
2       19930630          Red            85           1
2       19940629          Red            87          0.8
2       19950630          Red            95         0.75
3       19950931          Blue           90           1
3       19960931          Blue           90           1
4       19930331          Red            95           1
4       19940331          Red            97          0.2
4       19950330          Red            98          0.2
4       19960331          Red            95         0.333
5       19931231          Red            96         0.25
5       19941231          Red            97         0.667

In order to calculate the rank for row 1:

  • We first exclude Blue observations.
  • Then we count the number of observations that fall within the year before that row's year_start, 19931231 (so we have 4 observations).
  • We count how many of these observations have a higher test_score, and then add 1 to find the order of the current observation (so it is the 3rd highest).
  • Then we divide the order by the total number to get the rank (3/4 = 0.75).
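The four steps above can be sketched outside SAS as a quick cross-check. This is a minimal Python sketch, not the asker's method: the helper names (`parse`, `one_year_before`, `fractional_rank`) are hypothetical, the look-back window is assumed to be open at the start and closed at the end, and the two 0931 dates are assumed to mean 0930.

```python
from datetime import date, datetime

# Sample rows from the question: (id, year_start, group_name, test_score).
# Assumption: the two 0931 dates are entered as 0930, since September has 30 days.
ROWS = [
    (1, "19931231", "Red", 90), (1, "19941230", "Red", 89),
    (1, "19951231", "Red", 91), (1, "19961231", "Red", 92),
    (2, "19930630", "Red", 85), (2, "19940629", "Red", 87),
    (2, "19950630", "Red", 95),
    (3, "19950930", "Blue", 90), (3, "19960930", "Blue", 90),
    (4, "19930331", "Red", 95), (4, "19940331", "Red", 97),
    (4, "19950330", "Red", 98), (4, "19960331", "Red", 95),
    (5, "19931231", "Red", 96), (5, "19941231", "Red", 97),
]


def parse(yyyymmdd: str) -> date:
    """Turn a yyyymmdd string into a date."""
    return datetime.strptime(yyyymmdd, "%Y%m%d").date()


def one_year_before(d: date) -> date:
    """Same calendar date one year earlier (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        return d.replace(year=d.year - 1, day=28)


def fractional_rank(row, rows):
    _, start, group, score = row
    end = parse(start)
    begin = one_year_before(end)
    # Steps 1-2: keep the same group, with year_start in the window (begin, end].
    window = [r for r in rows if r[2] == group and begin < parse(r[1]) <= end]
    # Step 3: order = 1 + number of strictly higher scores in the window.
    order = 1 + sum(1 for r in window if r[3] > score)
    # Step 4: divide the order by the window size.
    return order / len(window)


print(fractional_rank(ROWS[0], ROWS))  # 0.75
```

Whether the start of the window should be open or closed is a choice the question leaves to the reader; changing `begin <` to `begin <=` switches it.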

In Excel, the formula for this variable would look something like this. Assume the formula is for row 1 and there are 100 rows, with id=A, year_start=B, group_name=C, and test_score=D:

      =(1+countifs(D1:D100,">"&D1, 
                B1:B100,"<="&B1,
                B1:B100,">"&B1-365.25,
                C1:C100, C1))/
       countifs(B1:B100,"<="&B1,
                B1:B100,">"&B1-365.25,
                C1:C100, C1) 

Thanks so much for the help!

ahammond428

Your example isn't correct if I'm reading it correctly, so it's hard to know exactly what you're trying to do. But try the following and see if it works. You may need to tweak the inequalities to be open or closed, depending on whether you want to include the date exactly one year earlier. Note that your year_start column needs to be imported as a SAS date for this to work. Otherwise you can convert it with input(year_start, yymmdd8.) if it was read as a character string; a numeric column would first need to be turned into text, for example with put(year_start, 8.).

proc sql;
select distinct
    a.id,
    a.year_start,
    a.group_name,
    a.test_score,
    /* order = 1 + number of strictly higher scores in the window */
    1+sum(case when b.test_score > a.test_score then 1 else 0 end) as rank_num,
    /* number of rows in the window, including the current row */
    count(b.id) as rank_denom,
    calculated rank_num / calculated rank_denom as rank
from testdata a left join testdata b
    on a.group_name = b.group_name
    /* window: same date one year earlier through year_start, both ends included */
    and intnx('year',a.year_start,-1,'s') le b.year_start le a.year_start
group by a.id, a.year_start, a.group_name, a.test_score
order by id, year_start;
quit;

Note that I changed the dates of 9/31 to 9/30 (since there is no 9/31), but left 3/30, 6/29, and 12/30 alone since perhaps that was intended, though the other dates seem to be quarter-end.
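Invalid dates such as 9/31 can be caught before the data reaches SAS. The following is a minimal Python sketch of that check, assuming the dates arrive as yyyymmdd integers; `parse_yyyymmdd` is a hypothetical helper name, not part of any answer here.

```python
from datetime import datetime


def parse_yyyymmdd(value):
    """Return a date for a yyyymmdd integer, or None if it is not a real calendar date."""
    try:
        return datetime.strptime(str(value), "%Y%m%d").date()
    except ValueError:
        return None


print(parse_yyyymmdd(19950930))  # 1995-09-30
print(parse_yyyymmdd(19950931))  # None
```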

Consider correlated count subqueries in SQL:

DATA

data ranktable;   
    infile datalines missover;  
    input id year_start group_name $ test_score; 
    datalines; 
1       19931231          Red            90
1       19941230          Red            89
1       19951231          Red            91
1       19961231          Red            92
2       19930630          Red            85
2       19940629          Red            87
2       19950630          Red            95
3       19950930          Blue           90
3       19960930          Blue           90
4       19930331          Red            95
4       19940331          Red            97
4       19950330          Red            98
4       19960331          Red            95
5       19931231          Red            96
5       19941231          Red            97
; 
run;

data ranktable;
    set ranktable;          
    format year_start date9.;
    year_start = input(put(year_start,z8.),yymmdd8.);
run;

PROC SQL

Additional fields are included for your review.

proc sql;
    select r.id, r.year_start, r.group_name, r.test_score, 
           put(intnx('year', r.year_start, -1, 's'), yymmdd10.) as year_ago,
           /* rows in the window scoring at least as high (counts the row itself and ties) */
           (select count(*) from ranktable sub 
            where sub.test_score >= r.test_score
            and sub.group_name = r.group_name
            and sub.year_start <= r.year_start
            and sub.year_start >= intnx('year', r.year_start, -1, 's')) as num_rank,    
           /* all rows in the same group and window */
           (select count(*) from ranktable sub 
            where sub.group_name = r.group_name
            and sub.year_start <= r.year_start
            and sub.year_start >= intnx('year', r.year_start, -1, 's')) as denom_rank,    
           calculated num_rank / calculated denom_rank as rank
    from ranktable r;
quit;
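One point worth checking in the query above is tie handling. Counting scores with `>=` includes the row itself plus any tied scores, whereas the question's rule (1 plus the number of strictly higher scores) gives tied rows the better rank. The Python sketch below illustrates this on the three Red rows that fall in the window ending 1996-03-31 for id 4, assuming the same inclusive one-year look-back as the query; the variable names are illustrative only.

```python
# Red scores with year_start between 1995-03-31 and 1996-03-31 (inclusive):
# id 1 (1995-12-31), id 2 (1995-06-30), id 4 (1996-03-31)
window_scores = [91, 95, 95]
current = 95  # score of the row being ranked (id 4, 1996-03-31)

strict = 1 + sum(s > current for s in window_scores)   # question's rule
inclusive = sum(s >= current for s in window_scores)   # rule used in the query above

n = len(window_scores)
print(round(strict / n, 3), round(inclusive / n, 3))  # 0.333 0.667
```

When no scores are tied, the two rules give the same number.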

OUTPUT

You will notice slight differences from your expected results. These may be due to the fixed 365.25-day year you apply to every row, whereas SAS's intnx steps back one full calendar year, whose length in days varies from year to year.
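A small Python sketch can make this difference concrete. It assumes year_start holds true dates in both tools, and compares the Excel rule (strictly less than 365.25 days back) with an inclusive same-date-one-year-earlier rule for one pair of sample dates; the variable names are illustrative only.

```python
from datetime import date

end = date(1996, 12, 31)        # year_start of the row being ranked
candidate = date(1995, 12, 31)  # an earlier row in the same group

gap_days = (end - candidate).days  # 1996 is a leap year
in_fixed_window = gap_days < 365.25                                # Excel: B > B1 - 365.25
in_calendar_window = candidate >= end.replace(year=end.year - 1)   # same date one year earlier, inclusive

print(gap_days, in_fixed_window, in_calendar_window)  # 366 False True
```

So for the row with year_start 1996-12-31, the 1995-12-31 observation is counted under the calendar-year rule but not under the 365.25-day rule.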

[Image: PROC SQL output]
