How to conditionally count based on row values in SAS/SQL?

Question

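I am re-uploading this because there were some problems with my last post, and I did not know that we were supposed to post sample data. I am fairly new to SAS, and I have a problem that I know how to solve in Excel but not in SAS. However, the dataset is too large to reasonably work with in Excel.

I have four variables: id, year_start, groupname, test_score.

Sample data: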

id     year_start     group_name     test_score
1       19931231          Red            90
1       19941230          Red            89
1       19951231          Red            91
1       19961231          Red            92
2       19930630          Red            85
2       19940629          Red            87
2       19950630          Red            95
3       19950931          Blue           90
3       19960931          Blue           90
4       19930331          Red            95
4       19940331          Red            97
4       19950330          Red            98
4       19960331          Red            95
5       19931231          Red            96
5       19941231          Red            97

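My goal is to get a ranked list (fractional) by test_score for each year. I had hoped to achieve this with PROC RANK and its FRACTION option. That would work out the order by test_score (the highest is 1, the second highest is 2, and so on) and then divide by the total number of observations to give a fractional rank. Unfortunately, year_start differs widely from row to row. For each id/year combination, I want to look back one year from year_start and rank that observation against all other ids that have a year_start within that one-year range. I am not interested in comparing by calendar year; the rank of each id should be relative to its own year_start. Adding another level of complication, I would like this ranking to be performed by groupname.

PROC SQL is totally fine if someone has an SQL solution.

Using the above data, the rankings would look like this: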

id     year_start     group_name     test_score     rank
1       19931231          Red            90         0.75
1       19941230          Red            89          0.8
1       19951231          Red            91           1
1       19961231          Red            92           1
2       19930630          Red            85           1
2       19940629          Red            87          0.8
2       19950630          Red            95         0.75
3       19950931          Blue           90           1
3       19960931          Blue           90           1
4       19930331          Red            95           1
4       19940331          Red            97          0.2
4       19950330          Red            98          0.2
4       19960331          Red            95         0.333
5       19931231          Red            96         0.25
5       19941231          Red            97         0.667

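To calculate the rank for row 1:

We first exclude the Blue observations.
Then we count the number of observations that fall within the year before that row's year_start, 19931231 (so we have 4 observations).
We count how many of these observations have a higher test_score, then add 1 to find the order of the current observation (so it is the third highest).
Finally, we divide the order by the total to get the rank (3/4 = 0.75).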
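As a cross-check of the steps above, here is a minimal sketch in Python (not SAS) that applies them to the sample data. It assumes the look-back window is 365.25 days with a strict lower bound, as in the Excel formula in this post, and it enters the two 9/31 dates as 9/30 because September has only 30 days. The names `ROWS`, `parse` and `fractional_rank` are made up for illustration.

```python
from datetime import datetime, timedelta

# Sample rows from the question: (id, year_start, group_name, test_score).
# Assumption: the two 9/31 dates are entered as 9/30 (there is no 9/31).
ROWS = [
    (1, 19931231, "Red", 90), (1, 19941230, "Red", 89),
    (1, 19951231, "Red", 91), (1, 19961231, "Red", 92),
    (2, 19930630, "Red", 85), (2, 19940629, "Red", 87),
    (2, 19950630, "Red", 95),
    (3, 19950930, "Blue", 90), (3, 19960930, "Blue", 90),
    (4, 19930331, "Red", 95), (4, 19940331, "Red", 97),
    (4, 19950330, "Red", 98), (4, 19960331, "Red", 95),
    (5, 19931231, "Red", 96), (5, 19941231, "Red", 97),
]

def parse(yyyymmdd):
    """Turn an integer such as 19931231 into a datetime."""
    return datetime.strptime(str(yyyymmdd), "%Y%m%d")

def fractional_rank(row, rows, lookback_days=365.25):
    """Return (order, total, order / total) for one row."""
    _, year_start, group, score = row
    end = parse(year_start)
    begin = end - timedelta(days=lookback_days)
    # Keep rows of the same group whose year_start lies in (begin, end].
    window = [r for r in rows
              if r[2] == group and begin < parse(r[1]) <= end]
    # Order = 1 + number of strictly higher scores in the window.
    order = 1 + sum(1 for r in window if r[3] > score)
    return order, len(window), order / len(window)

print(*fractional_rank(ROWS[0], ROWS))  # prints 3 4 0.75
```

The window always contains the row itself, so the denominator is never zero.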

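In Excel, the formula for this variable would look something like the following. Assume the formula is for row 1 and there are 100 rows, with id = A, year_start = B, groupname = C and test_score = D: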

      =(1+countifs(D1:D100,">"&D1, 
                B1:B100,"<="&B1,
                B1:B100,">"&B1-365.25,
                C1:C100, C1))/
       countifs(B1:B100,"<="&B1,
                B1:B100,">"&B1-365.25,
                C1:C100, C1)

Thanks so much for the help!

ahammond428

Answer 1

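If I am reading your example correctly, it is not right, so it is difficult to know exactly what you are trying to do. But try the following and see if it works. You may need to adjust the inequalities to be open or closed, depending on whether you want to include the date exactly one year earlier. Note that your year_start column needs to be imported as a SAS date for this to work. Otherwise, you can convert it with input(year_start, yymmdd8.) (this works on a character value; a numeric value such as 19931231 would first need put(year_start, z8.)).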

proc sql;
select distinct
    a.id,
    a.year_start,
    a.group_name,
    a.test_score,
    1+sum(case when b.test_score > a.test_score then 1 else 0 end) as rank_num,
    count(b.id) as rank_denom,
    calculated rank_num / calculated rank_denom as rank
from testdata a left join testdata b
    on a.group_name = b.group_name
    and intnx('year',a.year_start,-1,'s') le b.year_start le a.year_start
group by a.id, a.year_start, a.group_name, a.test_score
order by id, year_start;
quit;

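Note that I changed the dates from 9/31 to 9/30 (since there is no 9/31), but I left 3/30, 6/29 and 12/30 alone because they may have been intentional, even though the other dates appear to be quarter-ends.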
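Impossible dates like these are easy to miss in a large file. A small Python sketch of a pre-check, assuming the dates arrive as yyyymmdd integers as in the sample data (`invalid_dates` is a hypothetical helper name, not part of SAS):

```python
from datetime import datetime

def invalid_dates(values):
    """Return the yyyymmdd values that are not real calendar dates."""
    bad = []
    for value in values:
        try:
            datetime.strptime(str(value), "%Y%m%d")
        except ValueError:
            # strptime rejects impossible dates such as September 31.
            bad.append(value)
    return bad

print(invalid_dates([19931231, 19950931, 19960931, 19950330]))
# prints [19950931, 19960931]
```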

Answer 2

Consider correlated count subqueries in SQL:

Data

data ranktable;   
    infile datalines missover;  
    input id year_start group_name $ test_score; 
    datalines; 
1       19931231          Red            90
1       19941230          Red            89
1       19951231          Red            91
1       19961231          Red            92
2       19930630          Red            85
2       19940629          Red            87
2       19950630          Red            95
3       19950930          Blue           90
3       19960930          Blue           90
4       19930331          Red            95
4       19940331          Red            97
4       19950330          Red            98
4       19960331          Red            95
5       19931231          Red            96
5       19941231          Red            97
; 
run;

data ranktable;
    set ranktable;          
    format year_start date9.;
    year_start = input(put(year_start,z8.),yymmdd8.);
run;

PROC SQL

Additional fields are included for your review

proc sql;
    select r.id, r.year_start, r.group_name, r.test_score, 
           put(intnx('year', r.year_start, -1, 's'), yymmdd10.) as year_ago,
           (select count(*) from ranktable sub 
            where sub.test_score >= r.test_score
            and sub.group_name = r.group_name
            and sub.year_start <= r.year_start
            and sub.year_start >= intnx('year', r.year_start, -1, 's')) as num_rank,    
           (select count(*) from ranktable sub 
            where sub.group_name = r.group_name
            and sub.year_start <= r.year_start
            and sub.year_start >= intnx('year', r.year_start, -1, 's')) as denom_rank,    
           calculated num_rank / calculated denom_rank as rank
    from ranktable r;
quit;

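Output

You will notice slight differences from your expected results, because you apply a year with a quarter day (365.25 days) to every year, whereas SAS's intnx steps back a full calendar year, whose number of days varies from year to year.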
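To make the boundary difference concrete, here is a minimal Python sketch (not SAS) comparing the two look-back rules for the 19961231 row. `lookback_calendar_year` is a hypothetical helper that only approximates what intnx('year', d, -1, 's') is understood to return: the same month and day one year earlier, with February 29 falling back to February 28.

```python
from datetime import datetime, timedelta

def lookback_days(d, days=365.25):
    """Window start used by the Excel formula: a fixed 365.25 days back."""
    return d - timedelta(days=days)

def lookback_calendar_year(d):
    """Approximation of intnx('year', d, -1, 's'): same day, previous year."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        # February 29 has no counterpart in a non-leap year.
        return d.replace(year=d.year - 1, day=28)

current = datetime(1996, 12, 31)   # id 1, year_start 19961231
earlier = datetime(1995, 12, 31)   # id 1, year_start 19951231

excel_start = lookback_days(current)
sas_start = lookback_calendar_year(current)

# Strict lower bound, as in the Excel formula (">" & B1 - 365.25).
in_excel = excel_start < earlier <= current
# Inclusive lower bound, as in the PROC SQL query above (>=).
in_sas = sas_start <= earlier <= current

print(excel_start, in_excel)  # prints 1995-12-31 18:00:00 False
print(sas_start, in_sas)      # prints 1995-12-31 00:00:00 True
```

Because 1996 is a leap year, the 365.25-day rule starts the window after the 19951231 observation, while the calendar-year rule with an inclusive bound keeps it. Under these assumptions the denominator for that row goes from 2 to 3.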

Answer 1: score 1, accepted, 2017-04-07 16:11:30

Answer 2: score 0, 2017-04-07 17:29:55
