How to do conditional count based on row value in SAS/SQL?
Re-uploading since there were some problems with my last post, and I did not know that we were supposed to post sample data. I'm fairly new to SAS, and I have a problem that I know how to solve in Excel but not in SAS. However, the dataset is too large to reasonably use in Excel.
I have four variables: id, year_start, group_name, test_score.
Sample data:
id year_start group_name test_score
1 19931231 Red 90
1 19941230 Red 89
1 19951231 Red 91
1 19961231 Red 92
2 19930630 Red 85
2 19940629 Red 87
2 19950630 Red 95
3 19950931 Blue 90
3 19960931 Blue 90
4 19930331 Red 95
4 19940331 Red 97
4 19950330 Red 98
4 19960331 Red 95
5 19931231 Red 96
5 19941231 Red 97
My goal is to produce a fractional rank by test_score for each year. I hoped to achieve this with PROC RANK and its FRACTION option. That would order the rows by test_score (highest is 1, second highest is 2, and so on) and then divide by the total number of observations to give a fractional rank. Unfortunately, year_start differs widely from row to row. For each id/year combination, I want to look back one year from year_start and rank that observation against all other ids that have a year_start within that one-year range. I'm not interested in comparing by calendar year; the rank of each id should be relative to its own year_start. As a further complication, I would like the ranking to be done within each group_name.
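For reference, a plain PROC RANK call with the FRACTION option might look like the sketch below. The dataset name testdata and the output names ranked and rank are assumptions for illustration. It ranks within each group_name over all rows at once, so it has no notion of a rolling one-year window relative to each row's year_start.

```sas
/* Sketch only: testdata, ranked and rank are assumed names. */
proc sort data=testdata;
  by group_name;              /* BY processing needs sorted input */
run;

proc rank data=testdata out=ranked fraction descending ties=low;
  by group_name;              /* rank separately within each group */
  var test_score;             /* highest score gets rank 1 before dividing */
  ranks rank;                 /* new variable holding rank / n */
run;
```

Because the denominator here is the whole group rather than the rows inside each look-back window, the result will generally differ from the ranks wanted in this question.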
PROC SQL is totally fine if someone has a SQL solution.
Using the above data, the ranks would be like this:
id year_start group_name test_score rank
1 19931231 Red 90 0.75
1 19941230 Red 89 0.8
1 19951231 Red 91 1
1 19961231 Red 92 1
2 19930630 Red 85 1
2 19940629 Red 87 0.8
2 19950630 Red 95 0.75
3 19950931 Blue 90 1
3 19960931 Blue 90 1
4 19930331 Red 95 1
4 19940331 Red 97 0.2
4 19950330 Red 98 0.2
4 19960331 Red 95 0.333
5 19931231 Red 96 0.25
5 19941231 Red 97 0.667
In order to calculate the rank for row 1,
In Excel, the formula for this variable would look something like this. Assume the formula is for row 1 and there are 100 rows, with id in column A, year_start in B, group_name in C, and test_score in D:
=(1+countifs(D1:D100,">"&D1,
B1:B100,"<="&B1,
B1:B100,">"&B1-365.25,
C1:C100, C1))/
countifs(B1:B100,"<="&B1,
B1:B100,">"&B1-365.25,
C1:C100, C1)
Thanks so much for the help!
ahammond428
Your example isn't correct if I'm reading it correctly, so it's hard to know exactly what you're trying to do. But try the following and see if it works. You may need to tweak the inequalities to be open or closed depending on whether you want to include the date exactly one year earlier. Note that your year_start column needs to be imported as a SAS date for this to work. Otherwise you can convert it with input(year_start, yymmdd8.) if it is a character string, or input(put(year_start, 8.), yymmdd8.) if it is a numeric value such as 19931231.
proc sql;
  select distinct
    a.id,
    a.year_start,
    a.group_name,
    a.test_score,
    /* 1 + number of rows in the window with a strictly higher score */
    1+sum(case when b.test_score > a.test_score then 1 else 0 end) as rank_num,
    /* number of rows in the window */
    count(b.id) as rank_denom,
    calculated rank_num / calculated rank_denom as rank
  from testdata a left join testdata b
    on a.group_name = b.group_name
    /* b falls in the year ending at a.year_start (both ends included) */
    and intnx('year',a.year_start,-1,'s') le b.year_start le a.year_start
  group by a.id, a.year_start, a.group_name, a.test_score
  order by id, year_start;
quit;
Note that I changed the dates of 9/31 to 9/30 (since there is no 9/31), but left 3/30, 6/29, and 12/30 alone since perhaps those were intended, though the other dates seem to be quarter-end.
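Impossible dates such as 9/31 can be located before converting the column. The sketch below assumes a dataset named testdata whose year_start still holds raw numbers like 19950931; the names bad_dates and sas_date are made up for illustration.

```sas
/* Assumption: testdata.year_start is numeric, e.g. 19950931 (not yet a SAS date). */
data bad_dates;
  set testdata;
  /* ?? suppresses the invalid-data messages; impossible dates become missing */
  sas_date = input(put(year_start, 8.), ?? yymmdd8.);
  if missing(sas_date);       /* keep only the rows that failed to convert */
run;
```

Any rows that end up in bad_dates need a decision (for example 9/31 to 9/30) before the window comparison can be trusted.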
Consider correlated count subqueries in SQL:
DATA
data ranktable;
  infile datalines missover;
  input id year_start group_name $ test_score;
  datalines;
1 19931231 Red 90
1 19941230 Red 89
1 19951231 Red 91
1 19961231 Red 92
2 19930630 Red 85
2 19940629 Red 87
2 19950630 Red 95
3 19950930 Blue 90
3 19960930 Blue 90
4 19930331 Red 95
4 19940331 Red 97
4 19950330 Red 98
4 19960331 Red 95
5 19931231 Red 96
5 19941231 Red 97
;
run;
data ranktable;
  set ranktable;
  format year_start date9.;
  /* turn the number 19931231 into a real SAS date */
  year_start = input(put(year_start,z8.),yymmdd8.);
run;
PROC SQL
Additional fields are included for your review.
proc sql;
  select r.id, r.year_start, r.group_name, r.test_score,
         /* start of the one-year look-back window, shown for review */
         put(intnx('year', r.year_start, -1, 's'), yymmdd10.) as year_ago,
         /* rows in the same group and window scoring at least as high */
         (select count(*) from ranktable sub
           where sub.test_score >= r.test_score
             and sub.group_name = r.group_name
             and sub.year_start <= r.year_start
             and sub.year_start >= intnx('year', r.year_start, -1, 's')) as num_rank,
         /* all rows in the same group and window */
         (select count(*) from ranktable sub
           where sub.group_name = r.group_name
             and sub.year_start <= r.year_start
             and sub.year_start >= intnx('year', r.year_start, -1, 's')) as denom_rank,
         calculated num_rank / calculated denom_rank as rank
  from ranktable r;
quit;
OUTPUT
You will notice a slight difference from your expected results, which may be due to the fixed 365.25-day year you apply to every row, whereas SAS's intnx steps back one full calendar year, whose length in days changes from year to year.
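To see where the two window definitions part ways, a small check along the following lines may help. It is only a sketch: the variable names are made up, and the comparison operators are copied from the Excel formula (strict >) and from the query above (>=).

```sas
data _null_;
  d     = '31DEC1996'd;                       /* row being ranked; 1996 is a leap year */
  prior = '31DEC1995'd;                       /* same calendar date one year earlier */
  excel_cutoff = d - 365.25;                  /* fixed-length look-back from the Excel formula */
  sas_cutoff   = intnx('year', d, -1, 's');   /* same day, previous calendar year */
  in_excel = (prior >  excel_cutoff);         /* strict > as in the COUNTIFS */
  in_sas   = (prior >= sas_cutoff);           /* >= as in the query above */
  put in_excel= in_sas=;
run;
```

For this pair of dates the log should show in_excel=0 and in_sas=1, because the span from 31DEC1995 to 31DEC1996 contains 29 February and is 366 days long. When the span has no leap day, the Excel window also includes the prior-year row, so the two approaches only disagree on some rows.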