SAS do loop + lag function?

This is my first post, so please let me know if I'm not clear enough. My approach to the problem below is a do loop with a lag, but the result is rubbish. Here is my dataset:

data a;
/* List input with the colon modifier reads each date wherever it starts on the line */
input obs mindate :mmddyy10. maxdate :mmddyy10.;
format mindate maxdate date9.;
datalines;
1   01/02/2013 01/05/2013
2   01/02/2013 01/05/2013
3   01/02/2013 01/05/2013
4   01/03/2013 01/06/2013
5   02/02/2013 02/08/2013
6   02/02/2013 02/08/2013
7   02/02/2013 02/08/2013
8   03/10/2013 03/11/2013
9   04/02/2013 04/22/2013
10  04/10/2013 04/22/2013
11  05/04/2013 05/07/2013
12  06/10/2013 06/20/2013
;
run;

Now, I'm trying to produce a new column, "Replacement", based on the following logic:

  1. If a record's mindate occurs before its lag's maxdate, it cannot be a replacement for it. If it cannot be a replacement, skip forward (so 2, 3, 4 cannot replace 1, but 5 can).
  2. Otherwise, if the mindate falls less than 30 days after that maxdate, Replacement = Y. If not, Replacement = N. Once a record replaces another (in this case 5 replaces 1, because 02/02/2013 is less than 30 days after 01/05/2013), it cannot also serve as a replacement for another record. But if it is an N for one record above, it can still be a Y for some other record. So 6 is now evaluated against 2, 7 against 3, etc. Since those two combos are both "Y", 8 is now evaluated against 4, but because its mindate is more than 30 days after 4's maxdate, it is an N. But it is then evaluated against ...
  3. And so on...
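The day counts behind these rules can be checked directly: SAS stores dates as day counts, so subtracting one from another gives the gap in days. A minimal sketch using date literals copied from the dataset above (the variable names gap_5_vs_1, gap_8_vs_4 and gap_8_vs_5 are only illustrative):

```sas
data _null_;
    /* obs 5 mindate minus obs 1 maxdate */
    gap_5_vs_1 = '02FEB2013'd - '05JAN2013'd;
    /* obs 8 mindate minus obs 4 maxdate */
    gap_8_vs_4 = '10MAR2013'd - '06JAN2013'd;
    /* obs 8 mindate minus obs 5 maxdate */
    gap_8_vs_5 = '10MAR2013'd - '08FEB2013'd;
    /* Write the three gaps to the log */
    put gap_5_vs_1= gap_8_vs_4= gap_8_vs_5=;
run;
```

The gaps are 28, 63 and 30 days. The last one sits exactly on the boundary, so a Y for obs 8 in the expected output seems to need "within 30 days" to be read as inclusive rather than strictly "less than 30".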

I should note that in a 100-record dataset, this implies that the 100th record could technically replace the 1st, so I've been trying lags within loops. Any tips/help is greatly appreciated!

Expected output:

                      obs      mindate      maxdate    Replacement

                        1    02JAN2013    05JAN2013
                        2    02JAN2013    05JAN2013
                        3    02JAN2013    05JAN2013
                        4    03JAN2013    06JAN2013
                        5    02FEB2013    08FEB2013         Y
                        6    02FEB2013    08FEB2013         Y
                        7    02FEB2013    08FEB2013         Y
                        8    10MAR2013    11MAR2013         Y
                        9    02APR2013    22APR2013         Y
                       10    10APR2013    22APR2013         N
                       11    04MAY2013    07MAY2013         Y
                       12    10JUN2013    20JUN2013         Y

Here is a solution using SQL and hash tables. It is not optimal, but it was the first method that sprang to mind.

/* Join the input with itself */
proc sql;
    create table b as
    select 
        a1.obs, 
        a2.obs as obs2
    from a as a1
    inner join a as a2
        /* Set the replacement criteria */
        on a1.maxdate < a2.mindate <= a1.maxdate + 30
    order by a2.obs, a1.obs;
quit;
/* Create a mapping for replacements */
data c;
    set b;
    /* Create two empty hash tables so we can look up the used observations */
    if _N_ = 1 then do;
        declare hash h();
        h.definekey("obs");
        h.definedone(); 
        declare hash h2();
        h2.definekey("obs2");
        h2.definedone();
    end;
    /* Check if we've already used this observation as a replacement */
    if h2.find() then do;
        /* Check if we've already replaced this observation */
        if h.find() then do;
            /* Add the observations to the hash table and output */
            h2.add();
            h.add();
            output;
        end;
    end;
run;
/* Combine the replacement map with the original data */
proc sql;
    select 
        a.*, 
        ifc(c.obs, "Y", "N") as Replace, 
        c.obs as Replaces
    from a
    left join c
        on a.obs = c.obs2
    order by a.obs;
quit;

There are several ways in which this can be simplified:

  • The dates can be brought through the first proc sql
  • The if statements can be combined
  • The final join could be replaced by a little extra logic in the data step
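As an illustration of the second point, the two nested if statements could be merged into one condition. This is only a sketch, assuming the table b produced by the first proc sql above. It also swaps find() for the hash check() method, which tests whether a key exists without retrieving data; that swap is a choice made here, not part of the original answer.

```sas
/* Create a mapping for replacements, with the two checks combined */
data c;
    set b;
    /* Key-only hash tables tracking the observations already used */
    if _N_ = 1 then do;
        declare hash h();
        h.definekey("obs");
        h.definedone();
        declare hash h2();
        h2.definekey("obs2");
        h2.definedone();
    end;
    /* check() returns 0 when the key is already stored, so a nonzero */
    /* result from both tables means neither observation has been used */
    if h.check() ne 0 and h2.check() ne 0 then do;
        h.add();
        h2.add();
        output;
    end;
run;
```

Because check() has no side effects, the order in which the two conditions are evaluated does not matter.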

I think this is correct if the asker was mistaken about replacement = Y for obs = 12.

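/* Get the number of obs so we can size a temporary array to hold the dataset. */
/* HAVE is the input dataset (the question calls it A). */
data _null_;
    set have nobs = nobs;
    /* symputx stores the count without leading blanks or a conversion note */
    call symputx("nobs", nobs);
    stop;
run;

data want;
    /* Row 1 of the array holds each maxdate; row 2 flags records already replaced */
    array dates[2,&NOBS] _temporary_;
    /* Load the dataset into the temporary array on the first iteration. */
    /* A separate index j is used because reusing _n_ as the loop index would */
    /* leave it at the last row number while the first observation is processed. */
    if _n_ = 1 then do j = 1 by 1 until(eof);
        set have end = eof;
        dates[1,j] = maxdate;
        dates[2,j] = 0;
    end;

    set have;

    length replacement $1;

    replacement = 'N';
    /* Scan earlier records in order and take the first unused one within 30 days */
    do i = 1 to _n_ - 1 until(replacement = 'Y');
        if dates[2,i] = 0 and 0 <= mindate - dates[1,i] <= 30 then do;
            replacement = 'Y';
            dates[2,i] = _n_;
            replaces = i;
        end;
    end;
    drop i j;
run;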

You could use a hash object and hash iterator instead of a temporary array if you prefer. I've also included an extra variable, replaces, to show which previous row each row replaces.
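The remark about obs = 12 in this answer can be checked with plain date arithmetic. A minimal sketch, with date literals copied from the question's dataset and illustrative variable names:

```sas
data _null_;
    /* obs 12 mindate minus obs 11 maxdate */
    gap_12_vs_11 = '10JUN2013'd - '07MAY2013'd;
    /* obs 12 mindate minus the maxdate shared by obs 9 and obs 10 */
    gap_12_vs_10 = '10JUN2013'd - '22APR2013'd;
    /* Write both gaps to the log */
    put gap_12_vs_11= gap_12_vs_10=;
run;
```

The gaps are 34 and 49 days, and every earlier record is further away still, so under a 30-day window obs 12 has nothing it can replace.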
