SAS integrating RETAIN and computing new variables within DO until loop

I have been struggling with some code. I have clinical data where each row represents one admission. I would like to restructure the data so that each row represents one patient. However, there are a number of tasks I would like to carry out during the process:

  1. Patient-specific information is mostly, but not always, contained in the first admission. For example, gender is always on the first admission, but the date of death (deathdate) can be in the row of any admission. That's why I would like to retain the value of deathdate when it is not missing!

  2. I would like to always keep some specific data from the first admission, such as tumor stage.

  3. I would like to perform some operations between values from different admissions, for example: calculate the difference between the two admission dates.

How can I do this most efficiently, creating the fewest new variables and using the fewest DATA, DROP and RENAME statements?

Please find examples of the data I have and the data I want further below.

I used to use a DO loop like this:

data want;
    do until (Last.ID);
        set have;
        by ID;
        select (admission);
            when ('1') do; GenderNew = Gender; StageNew = stage; deathdate1 = deathdate; admission_date1 = admission_date; end;
            when ('2') do; deathdate2 = deathdate; admission_date2 = admission_date; end;
            otherwise;
        end;
    end;
    drop admission gender stage deathdate admission_date;
run;

data want; set want;
format deathdate ddmmyy10.;
rename GenderNew = Gender StageNew = Stage;
Duration = admission_date2 - admission_date1;
deathdate = max(deathdate2, deathdate1);
drop admission_date1 admission_date2 deathdate1 deathdate2;
run;

However, my method is annoying, especially because I need to create many new variables from the first observation instead of retaining them somehow. I have about 100 variables that I need to keep, and it does not make sense to create 100 new variables.

Is there a more efficient way?

Thanks in advance.

Data example:

data have;
input id admission gender $ stage admission_date deathdate;
format deathdate ddmmyy10.;
cards;
1 1 m 2 5000 .
1 2 . . 5100 6500
2 1 f 1 5600 6600
2 2 . . 5900 .
3 1 f 4 5627 .
3 2 . 3 5830 7000
3 3 . 1 6000 .
;
run;

data want;
input id gender stage Duration deathdate;
format deathdate ddmmyy10.;
cards;
1 m 2 100 6500
2 f 1 300 6600
3 f 4 373 7000
;
run;

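My favorite way to "flatten" datasets is the self-update. By default, UPDATE does not let missing values in the transaction records overwrite the master, so each variable ends up holding its last non-missing value per ID.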

data want;
  update have(obs=0) have;
  by id;
  keep id gender stage deathdate;
run;
  

Now, this doesn't calculate duration, but that's not hard to add.

data want;
  update have(obs=0) have;
  by id;
  retain first_stage;
  rename first_stage = stage;
  keep id gender first_stage duration deathdate;
  duration = admission_date-lag(admission_date);
  if first.id then first_stage = stage;
run;
  

Duration will only actually be saved for the last record of each ID. If every ID has 2 (or more) records then you don't need to qualify it; otherwise add another line after the duration calculation: if first.id then call missing(duration); (which doesn't hurt in any event; just don't put the if around the duration calculation itself, as then lag doesn't work properly).
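As a rough sketch of the paragraph above, the complete step with the extra line could look like the following. It assumes the have dataset from the question, sorted by id; nothing else is new.

```sas
/* Sketch only: assumes the have dataset from the question, sorted by id */
data want;
  update have(obs=0) have;
  by id;
  retain first_stage;
  /* LAG runs on every iteration, so its queue always holds the previous row */
  duration = admission_date - lag(admission_date);
  /* On the first row of an ID the lagged date belongs to the previous ID */
  if first.id then call missing(duration);
  if first.id then first_stage = stage;
  rename first_stage = stage;
  keep id gender first_stage duration deathdate;
run;
```

With this version an ID that has a single admission gets a missing duration instead of a difference against the previous patient's last admission date.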

This may not solve your other issue, though, as I don't know why you'd have 100s of variables. The other simple option is proc transpose, and then you work with what you get out of that.
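A minimal sketch of the proc transpose route, for illustration only. It assumes the have dataset from the question (sorted by id) and that duration means the gap between the first two admissions, as in the question's own code; adm_wide and durations are made-up dataset names.

```sas
/* Sketch only: one row per id, one column per admission number */
proc transpose data=have out=adm_wide(drop=_name_) prefix=admission_date;
  by id;
  id admission;          /* admission 1, 2, ... becomes the column suffix */
  var admission_date;
run;

/* Work with the wide layout: admission_date1, admission_date2, ... */
data durations;
  set adm_wide;
  duration = admission_date2 - admission_date1;
  keep id duration;
run;
```

The result can then be merged by id with the flattened dataset. Each additional variable that needs cross-admission arithmetic would need its own transpose, which is why this suits a handful of variables better than a hundred.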

I propose the following scalable solution.

It is scalable because you don't need to create copies of the 100 variables you mention. Instead, you just need to list the variables whose kept value should be the value from the first record by ID, and the variables whose kept value should be the last non-missing value.

Note: The code where you create the have and want datasets is copied here with two fixes: (i) respective $ symbols were added to the two input statements to state that gender is character; (ii) the calculation of Duration for id = 2 in the want dataset was fixed to the correct value of 300.

data have;
input id admission gender $ stage admission_date deathdate;
format deathdate ddmmyy10.;
cards;
1 1 m 2 5000 .
1 2 . . 5100 6500
2 1 f 1 5600 6600
2 2 . . 5900 .
3 1 f 4 5627 .
3 2 . 3 5830 .
;
run;

data want;
input id gender $ stage Duration deathdate;
format deathdate ddmmyy10.;
cards;
1 m 2 100 6500
2 f 1 300 6600
3 f 4 203 .
;
run;


/* PARAMETER DEFINITION
Define the variables to keep from first record by ID and the variables to keep
from last non-missing occurrence.

Note that we need to separate the CHAR variables from the NUMERIC variables
because the variables are listed in an ARRAY below and ARRAY variables must all 
be of the same type.

The same should be done for the last non-missing value to keep if there happen
to be both CHAR and NUMERIC variables (which is not the case here).

The code below assumes that there is at least ONE variable in each of the
variable lists defined below.
If this is not always the case, appropriate %IF statements could be added to
check if the number of variables is equal to 0 before defining their respective
array and updating them. Search for SAS macro programming.
*/

* Variables to keep from first record by ID;
%let keep_from_first_record_char = gender;
%let keep_from_first_record_num = id stage;

* Variables to keep from last non-missing occurrence;
%let keep_last_non_missing = deathdate;

* Number of variables in each group (used when defining the arrays below);
%let n_keep_from_first_record_char = %sysfunc(countw(&keep_from_first_record_char));
%let n_keep_from_first_record_num = %sysfunc(countw(&keep_from_first_record_num));
%let n_keep_last_non_missing = %sysfunc(countw(&keep_last_non_missing));


/* DATA PROCESS
Flatten the input dataset to one record per ID
*/

* The dataset is assumed to be sorted by ID;
data flattened;
    format id gender stage Duration deathdate;
    keep id gender stage Duration deathdate;
    set have;

    * Array definition of permanent variables;
    array arr_keep_from_first_record_char(*) $ &keep_from_first_record_char;
    array arr_keep_from_first_record_num(*) &keep_from_first_record_num;
    array arr_keep_last_non_missing(*) &keep_last_non_missing;

    * Array definition of temporary variables;
    array tmp_keep_from_first_record_char(&n_keep_from_first_record_char) $ _ctmp1-_ctmp&n_keep_from_first_record_char;
    array tmp_keep_from_first_record_num(&n_keep_from_first_record_num) _ntmp1-_ntmp&n_keep_from_first_record_num;
    array tmp_keep_last_non_missing(&n_keep_last_non_missing) _TEMPORARY_;
    * Retain the variables that store the first observed value by ID;
    retain _ctmp1-_ctmp&n_keep_from_first_record_char;
    retain _ntmp1-_ntmp&n_keep_from_first_record_num;

    * ID variable that groups the records;
    by id;

    * 1) Store first observed value for keep_from_first and set to missing keep_last_non_missing;
    if first.id then do;
        do i = 1 to dim(arr_keep_from_first_record_char);
            tmp_keep_from_first_record_char(i) = arr_keep_from_first_record_char(i);
        end;
        do i = 1 to dim(arr_keep_from_first_record_num);
            tmp_keep_from_first_record_num(i) = arr_keep_from_first_record_num(i);
        end;
        do i = 1 to dim(arr_keep_last_non_missing);
            call missing(tmp_keep_last_non_missing(i));
        end;
    end;

    * 2) Store last non-missing value found;
    do i = 1 to dim(arr_keep_last_non_missing);
        if not missing(arr_keep_last_non_missing(i)) then
            tmp_keep_last_non_missing(i) = arr_keep_last_non_missing(i);
    end;

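    * 3) Compute other variables;
    * LAG must run on every record. Calling it inside the IF would update its
      queue only on non-first records and return a value from the wrong row;
    admission_date_prev = lag(admission_date);
    if not first.id then
        Duration = admission_date - admission_date_prev;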

    * 4) Set values of variables to output for this ID;
    if last.id then do;
        do i = 1 to dim(arr_keep_from_first_record_char);
            arr_keep_from_first_record_char(i) = tmp_keep_from_first_record_char(i);
            put i= arr_keep_from_first_record_char(i) tmp_keep_from_first_record_char(i);
        end;
        do i = 1 to dim(arr_keep_from_first_record_num);
            arr_keep_from_first_record_num(i) = tmp_keep_from_first_record_num(i);
        end;
        do i = 1 to dim(arr_keep_last_non_missing);
            arr_keep_last_non_missing(i) = tmp_keep_last_non_missing(i);
        end;
        output;
    end;

    drop i;
run;

* Check if flattened = want;
proc compare base=want compare=flattened; run;
