How can I loop multiple datasets through a data step or PROC SQL query in SAS?

I have multiple datasets (100+) that all contain the same 3 columns (code_num, replicate, total_qty) each with a distinct code (code_num).

data code_num_1
code_num replicate total_qty
12345       376       45
12345       76        67
12345       943       300
.
.

data code_num_2
code_num replicate total_qty
12234       85       746
12234       900      35
12234       726      273
.
.

and so on.

I would like to run those datasets through a data step if possible:

data test;
set test_; /* datasets will go here... */
if _N_ in(&PercentileRow10,&PercentileRow20,&PercentileRow30,&PercentileRow40,&PercentileRow50,&PercentileRow60,&PercentileRow70, &PercentileRow80,&PercentileRow90);
run;

*Note: &percentilerow is a macro variable that will obtain the percentiles from the datasets. The column quantity will determine percentiles. I have this step beforehand:

proc sql noprint;
create table ___ as
select code_num, replicate, sum(qty) as total_qty
from ____
group by code_num, replicate
order by total_qty;
quit;

Ideally, I would like to obtain the percentiles of each dataset and create a new dataset that holds each percentile, the replicate at which it occurred, and the total quantity. Could I use a macro and a do loop to run my datasets through this data step to produce the new datasets?

data code_num_1_perc
percentile replicate qty
10           87      45
20           933     65
30           34      100
.
.
90           467      837

This is my ideal output for each dataset code_num_#, if possible.

If I understand the requirements correctly, the proposed methodology is flawed.

For example, the median (50th percentile) of a series such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 is 5.5. 5.5 is not a value in the data set so how would a replicate number be selected?
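To see this in SAS, here is a minimal sketch. The data set name series and the variable x are made up for illustration, and PROC MEANS is left at its default percentile definition, which averages the two middle values when the count is even.

```sas
/* Build the series 1, 2, ..., 10 (hypothetical data set and variable names). */
data series;
  do x = 1 to 10;
    output;
  end;
run;

/* Request the median; with the default percentile definition the     */
/* reported value is the average of 5 and 6, which is not a value     */
/* that occurs in the data.                                            */
proc means data=series median;
  var x;
run;
```

Because the reported median is not one of the rows, there is no replicate to attach to it, which is the problem with picking rows by percentile position.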

My recommendation would be a different process altogether. Look into PROC RANK to see how ties are handled and decide how you'd like them handled. You didn't specify which variable would be used to calculate the percentiles; the steps below assume total_qty.
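As a rough way to explore tie handling, the sketch below ranks a small made-up series that contains a tie using two different TIES= settings. The data set names ties_demo, ranked_low, and ranked_high and the rank variable qty_rank are hypothetical; only the total_qty name comes from the question.

```sas
/* Small made-up sample with one tied value (20 appears twice). */
data ties_demo;
  input total_qty @@;
  datalines;
10 20 20 30 40
;
run;

/* TIES=LOW gives tied values the lowest of their ranks. */
proc rank data=ties_demo out=ranked_low ties=low;
  var total_qty;
  ranks qty_rank;
run;

/* TIES=HIGH gives tied values the highest of their ranks. */
proc rank data=ties_demo out=ranked_high ties=high;
  var total_qty;
  ranks qty_rank;
run;
```

Printing both outputs side by side shows how the tied rows move, which helps decide which setting matches the intended definition of a percentile.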

  1. Combine all data sets into one, adding in a data set identifier to uniquely identify each data set.
data combined;
length source data_set_name $50.;
set code_num_: indsname = source;
data_set_name = source;
run;
  2. Use PROC RANK to group into deciles
proc rank data=combined out=combined_deciles groups=10;
by data_set_name;
var total_qty;
ranks PRanks;
run;
  3. Get the first (or last, based on requirements) value for each rank
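/* BY-group processing needs sorted input; total_qty orders rows within each rank */
proc sort data=combined_deciles;
by data_set_name PRanks total_qty;
run;

data want;
set combined_deciles;
by data_set_name PRanks;
if first.PRanks;
run;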
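To get closer to the layout asked for in the question (percentile, replicate, quantity), the decile ranks can be turned into percentile labels. This is only a sketch under assumptions: it assumes want keeps the lowest total_qty row of each decile, that PRanks runs from 0 to 9 as PROC RANK produces with GROUPS=10, and that the first row of decile k is an acceptable stand-in for the (10 x k)th percentile. The names want_perc and percentile are hypothetical.

```sas
/* Label each kept row with the percentile it approximates.            */
data want_perc;
  set want;
  if PRanks >= 1;               /* rank 0 starts at the minimum, so it marks no cut point */
  percentile = PRanks * 10;     /* rank 1 -> 10, rank 2 -> 20, ..., rank 9 -> 90 */
  keep data_set_name percentile replicate total_qty;
run;
```

The result stays in one table keyed by data_set_name, so a macro loop over 100+ data sets is not needed; a WHERE clause on data_set_name pulls out the rows for any single source. Small data sets may not populate every decile, in which case some percentile rows will be absent.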
