
Split SAS datasets by column with primary key

I have a dataset with one primary key, unique_id, and 1200 other variables. The dataset is generated by a macro, so the number of columns is not fixed. I need to split it into 4 or more datasets of 250 variables each, and each of the smaller datasets must contain the primary key so that I can merge them back later. Can somebody help me with either a SAS function or a macro to solve this? Thanks in advance.

A simple way to split a dataset in the way you request is to use a single data step with multiple output datasets, where each one has a KEEP= dataset option listing the variables to keep. For example:

data split1(keep=Name Age Height) split2(keep=Name Sex Weight);
  set sashelp.class;
run;
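Since the goal is to merge the pieces back later, here is a minimal sketch of that round trip for the SASHELP.CLASS example. It assumes the key (Name) is unique in every piece. The dataset names rejoined and class_sorted are made up for illustration.

```sas
/* Sort each piece by the key so that MERGE ... BY can be used */
proc sort data=split1; by Name; run;
proc sort data=split2; by Name; run;

/* Put the pieces back together, one row per key value */
data rejoined;
  merge split1 split2;
  by Name;
run;

/* Optional check against the source, sorted the same way.
   PROC COMPARE matches variables by name, so a different
   column order in REJOINED is not a problem. */
proc sort data=sashelp.class out=class_sorted; by Name; run;
proc compare base=class_sorted compare=rejoined; run;
```

The same pattern extends to any number of pieces: list them all on the MERGE statement.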

So you need to get the list of variables and group them into sets of 250 or fewer. Then you can use those groupings to generate code like the above. Here is one method that uses PROC CONTENTS to get the list of variables and CALL EXECUTE() to generate the code.

I will use macro variables to hold the name of the input dataset, the key variable that needs to be kept on each dataset, and the maximum number of variables to keep in each dataset.

So for the example above those macro variable values would be:

%let ds=sashelp.class;
%let key=name;
%let nvars=2;

So use PROC CONTENTS to get the list of variable names:

proc contents data=&ds noprint out=contents; run;

Now run a data step to split them into groups and generate a member name to use for the new split dataset. Make sure not to include the KEY variable in the list of variables when counting.

data groups;
  length group 8 memname $41 varnum 8 name $32 ;
  group +1;
  memname=cats('split',group);
  do varnum=1 to &nvars while (not eof);
    set contents(keep=name where=(upcase(name) ne %upcase("&key"))) end=eof;
    output;
  end;
run;

Now you can use that dataset to drive the generation of the code:

data _null_;
  set groups end=eof;
  by group;
  if _n_=1 then call execute('data ');
  if first.group then call execute(cats(memname,'(keep=&key'));
  call execute(' '||trim(name));
  if last.group then call execute(') ');
  if eof then call execute(';set &ds;run;');
run;

Here are results from the SAS log:

NOTE: CALL EXECUTE generated line.
1    + data
2    + split1(keep=name
3    +  Age
4    +  Height
5    + )
6    + split2(keep=name
7    +  Sex
8    +  Weight
9    + )
10   + ;set sashelp.class;run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.SPLIT1 has 19 observations and 3 variables.
NOTE: The data set WORK.SPLIT2 has 19 observations and 3 variables.

Just another way of doing it using macro variables:

/* Number of columns you want in each chunk */

%let vars_per_part = 250;

/* Get all the column names into a dataset */

proc contents data = have out=cols noprint;
run;

%macro split(part);

  /* Select the column names for this part (up to &vars_per_part rows of COLS) into a macro variable */

  %let fobs = %eval((&part - 1)* &vars_per_part + 1);
  %let obs  = %eval(&part * &vars_per_part);
  proc sql noprint;
    select name into :cols separated by " " from cols (firstobs =  &fobs obs = &obs) where name ~= "uniq_id";
    quit;

  /* Chunk up the data, keeping only those variables and uniq_id */
  data want_part&part;
    set have (keep = &cols uniq_id);
  run;

%mend;

/* Run this from 1 up to however many parts are needed to cover all the columns */

%split(1);
%split(2);
%split(3);
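Rather than typing each call by hand, the number of parts can be derived from the row count of cols. The following is only a sketch: the macro name split_all is hypothetical, and it assumes %split, cols and &vars_per_part are defined as above. Because %split applies FIRSTOBS=/OBS= before filtering out the key, the key's own row counts toward the total here.

```sas
/* Hypothetical driver: call %split once per chunk of COLS */
%macro split_all;
  %local nobs nparts i;

  /* COLS has one row per variable in HAVE (including the key) */
  proc sql noprint;
    select count(*) into :nobs from cols;
  quit;

  /* Round up so that a partial last chunk is still processed */
  %let nparts = %sysevalf(&nobs / &vars_per_part, ceil);

  %do i = 1 %to &nparts;
    %split(&i)
  %end;
%mend split_all;

%split_all
```

One caveat with this approach: a part whose slice of cols contains only the key row selects no names, so &cols would not be refreshed for that part.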

This is not a complete solution, but it may give you another angle on the problem. The previous solutions rely mostly on PROC CONTENTS and the data step; I would solve this with PROC SQL and dictionary.columns instead, and write a macro that splits the original file into as many parts as needed, 250 columns each (249 variables plus the key). The steps, roughly:

proc sql;
    create table _colstemp as
    select * from dictionary.columns
    /* libname and memname are stored in upper case; the name comparison is case-sensitive */
    where libname = 'YOUR_LIBRARY' and memname = 'YOUR_TABLE' and name ne 'YOUR_PRIMARY_KEY';
quit;

Count the number of datasets needed, along these lines:

proc sql; 
    select ceil(count(*)/249) into :num_of_datasets from _colstemp; 
    select count(*) into :num_of_cols from _colstemp; 
quit;

Then just loop over the original dataset like:

/* %do is only valid inside a macro definition (%macro ... %mend) */
%do _i = 1 %to &num_of_datasets;

proc sql noprint;
    /* obs= is the number of the last row to read, not a row count */
    select name into :vars separated by ','
    from _colstemp(firstobs=%eval((&_i. - 1)*249 + 1) obs=%eval(&_i. * 249));
quit;

proc sql;
    create table split_&_i. as
    select YOUR_PRIMARY_KEY, &vars from YOUR_ORIGINAL_TABLE;
quit;
%end;

Hopefully this gives you another idea. The solution is not tested and may contain some pseudocode elements, as it is written from my memory of doing things. It also lacks the macro declaration and much of the parametrization one could add to make the solution more general (for example, the number of variables per dataset, the primary key name, and the dataset names).
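For illustration, here is one way the steps above could be wrapped into a parametrized macro. This is a sketch under assumptions, not the answerer's code: the macro name split_by_cols and all of its parameters are hypothetical, variable names are assumed to be ordinary SAS names (no embedded blanks), and a data step with KEEP= replaces the final CREATE TABLE so the names can be separated by spaces.

```sas
/* Hypothetical macro: split LIB.MEM into datasets holding the key
   plus at most CHUNK other variables each */
%macro split_by_cols(lib=WORK, mem=HAVE, key=unique_id, chunk=249, prefix=split_);
  %local n_cols n_parts i first last vars;

  proc sql noprint;
    /* One row per non-key variable, in original column order.
       LIBNAME and MEMNAME are stored in upper case. */
    create table _colstemp as
      select name, varnum
      from dictionary.columns
      where libname = upcase("&lib")
        and memname = upcase("&mem")
        and upcase(name) ne upcase("&key")
      order by varnum;
    select count(*) into :n_cols from _colstemp;
  quit;

  /* Round up so a partial last chunk gets its own dataset */
  %let n_parts = %sysevalf(&n_cols / &chunk, ceil);

  %do i = 1 %to &n_parts;
    %let first = %eval((&i - 1) * &chunk + 1);
    %let last  = %eval(&i * &chunk);

    /* Names of the variables in rows FIRST..LAST of _COLSTEMP */
    proc sql noprint;
      select name into :vars separated by ' '
      from _colstemp(firstobs=&first obs=&last);
    quit;

    data &prefix&i;
      set &lib..&mem(keep=&key &vars);
    run;
  %end;
%mend split_by_cols;

/* Example call on the sample data: key Name, 2 other variables per dataset */
%split_by_cols(lib=SASHELP, mem=CLASS, key=Name, chunk=2)
```

For the 1200-variable case, the defaults (chunk=249) give datasets of at most 250 columns including the key.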
