
Split SAS datasets by column with primary key

I have a dataset with one primary key, unique_id, and 1200 variables. The dataset is generated by a macro, so the number of columns is not fixed. I need to split it into 4 or more datasets of 250 variables each, and each of these smaller datasets should contain the primary key so that I can merge them back later. Can somebody help me with either a SAS function or a macro to solve this? Thanks in advance.

A simple way to split a dataset in the way you request is to use a single data step with multiple output datasets, each with a KEEP= dataset option listing the variables to keep. For example:

data split1(keep=Name Age Height) split2(keep=Name Sex Weight);
  set sashelp.class;
run;

So you need to get the list of variables and group them into sets of 250 or fewer. Then you can use those groupings to generate code like the step above. Here is one method that uses PROC CONTENTS to get the list of variables and CALL EXECUTE() to generate the code.

I will use macro variables to hold the name of the input dataset, the key variable that must be kept in each dataset, and the maximum number of other variables to keep in each dataset.

For the example above, those macro variable values would be:

%let ds=sashelp.class;
%let key=name;
%let nvars=2;

Use PROC CONTENTS to get the list of variable names:

proc contents data=&ds noprint out=contents; run;

Now run a data step to split the names into groups and generate a member name for each new split dataset. Make sure to exclude the key variable from the list when counting.

data groups;
  length group 8 memname $41 varnum 8 name $32 ;
  group +1;
  memname=cats('split',group);
  do varnum=1 to &nvars while (not eof);
    set contents(keep=name where=(upcase(name) ne %upcase("&key"))) end=eof;
    output;
  end;
run;

Now you can use that dataset to drive the generation of the code:

data _null_;
  set groups end=eof;
  by group;
  if _n_=1 then call execute('data ');
  if first.group then call execute(cats(memname,'(keep=&key'));
  call execute(' '||trim(name));
  if last.group then call execute(') ');
  if eof then call execute(';set &ds;run;');
run;

Here are the results from the SAS log:

NOTE: CALL EXECUTE generated line.
1    + data
2    + split1(keep=name
3    +  Age
4    +  Height
5    + )
6    + split2(keep=name
7    +  Sex
8    +  Weight
9    + )
10   + ;set sashelp.class;run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.SPLIT1 has 19 observations and 3 variables.
NOTE: The data set WORK.SPLIT2 has 19 observations and 3 variables.
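
Since the question mentions merging the pieces back together later, here is a minimal sketch of that step. It assumes the split1 and split2 datasets and the &key macro variable from the example above; the output name merged is just a placeholder. MERGE with a BY statement expects each input to be sorted by the key, so the sketch sorts first.

```sas
/* Sort each piece by the key; MERGE with BY needs sorted input */
proc sort data=split1; by &key; run;
proc sort data=split2; by &key; run;

/* Match-merge on the key to rebuild the wide dataset */
data merged;
  merge split1 split2;
  by &key;
run;
```

The variables in the merged dataset follow the order of the pieces, which is not necessarily the column order of the original dataset.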

Here is another way of doing it, using macro variables:

/* Number of columns you want in each chunk */

%let vars_per_part = 250;

/* Get all the column names into a dataset */

proc contents data = have out=cols noprint;
run;

%macro split(part);

  /* Select this part's chunk of column names (250 per part) into a macro variable */

  %let fobs = %eval((&part - 1)* &vars_per_part + 1);
  %let obs  = %eval(&part * &vars_per_part);
  proc sql noprint;
    select name into :cols separated by " " from cols (firstobs =  &fobs obs = &obs) where name ~= "uniq_id";
    quit;

  /* Chunk up the data, keeping only those variables and the uniq_id */
  data want_part&part;
    set have (keep = &cols uniq_id);
  run;

%mend;

/* Run this from 1 up to however many parts are needed to cover all the columns */

%split(1);
%split(2);
%split(3);
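
Rather than hard-coding the calls, the number of parts can be derived from the column list. The following is only a sketch: it assumes the cols dataset, the %split macro and &vars_per_part defined above, and the names split_all and nparts are made up for illustration. The count includes the key's own row, so it can overestimate the number of parts but should not miss one; if the last call finds no columns left, that will show up in the log.

```sas
%macro split_all;
  %local nparts i;

  /* COLS has one row per variable; derive how many parts are needed */
  proc sql noprint;
    select ceil(count(*) / &vars_per_part) into :nparts trimmed from cols;
  quit;

  /* Call %split once per part */
  %do i = 1 %to &nparts;
    %split(&i)
  %end;
%mend split_all;

%split_all
```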

This is not a complete solution, but it may give you another angle on the problem. The previous solutions rely heavily on PROC CONTENTS and the data step; I would solve this with PROC SQL and dictionary.columns instead. I would create a macro that splits the original file into as many parts as needed, 250 columns each (249 variables plus the key). The steps, roughly:

/* libname and memname are stored in uppercase in dictionary.columns */
proc sql; create table _colstemp as select * from dictionary.columns where libname='YOUR_LIBRARY' and memname='YOUR_TABLE' and name ne 'YOUR_PRIMARY_KEY'; quit;

Somewhere along the way, count the number of datasets needed:

proc sql; 
    select ceil(count(*)/249) into :num_of_datasets from _colstemp; 
    select count(*) into :num_of_cols from _colstemp; 
quit;

Then loop over the column list and build each split from the original dataset:

/* This %do loop must sit inside a macro definition */
%do _i = 1 %to &num_of_datasets;

proc sql noprint;
    /* obs= is the last row number to read, not a row count */
    select name into :vars separated by ','
    from _colstemp(firstobs=%eval((&_i. - 1)*249 + 1) obs=%sysfunc(min(%eval(&_i. * 249), &num_of_cols.)));
quit;

proc sql;
    create table split_&_i. as
    select YOUR_PRIMARY_KEY, &vars from YOUR_ORIGINAL_TABLE;
quit;
%end;

Hopefully this gives you another idea. The solution is not tested and may contain some pseudocode elements, as it is written from memory. It also lacks the macro declaration and much of the parametrization one could do. Adding those would make the solution more general (for example, parametrize the number of variables per dataset, the primary key name, and the dataset names).
