About data merge: the in= option in SAS
I'm a little confused by what the following in= step does.
Here is the code:
data data1;
merge data2 data3 (in=inb);
by ID;
if inb;
run;
I would really appreciate it if someone could tell me what the in=inb here does.
DS_A         DS_B
ID  VAR1     ID  VAR2
A   X        A   X
B   X        B   X
C   X        D   X
data want;
merge ds_a ds_b;
by id;
run;
will produce this:
WANT:
ID  VAR1  VAR2
A   X     X
B   X     X
C   X
D         X
If you add the IN= option, you create a temporary variable that is 1 when the current observation has a contribution from that dataset and 0 otherwise. The variable exists only while the data step runs and is not written to the output dataset; the FRS and SCN columns below are shown for illustration only:
DS_A         DS_B
ID  VAR1     ID  VAR2
A   X        A   X
B   X        B   X
C   X        D   X
data want;
merge ds_a (in=frs) ds_b (in=scn);
by id;
run;
WANT:
ID  VAR1  VAR2  FRS  SCN
A   X     X     1    1
B   X     X     1    1
C   X           1    0
D         X     0    1
So you can use these temporary variables to keep the observations that come from one dataset, from both, or from only one of them:
if frs;               ---> keeps ID = A, B, C
if scn;               ---> keeps ID = A, B, D
if frs and scn;       ---> keeps ID = A, B
if frs and not scn;   ---> keeps ID = C
etc.
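The filters above can be combined in one step. The following is a minimal sketch, assuming the DS_A and DS_B tables shown earlier; the output dataset names (both, only_a, only_b) and the copies in_a and in_b are hypothetical names chosen for illustration. Because IN= variables are temporary, they are assigned to ordinary variables here so that they show up in the output.

```sas
/* Sample data matching the DS_A and DS_B tables above (already sorted by id) */
data ds_a;
  input id $ var1 $;
  datalines;
A X
B X
C X
;
run;

data ds_b;
  input id $ var2 $;
  datalines;
A X
B X
D X
;
run;

/* Route each merged observation by where it came from */
data both only_a only_b;
  merge ds_a (in=frs) ds_b (in=scn);
  by id;
  in_a = frs;   /* hypothetical permanent copies of the temporary flags */
  in_b = scn;
  if frs and scn then output both;      /* ID present in both datasets */
  else if frs then output only_a;       /* ID present in ds_a only     */
  else output only_b;                   /* ID present in ds_b only     */
run;
```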
One other aspect of the behaviour of the in= option that I don't think anyone else has mentioned: if you merge two different datasets using the same in= variable for both, and a row is in one but not the other, a value of 1 takes precedence over a value of 0. E.g.:
data test;
merge sashelp.class(where = (sex = 'F') in = a)
sashelp.class(where = (sex = 'M') in = a);
by name;
put _all_;
run;
In this case, a = 1 for every row, even though each row is only present in one of the input datasets.
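For contrast, here is a sketch of the same merge with a separate in= variable per input; the names in_f and in_m are hypothetical. With distinct flags, the log should show which input each row actually came from rather than a shared flag that is always 1.

```sas
/* Same merge as above, but each input gets its own IN= flag */
data test2;
  merge sashelp.class(where = (sex = 'F') in = in_f)
        sashelp.class(where = (sex = 'M') in = in_m);
  by name;
  put name= in_f= in_m=;   /* write the flags to the log for each row */
run;
```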
According to the Syntax section of the MERGE statement documentation, the data sets you are merging can have options. In this case you are using the IN= data set option. Below is the explanation of this option:
Creates a Boolean variable that indicates whether the data set contributed data to the current observation.
So in this case, you are naming this Boolean variable inb.
Because the option (in=inb) comes after data3, it refers to that dataset. Hence, while data1 (the final dataset) is being built, you will have a Boolean variable that is 1 if the observation was present in data3 and 0 otherwise.
Data2    Data3
ID       ID
A        A
B        B
C        D
You will have (INB is shown for illustration; it is not stored in the output):
Data1 (before the IF statement)
ID  INB
A   1
B   1
C   0
D   1
Adding the statement if INB; keeps only the observations with INB=1 (the observations coming from data3):
Data1
ID
A
B
D
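To check this yourself, here is a small reproduction; it is a sketch that assumes data2 and data3 hold only the ID values from the tables above. The PUT statement writes the flag to the log before the subsetting IF runs, so the row for C appears in the log with inb=0 even though it is dropped from data1.

```sas
/* Inputs matching the Data2 and Data3 tables above (already sorted by id) */
data data2;
  input id $;
  datalines;
A
B
C
;
run;

data data3;
  input id $;
  datalines;
A
B
D
;
run;

data data1;
  merge data2 data3 (in=inb);
  by id;
  put id= inb=;   /* log the flag for every merged row, including dropped ones */
  if inb;         /* keep only rows that data3 contributed to */
run;
```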
Functionally,
merge data2 data3 (in=inb);
by ID;
if inb;
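is the same as a right join in SQL, at least when ID is not repeated in both datasets (a data step merge does not build the many-to-many combinations that a SQL join would).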
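As a rough illustration of that comparison, the sketch below runs the merge and a PROC SQL right join side by side. The sample tables ds_a and ds_b, their variables, and the output names want_merge and want_sql are assumptions made for the example, with ID unique in each table.

```sas
/* Hypothetical sample data, sorted by id and unique on id */
data ds_a;
  input id $ var1 $;
  datalines;
A X
B X
C X
;
run;

data ds_b;
  input id $ var2 $;
  datalines;
A X
B X
D X
;
run;

/* Data step version: keep rows that ds_b contributed to */
data want_merge;
  merge ds_a ds_b (in=inb);
  by id;
  if inb;
run;

/* SQL version: every row of ds_b, plus matching values from ds_a */
proc sql;
  create table want_sql as
  select b.id, a.var1, b.var2
  from ds_a as a
       right join ds_b as b
       on a.id = b.id
  order by b.id;
quit;
```

Both outputs should contain IDs A, B and D, with VAR1 missing for D.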
Technically, "inb" is a 0/1 flag that is set to 1 for each record found in data3. "if inb" is shorthand for "if inb is true [then keep the record]", and for numeric values "true" means any value other than 0 or missing.
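How SAS treats numeric values as conditions can be checked with a quick sketch like the one below, which tests a few sample values (chosen arbitrarily for illustration) and writes the result to the log. In SAS a numeric value counts as false when it is 0 or missing, and as true otherwise.

```sas
/* Test a few sample values as conditions and report the result in the log */
data _null_;
  do x = ., 0, 1, -1, 0.5;
    if x then put x= "is treated as true";
    else put x= "is treated as false";
  end;
run;
```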