Choosing between MERGE and SET for combining data in SAS

I have a general question on methodology. How do I know whether match-merging (MERGE) or interleaving (SET) is better for combining datasets? If I have two related datasets that seem to share many, but not all, of the same variables, and I don't know whether the information in those variables is the same, which is better?

Is there a general rule for deciding which is better?

Thanks for your advice.

There really isn't a good answer to this question; "merging" and "interleaving" do fundamentally different things. Take a few minutes to read the example in the SAS Concepts manual.

I think that question is very specific to your data and what you are trying to achieve. You shouldn't combine the datasets at all until you know enough about the data to know whether you can stack them (SET) or want to match-merge them (MERGE). There cannot be a general rule, because it simply depends on your data. Suppose I had two datasets:

data have_1;
input x y;
datalines;
1 2
2 3
3 4
;;;;
run;

data have_2;
input x y z;
datalines;
1 2 3
2 3 4
3 4 5 
;;;;
run;

You could guess that have_1 and have_2 contain the same observations, with have_2 simply adding a variable z; but they could just as easily be different observations. If I told you that x was the unique identifier, you would suspect these are the same records; but if I told you that x and y were qualitative features, they could easily be different observations that happen to be qualitatively similar.
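
To make the two readings concrete, here is a minimal sketch of both approaches applied to have_1 and have_2. The output names stacked and matched are made up for illustration, and the code assumes both datasets are already sorted by x (they are, as entered above).

```sas
/* Reading 1: the rows are different observations.            */
/* Interleave them with SET + BY; inputs must be sorted by x. */
data stacked;
  set have_1 have_2;
  by x;
run;
/* stacked has 6 observations; z is missing on rows from have_1. */

/* Reading 2: the rows are the same observations and x is a   */
/* unique key. Match-merge them with MERGE + BY.              */
data matched;
  merge have_1 have_2;
  by x;
run;
/* matched has 3 observations with x, y, and z. Because both  */
/* inputs carry y, the value from the last dataset listed     */
/* (have_2) overwrites the one from have_1.                   */
```

Neither result is "right" on its own: stacked is only meaningful if the rows really are distinct observations, and matched is only meaningful if x really identifies a record in both datasets.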

The point here: know your data before doing anything with it. If you don't know your data, you shouldn't be working with it in the first place.
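
One way to act on this is to test your assumptions before combining anything. The sketch below is only one possible set of checks; it assumes x is the candidate key, and the dataset names sorted_1, sorted_2, dups_1, dups_2, and check are hypothetical.

```sas
/* Is x unique within each dataset? Rows with a repeated x    */
/* are written to the DUPOUT= dataset.                        */
proc sort data=have_1 out=sorted_1 nodupkey dupout=dups_1;
  by x;
run;

proc sort data=have_2 out=sorted_2 nodupkey dupout=dups_2;
  by x;
run;

/* Do the shared variables agree for matching values of x?    */
proc compare base=sorted_1 compare=sorted_2;
  id x;
  var y;
run;

/* Which keys appear in only one of the two datasets?         */
data check;
  merge sorted_1(in=in1) sorted_2(in=in2);
  by x;
  only_1 = in1 and not in2;  /* 1 if x occurs only in have_1  */
  only_2 = in2 and not in1;  /* 1 if x occurs only in have_2  */
run;
```

If dups_1 and dups_2 come back empty and PROC COMPARE reports no differences in y, a match-merge by x is at least plausible. If the keys repeat or the shared values disagree, that is a sign the datasets hold different observations, or that x is not the identifier you assumed.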
