
Reading in only part of a Stata .DTA file in R

I apologize in advance if this has a simple answer somewhere. It seems like the kind of thing that would, but I can't find it in the help files, by searching SO, or by Googling.

I'm working with some datasets that are several GB each right now. They fit in memory on one of the cluster nodes I have access to, but take quite a bit of time to load. For many debugging/programming activities with this data, I don't need the entire file loaded, just the first few thousand observations, so that I have a dataset on which to test code. I can of course just read the whole file in and subset it, but I was wondering if there's a way to tell read.dta() to read in only the first N rows? That would of course be much faster.

I could also use a proper format like .csv and then use read.csv()'s nrows argument, but then I'd lose the factor labels in the Stata dataset (and have to recreate quite a few GB of data from someone else's code that's feeding into this project). So a direct solution on .dta files is preferred.
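To make the trade-off concrete, here is a minimal sketch of the .csv route. The data frame, column names, and temporary path are made-up stand-ins, not the real project data. It shows that nrows does limit the read, but that the factor information does not survive the round trip through plain text.

```r
# Toy stand-in for the real data (assumption: a labelled categorical column).
example <- data.frame(
  id    = 1:10000,
  group = factor(rep(c("low", "high"), 5000), levels = c("low", "high"))
)

# Round-trip through .csv (temporary file used only for this illustration).
csv_path <- tempfile(fileext = ".csv")
write.csv(example, csv_path, row.names = FALSE)

# nrows stops reading after the first 2000 data rows.
first_rows <- read.csv(csv_path, nrows = 2000)

nrow(first_rows)          # number of rows actually read
class(first_rows$group)   # no longer the original factor with its level order
```

Depending on the R version, the group column comes back either as character or as a factor with alphabetically sorted levels, so in both cases the original coding has to be rebuilt by hand.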

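Stata's binary files are written row by row, so you could change the R_LoadStataData function in stataread.c to limit the number of rows read in. However, this only works if you don't need the value labels: they are written at the end of the file, so getting them would require reading through the whole file, which would not save any time.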

That's going to be a difficult one, as the do_readStata function under the hood is compiled code that can only take in the whole file. I believe binary files are in general hard to read line by line, and .dta is a binary format. R's native binary format also doesn't allow selecting a number of lines from the dataset while reading it in.

In my humble opinion, you are better off creating a set of test files from within Stata (e.g. the Stata command sample 1000, count gives you a sample of 1000 observations from the loaded dataset) and working with those. And if you have no access to Stata, someone else on the project should be able to do that for you.
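If Stata is not at hand, the same one-off step can be sketched from R with the foreign package: read the full file once, draw 1000 observations, and write them to a small .dta file that is quick to load afterwards. The toy data frame and temporary paths below are placeholders standing in for the real files.

```r
library(foreign)  # recommended package that ships with R

# Stand-in for the real multi-GB file (assumption: toy data, temporary path).
full_path <- tempfile(fileext = ".dta")
write.dta(data.frame(id = 1:5000,
                     group = factor(rep(c("a", "b"), 2500))),
          full_path)

# Slow step on real data, but it only has to be done once.
full <- read.dta(full_path)

# Draw 1000 observations, keeping them in their original order.
set.seed(1)
keep <- sort(sample(nrow(full), 1000))

# Write the sample to a small test file in Stata format.
test_path <- tempfile(fileext = ".dta")
write.dta(full[keep, ], test_path)

# Later sessions load only the small file; factor labels are preserved.
small <- read.dta(test_path)
```

Because the test file is itself a .dta file, the factor labels come along with it, which is the property the .csv route would lose.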

To follow up on Joris Meys: For this kind of thing, I use a "test" data set and the "real" data set, each in separate folders. I keep a macro at the top of the .do file (with if/then statements below) to (1) take a sample of the data and (2) point input/output to the right folder containing one or the other. I probably do it differently for every project, but something like this:

data creation .do file

// ... data preparation steps ...
save data/myfile.dta
// keep a ~5% random sample for the test copy (or bsample, then save, for panel data)
keep if uniform()<.05
save test_data/myfile.dta

analysis .do file

local test = "test_"   
// when you're ready to run the file with all the data, use the following 
// local test = ""

use `test'data/myfile.dta
// ... analysis commands ...
outreg2 ... using `test'output/mytable.txt
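The same switch can be sketched on the R side. This is only an illustration under the assumption that the folder layout above (data/ and test_data/, output/ and test_output/) is reused; the variable names are made up.

```r
# Flip to FALSE when you're ready to run with all the data.
use_test <- TRUE

# Prefix mirrors the `test' macro in the .do file above.
prefix <- if (use_test) "test_" else ""

# Input and output locations follow the prefix.
data_path   <- file.path(paste0(prefix, "data"), "myfile.dta")
output_path <- file.path(paste0(prefix, "output"), "mytable.txt")

# The data would then be loaded with, e.g., foreign::read.dta(data_path).
print(data_path)
print(output_path)
```

Keeping the switch in a single variable at the top of the script means the rest of the analysis code never needs to know which copy of the data it is running on.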

