Reading in only part of a Stata .DTA file in R

Question

I apologize in advance if this has a simple answer somewhere. It seems like the kind of thing that would, but I can't seem to locate it in the help files, by searching SO, or by Googling.

I'm working with some datasets that are several GB right now. It's enough to fit in memory on one of the cluster nodes I have access to, but takes quite a bit of time to load. For many debugging/programming activities with this data, I don't need the entire file loaded, just the first few thousand observations to have a dataset on which to test code. I can of course just read the whole file in and subset, but I was wondering if there's a way to tell read.dta() to only read in the first N rows? This would of course be much faster.

I could also use a proper format like .csv and then use read.csv()'s nrows argument, but then I'd lose the factor labels in the Stata dataset (and have to recreate quite a few GB of data from someone else's code that's feeding in to this project). So a direct solution on .dta files is preferred.

Answer 1

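Stata's binary files are written row by row, so you could change the R_LoadStataData function in stataread.c to limit the number of rows read in. However, this would only work if you didn't need the value labels: they are written at the end of the file, so getting them would require reading the entire file, which wouldn't save you any time.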

Answer 2

That's going to be a difficult one, as the do_readStata function under the hood is compiled code that is only capable of taking in the whole file. I believe that binary files in general are hard to read line by line, and .dta is a binary format. R's native binary format doesn't allow selecting a number of lines from the dataset while reading in, either.

In my humble opinion, you are better off creating a set of test files from within Stata (e.g. the Stata command sample 1000, count will give you a sample of 1000 observations from the loaded dataset) and working with those. And if you have no access to Stata, someone else in the project should be able to do that for you.
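If nobody has Stata at hand, roughly the same idea can be sketched from the R side. This is only an illustration under some assumptions: make_test_dta is a hypothetical helper name, the paths are placeholders, and it relies on read.dta() and write.dta() from the foreign package (write.dta() writes factors back out as Stata value labels by default, but it targets an older .dta version, so check that variable names and labels survive the round trip). It pays the full load time once and reads only the small file afterwards.

```r
library(foreign)  # provides read.dta() and write.dta()

# Hypothetical helper (not part of any package): build a small test file
# from a large .dta file once, then reuse the small file on later calls.
make_test_dta <- function(full_path, test_path, n = 1000) {
  if (!file.exists(test_path)) {
    full  <- read.dta(full_path)   # slow step: reads the whole file
    small <- head(full, n)         # keep only the first n observations
    write.dta(small, test_path)    # factors are stored as value labels
  }
  read.dta(test_path)              # fast: only the small file is read
}
```

A call might look like dat <- make_test_dta("data/myfile.dta", "test_data/myfile.dta", n = 5000), with both paths being placeholders.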

Answer 3

To follow up on Joris Meys: For this kind of thing, I use a "test" data set and the "real" data set, each in separate folders. I keep a macro at the top of the .do file (with if/then statements below) to (1) take a sample of the data and (2) point input/output to the right folder containing one or the other. I probably do it differently for every project, but something like this:

data creation .do file

// ... data preparation steps ...
save data/myfile.dta, replace
keep if uniform() < .05            // or use bsample for panel data
save test_data/myfile.dta, replace

analysis .do file

local test = "test_"   
// when you're ready to run the file with all the data, use the following 
// local test = ""

use `test'data/myfile.dta
// ... analysis steps ...
outreg2 ... using `test'output/mytable.txt
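
Since the question is about R, the same switch can be sketched on the R side as well. This is a hypothetical translation of the macro trick above, not something from the original do-files: data_path is a made-up helper, and the folder names simply mirror the ones used in the Stata snippet.

```r
# Hypothetical R counterpart of the `test' macro switch above.
test_prefix <- "test_"   # set to "" when ready to run on the full data

# Build the path to a data file inside either data/ or test_data/.
data_path <- function(prefix, file) {
  file.path(paste0(prefix, "data"), file)
}

dta_file <- data_path(test_prefix, "myfile.dta")
# dat <- foreign::read.dta(dta_file)   # uncomment once the file exists
```

Flipping test_prefix to "" is then the only change needed to point the script at the full dataset.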
