
Using memisc to import a Stata .dta file into R

I have a 700 MB Stata .dta file with 28 million observations and 14 variables.

When I attempt to import it into R using the foreign package's read.dta() function, I run out of RAM on my 8 GB machine (page-outs quickly shoot into the gigabytes).

staph <- read.dta("Staph_1999_2010.dta")

I hunted around and it sounds like a more efficient alternative would be to use the Stata.file() function from the memisc package.

When I call:

staph <- Stata.file("Staph_1999_2010.dta")

I get a segfault:

*** caught segfault ***
address 0xd5d2b920, cause 'memory not mapped'

Traceback:
 1: .Call("dta_read_labels", bf, lbllen, padding)
 2: dta.read.labels(bf, len.lbl, 3)
 3: get.dictionary.dta(dta)
 4: Stata.file("Staph_1999_2010.dta")

I find the documentation for Stata.file() difficult to follow.

(1) Am I using Stata.file() correctly?

(2) Does Stata.file() return a data frame like read.dta() does?

(3) If I'm using Stata.file() correctly, how can I fix the error I'm getting?

If you have access to Stata, one solution is to export the .dta file to .csv from within Stata:

use "file.dta"

export delimited using "file.csv", replace

Then import the .csv into R using read.csv() or data.table::fread().
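On the R side, supplying explicit column types keeps memory down by avoiding type guessing. A minimal sketch using a small stand-in file (in practice you would read the exported file.csv; the column names here are hypothetical):

```r
# Stand-in for the large export from Stata; in practice you would
# point read.csv()/fread() at "file.csv" directly.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, year = c(1999L, 2001L, 2005L, 2010L, 2010L)),
          tmp, row.names = FALSE)

# colClasses skips per-column type guessing and stores the columns
# compactly; data.table::fread(tmp) accepts the same argument and is
# considerably faster on large inputs.
staph <- read.csv(tmp, colClasses = c(id = "integer", year = "integer"))
```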

Other ideas:

  • Consider sampling a subset of the data using Stata's sample command.
  • Stata's compress command attempts lossless compression by changing
    variable types (though I'm not sure it would save much once exported to .csv).
  • Pack the data tightly by converting dates or string IDs to integers where possible.
  • Use a cloud instance for a one-time import and initial cleansing, then sample or keep only the important parts.
  • Get more RAM...
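As a sketch of the "pack the data tight" idea: dates stored as text and string IDs can both be recoded as integers after import (the column names below are hypothetical):

```r
# Hypothetical columns standing in for the real data.
df <- data.frame(admit_date = c("1999-01-03", "2010-12-30"),
                 hosp_id    = c("A17", "B09"),
                 stringsAsFactors = FALSE)

# Dates become integer day counts since 1970-01-01; string IDs become
# small integer factor codes. Each value then takes 4 bytes instead
# of a full character string.
df$admit_date <- as.integer(as.Date(df$admit_date))
df$hosp_id    <- as.integer(factor(df$hosp_id))
```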
