Error reading Stata data in R

I am trying to read a Stata dataset in R with the foreign package, but when I try to read the file using:

library(foreign)
data <- read.dta("data.dta")

I got the following error:

Error in read.dta("data.dta") : a binary read error occurred

The file works fine in Stata. This site suggests saving the file in Stata without labels and then reading it into R. With this workaround I am able to load the file into R, but then I lose the labels. Why am I getting this error, and how can I read the file into R with the labels?

Another person finds that they get this error when they have variables with no values. My data do have at least one or two such variables, but I have no easy way to identify those variables in Stata. It is a very large file with thousands of variables.

You should call library(foreign) before reading the Stata data.

library(foreign)
data <- read.dta("data.dta")

Updates: As mentioned here,

"The error message implies that the file was found, and that it started with the right sequence of bytes to be a Stata .dta file, but that something (probably the end of the file) prevented R from reading what it was expecting to read."

But without further information we might just be guessing.
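One cheap sanity check before guessing further is to look at the first bytes of the file. The sketch below uses a hypothetical helper, `dta_format_hint`, and rests on an assumption about the .dta layout: older formats store the format number in the first byte (for example 114 or 115), while newer ones begin with the text `<stata_dta>`. The documentation for read.dta says it reads Stata version 5-12 files, so a newer header would point to a format problem rather than a truncated file.

```r
# Inspect the first bytes of a .dta file to guess its format.
# Assumption: older formats keep the format number in byte 1,
# newer ones start with the literal text "<stata_dta>".
dta_format_hint <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  head_bytes <- readBin(con, what = "raw", n = 11L)
  if (length(head_bytes) == 0L) {
    return("empty file")
  }
  # Compare raw bytes directly; old headers can contain NUL bytes,
  # which rawToChar() would reject.
  if (identical(head_bytes, charToRaw("<stata_dta>"))) {
    return("XML-style header (format 117 or later)")
  }
  paste("format number in first byte:", as.integer(head_bytes[1]))
}
```

Calling `dta_format_hint("data.dta")` on the problem file would show which of the two cases applies; it does not by itself explain a binary read error.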

Update to OP's question and answer:

I have tested whether that is the case using the auto data from Stata, and it is not. So there must be other reasons:

*Claims 1 and 2: if a variable has missing values, or the dataset has labels, R's read.dta will generate the error*

sysuse auto      // this dataset has labels
replace mpg=.    // sets every value of mpg to missing
br in 1/10
make    price   mpg rep78   headroom    trunk   weight  length  turn    displacement    gear_ratio  foreign
AMC Concord 4099        3   2.5 11  2930    186 40  121 3.58    Domestic
AMC Pacer   4749        3   3.0 11  3350    173 40  258 2.53    Domestic
AMC Spirit  3799            3.0 12  2640    168 35  121 3.08    Domestic
Buick Century   4816        3   4.5 16  3250    196 40  196 2.93    Domestic
Buick Electra   7827        4   4.0 20  4080    222 43  350 2.41    Domestic
Buick LeSabre   5788        3   4.0 21  3670    218 43  231 2.73    Domestic
Buick Opel  4453            3.0 10  2230    170 34  304 2.87    Domestic
Buick Regal 5189        3   2.0 16  3280    200 42  196 2.93    Domestic
Buick Riviera   10372       3   3.5 17  3880    207 43  231 2.93    Domestic
Buick Skylark   4082        3   3.5 13  3400    200 42  231 3.08    Domestic

save "~myauto"
de(myauto)

Contains data from ~\myauto.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          25 Aug 2013 11:32
 size:         3,478 (99.9% of memory free)   (_dta has notes)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
headroom        float  %6.1f                  Headroom (in.)
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.)
displacement    int    %8.0g                  Displacement (cu. in.)
gear_ratio      float  %6.2f                  Gear Ratio
foreign         byte   %8.0g       origin     Car type
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:  foreign


library(foreign)
myauto <- read.dta("myauto.dta")  # works perfectly
str(myauto)
'data.frame':   74 obs. of  12 variables:
 $ make        : chr  "AMC Concord" "AMC Pacer" "AMC Spirit" "Buick Century" ...
 $ price       : int  4099 4749 3799 4816 7827 5788 4453 5189 10372 4082 ...
 $ mpg         : int  NA NA NA NA NA NA NA NA NA NA ...
 $ rep78       : int  3 3 NA 3 4 3 NA 3 3 3 ...
 $ headroom    : num  2.5 3 3 4.5 4 4 3 2 3.5 3.5 ...
 $ trunk       : int  11 11 12 16 20 21 10 16 17 13 ...
 $ weight      : int  2930 3350 2640 3250 4080 3670 2230 3280 3880 3400 ...
 $ length      : int  186 173 168 196 222 218 170 200 207 200 ...
 $ turn        : int  40 40 35 40 43 43 34 42 43 42 ...
 $ displacement: int  121 258 121 196 350 231 304 196 231 231 ...
 $ gear_ratio  : num  3.58 2.53 3.08 2.93 2.41 ...
 $ foreign     : Factor w/ 2 levels "Domestic","Foreign": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "datalabel")= chr "1978 Automobile Data"
 - attr(*, "time.stamp")= chr "25 Aug 2013 11:23"
 - attr(*, "formats")= chr  "%-18s" "%8.0gc" "%8.0g" "%8.0g" ...
 - attr(*, "types")= int  18 252 252 252 254 252 252 252 252 252 ...
 - attr(*, "val.labels")= chr  "" "" "" "" ...
 - attr(*, "var.labels")= chr  "Make and Model" "Price" "Mileage (mpg)" "Repair Record 1978" ...
 - attr(*, "expansion.fields")=List of 2
  ..$ : chr  "_dta" "note1" "from Consumer Reports with permission"
  ..$ : chr  "_dta" "note0" "1"
 - attr(*, "version")= int 12
 - attr(*, "label.table")=List of 1
  ..$ origin: Named int  0 1
  .. ..- attr(*, "names")= chr  "Domestic" "Foreign"
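The str() output above shows that the labels survive the import as attributes on the data frame. A minimal sketch of pulling them out, using a hypothetical helper `variable_labels` and a small stand-in data frame instead of the real file:

```r
# Pair each column name with its Stata variable label.
# Relies on the "var.labels" attribute that read.dta() attaches,
# as seen in the str() output.
variable_labels <- function(df) {
  labels <- attr(df, "var.labels")
  if (is.null(labels)) {
    return(NULL)
  }
  stats::setNames(labels, names(df))
}

# Stand-in for a data frame returned by read.dta() (made-up subset)
demo <- data.frame(price = c(4099L, 4749L),
                   mpg = c(NA_integer_, NA_integer_))
attr(demo, "var.labels") <- c("Price", "Mileage (mpg)")
print(variable_labels(demo))
```

With the real data this would be `variable_labels(myauto)`. Value labels are handled separately: they appear as factor levels (see `foreign`) and in the "label.table" attribute.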

Here's a list of things to try. My guess is that the first item has a 75% likelihood of solving your issue.

  1. In Stata, resave a fresh copy of your dta file with saveold, and try again.
  2. If that fails, provide a sample to show what kind of values kill the read.dta function.
  3. If missing values are to blame, run the loop from the other answer.

A more thorough description of the dataset would be required to work past that point. The issue seems fixable; I've never had much trouble using foreign with tons of Stata files.

You might also try the Stata.file function in the memisc package to see if that fails too.
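Trying more than one reader can be wrapped in a small loop so that the error message from each attempt is kept. This is only a sketch: `read_with_fallback` is a hypothetical helper, and the readers are supplied by the caller as a named list of functions that take a file path.

```r
# Try each reader in order and return the first result that loads.
# `readers` is a named list of functions taking a file path
# (hypothetical helper, not part of foreign or memisc).
read_with_fallback <- function(path, readers) {
  for (name in names(readers)) {
    result <- tryCatch(readers[[name]](path), error = function(e) e)
    if (!inherits(result, "error")) {
      message("read succeeded with: ", name)
      return(result)
    }
    # Report why this reader failed before moving on to the next one
    message(name, " failed: ", conditionMessage(result))
  }
  stop("no reader could load ", path)
}
```

In practice the list might start with `foreign = foreign::read.dta`, followed by a wrapper around `memisc::Stata.file`. How the memisc importer is turned into a data frame is left to that package's documentation.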

I do not know why this occurs, and would be interested if anyone could explain, but read.dta indeed cannot handle variables that are all NA. A solution is to drop such variables in Stata with the following code:

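foreach varname of varlist * {
 // count non-missing values; unlike summarize, which reports N = 0
 // for every string variable, this works for strings and numerics
 quietly count if !missing(`varname')
 if r(N)==0 {
  drop `varname'
  disp "dropped `varname' because all of its values are missing"
 }
}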
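If the file can be loaded into R without labels (the workaround mentioned in the question), the same check can be done on the R side to find out which variables are entirely missing. A minimal sketch, with a hypothetical helper `all_missing_columns` and a made-up demo data frame:

```r
# Return the names of columns in which every value is missing.
# Empty strings count as missing for character columns, mirroring
# how Stata treats "" in string variables.
all_missing_columns <- function(df) {
  is_empty <- vapply(df, function(col) {
    if (is.character(col)) {
      all(is.na(col) | col == "")
    } else {
      all(is.na(col))
    }
  }, logical(1))
  names(df)[is_empty]
}

# Made-up example: mpg and note contain no usable values
demo <- data.frame(
  price = c(4099L, 4749L),
  mpg = c(NA_integer_, NA_integer_),
  note = c("", ""),
  stringsAsFactors = FALSE
)
print(all_missing_columns(demo))
```

The resulting names can then be dropped in Stata before saving the labelled file again.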

It's been a long time, but I solved this same problem by exporting the .dta data to .csv. The problem was related to the labels of the factor variables, especially because the labels were in Spanish and the ASCII encoding is a mess. I hope this works for someone with the same problem who has access to Stata.

In Stata:

export delimited using "/Users/data.csv", nolabel replace

In R:

df <- read.csv("/Users/data.csv")  # same path as in the Stata export above
