Error reading Stata data in R
I am trying to read a Stata dataset in R with the foreign package, but when I try to read the file using:
library(foreign)
data <- read.dta("data.dta")
I got the following error:
Error in read.dta("data.dta") : a binary read error occurred
The file works fine in Stata. This site suggests saving the file in Stata without labels and then reading it into R. With this workaround I am able to load the file into R, but then I lose the labels.
Why am I getting this error, and how can I read the file into R with the labels?
Another person finds that they get this error when they have variables with no values. My data do have at least one or two such variables, but I have no easy way to identify those variables in Stata. It is a very large file with thousands of variables.
You should call library(foreign) before reading the Stata data.
library(foreign)
data <- read.dta("data.dta")
Update: As mentioned here,
"The error message implies that the file was found, and that it started with the right sequence of bytes to be a Stata .dta file, but that something (probably the end of the file) prevented R from reading what it was expecting to read."
But without any further information we might be just guessing.
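One way to get a little more information before guessing is to look at the first bytes of the file yourself. The sketch below is only an illustration: dta_format is a hypothetical helper (it is not part of foreign), and it assumes, as far as I know the .dta layouts, that the older binary formats start with a one-byte format release number while the newer formats start with the text <stata_dta>. It is demonstrated on a fake temporary file, not on real data.

```r
# Hypothetical helper (not part of the foreign package): report what kind of
# header a .dta file starts with.
dta_format <- function(path) {
  header <- readBin(path, what = "raw", n = 11)   # first 11 bytes at most
  if (length(header) == 0) {
    return("empty file")
  }
  # Assumption: newer .dta formats begin with the literal text "<stata_dta>"
  if (identical(header, charToRaw("<stata_dta>"))) {
    return("XML-style header (format 117 or later)")
  }
  # Assumption: older formats store the format release number in byte 1
  paste("binary header, format release", as.integer(header[1]))
}

# Demonstration on a fake file whose first byte is 114
tmp <- tempfile(fileext = ".dta")
writeBin(as.raw(c(114, 2, 1, 0)), tmp)
dta_format(tmp)
```

Comparing file.size() of the file in R with the size reported on disk is another cheap check for a truncated copy.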
Update following the OP's edited question and answer:
I tested whether that is the case using the auto dataset that ships with Stata, but it is not. So there should be other reasons:
Claims 1 and 2: if a variable contains only missing values, or the dataset has labels, R's read.dta will generate the error.
* this dataset has labels
sysuse auto
* set mpg to missing for every observation
replace mpg=.
br in 1/10
make            price  mpg  rep78  headroom  trunk  weight  length  turn  displacement  gear_ratio  foreign
AMC Concord      4099    .      3       2.5     11    2930     186    40           121        3.58  Domestic
AMC Pacer        4749    .      3       3.0     11    3350     173    40           258        2.53  Domestic
AMC Spirit       3799    .      .       3.0     12    2640     168    35           121        3.08  Domestic
Buick Century    4816    .      3       4.5     16    3250     196    40           196        2.93  Domestic
Buick Electra    7827    .      4       4.0     20    4080     222    43           350        2.41  Domestic
Buick LeSabre    5788    .      3       4.0     21    3670     218    43           231        2.73  Domestic
Buick Opel       4453    .      .       3.0     10    2230     170    34           304        2.87  Domestic
Buick Regal      5189    .      3       2.0     16    3280     200    42           196        2.93  Domestic
Buick Riviera   10372    .      3       3.5     17    3880     207    43           231        2.93  Domestic
Buick Skylark    4082    .      3       3.5     13    3400     200    42           231        3.08  Domestic
save "~\myauto"
describe
Contains data from ~\myauto.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          25 Aug 2013 11:32
 size:         3,478 (99.9% of memory free)   (_dta has notes)
--------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
headroom        float  %6.1f                  Headroom (in.)
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.)
displacement    int    %8.0g                  Displacement (cu. in.)
gear_ratio      float  %6.2f                  Gear Ratio
foreign         byte   %8.0g       origin     Car type
--------------------------------------------------------------------------------
Sorted by:  foreign
library(foreign)
myauto <- read.dta("myauto.dta")  # works without error
str(myauto)
'data.frame': 74 obs. of 12 variables:
 $ make        : chr "AMC Concord" "AMC Pacer" "AMC Spirit" "Buick Century" ...
 $ price       : int 4099 4749 3799 4816 7827 5788 4453 5189 10372 4082 ...
 $ mpg         : int NA NA NA NA NA NA NA NA NA NA ...
 $ rep78       : int 3 3 NA 3 4 3 NA 3 3 3 ...
 $ headroom    : num 2.5 3 3 4.5 4 4 3 2 3.5 3.5 ...
 $ trunk       : int 11 11 12 16 20 21 10 16 17 13 ...
 $ weight      : int 2930 3350 2640 3250 4080 3670 2230 3280 3880 3400 ...
 $ length      : int 186 173 168 196 222 218 170 200 207 200 ...
 $ turn        : int 40 40 35 40 43 43 34 42 43 42 ...
 $ displacement: int 121 258 121 196 350 231 304 196 231 231 ...
 $ gear_ratio  : num 3.58 2.53 3.08 2.93 2.41 ...
 $ foreign     : Factor w/ 2 levels "Domestic","Foreign": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "datalabel")= chr "1978 Automobile Data"
 - attr(*, "time.stamp")= chr "25 Aug 2013 11:23"
 - attr(*, "formats")= chr "%-18s" "%8.0gc" "%8.0g" "%8.0g" ...
 - attr(*, "types")= int 18 252 252 252 254 252 252 252 252 252 ...
 - attr(*, "val.labels")= chr "" "" "" "" ...
 - attr(*, "var.labels")= chr "Make and Model" "Price" "Mileage (mpg)" "Repair Record 1978" ...
 - attr(*, "expansion.fields")=List of 2
  ..$ : chr "_dta" "note1" "from Consumer Reports with permission"
  ..$ : chr "_dta" "note0" "1"
 - attr(*, "version")= int 12
 - attr(*, "label.table")=List of 1
  ..$ origin: Named int 0 1
  .. ..- attr(*, "names")= chr "Domestic" "Foreign"
Here's a solver list. My guess is that the first item has a 75% likelihood to solve your issue.
1. Resave the dta file with saveold, and try again.
2. [...] read.dta function.
A more thorough description of the dataset would be required to work past that point. The issue seems fixable; I've never had much trouble using foreign with tons of Stata files.
You might also try the Stata.file function in the memisc package to see if that fails too.
I do not know why this occurs and would be interested if anyone could explain, but read.dta indeed cannot handle variables that are all NA. A solution is to delete such variables in Stata with the following code:
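* drop every variable that has no non-missing values
* (count is used because summarize reports N = 0 for every string variable)
foreach varname of varlist * {
    quietly count if !missing(`varname')
    if `r(N)'==0 {
        drop `varname'
        disp "dropped `varname' for too much missing data"
    }
}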
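If the file can be loaded at all through the label-free workaround mentioned in the question, the all-missing variables could also be listed from the R side. This is only a sketch on a toy data frame; the column names id, empty and part are made up for illustration and stand in for the real data.

```r
# Toy data frame standing in for the dataset loaded without labels
df <- data.frame(
  id    = 1:3,
  empty = c(NA, NA, NA),   # a variable with no values at all
  part  = c(1, NA, 3)      # a variable with some missing values
)

# A column is "all missing" when it has zero non-NA entries
all_na <- names(df)[colSums(!is.na(df)) == 0]
all_na
```

The resulting names can then be pasted into a Stata drop command before saving the file again with labels.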
It's been a long time, but I solved this same problem by exporting the .dta data to .csv. The problem was related to the labels of the factor variables, especially because the labels were in Spanish and the ASCII encoding is a mess. I hope this works for someone with the same problem who has access to Stata.
In Stata:
export delimited using "/Users/data.csv", nolabel replace
In R:
df <- read.csv("/Users/data.csv")
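Since the nolabel export leaves only the numeric codes, the value labels have to be put back by hand in R if they are needed. A minimal sketch, assuming the codes and labels are copied from Stata manually; here they are the origin label of the auto data shown earlier in this thread (0 = Domestic, 1 = Foreign), and the small data frame is made up for illustration.

```r
# Numeric codes as they would arrive from a CSV exported with nolabel
df <- data.frame(foreign = c(0, 0, 1, 0))

# Value labels copied by hand from Stata (the "origin" label of the auto data)
origin <- c(Domestic = 0, Foreign = 1)

# Turn the codes back into a labelled factor
df$foreign <- factor(df$foreign, levels = origin, labels = names(origin))
df$foreign
```

For thousands of variables this would need a lookup table of labels exported from Stata separately, which is beyond this sketch.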