
Load a Stata .dta file in R and do data analysis

The main issue is that the data is loaded as factors, but if I turn factor conversion off, all the numbers change. The dataset is here: https://www.dropbox.com/s/71se6zo5ucqki8v/yrbs2013.dta?dl=0

I cannot do data analysis on this because of the "years old" text at the end of each value. In Stata, however, the "years old" suffix seems to be ignored and the data is very easy to manipulate. My question: how do I turn these factor-based text variables from "14 years old" into the numeric value 14 so I can do data analysis?

library(foreign)
yrbs <- read.dta('yrbs2013.dta', convert.factors = TRUE)
head(yrbs$Q1)
[1] 14 years old 14 years old 15 years old 15 years old 15 years old 15 years old
7 Levels: 12 years old or younger 13 years old 14 years old ... 18 years old or older

Here is the output with factors off. All the numbers have been recoded, and taking a mean of that would produce meaningless results.

yrbs <- read.dta('yrbs2013.dta', convert.factors = FALSE)
head(yrbs$Q1)
[1] 3 3 4 4 4 4   

I have also tried converting the dataset to a CSV, and the same issue appears. I am trying to avoid complicated regex splitting followed by as.numeric(), as I do not want to do that for the entire dataset.

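You can treat the factor labels as text and strip everything except the digits, like so:

library(foreign)
yrbs <- read.dta('yrbs2013.dta')
# gsub() coerces the factor to its character labels; drop every non-digit, then convert
yrbs$Q1 <- with(yrbs, as.integer(gsub("[^0-9]", "", Q1)))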

> with(yrbs, table(Q1))
Q1
  12   13   14   15   16   17   18 
  26   18 1368 3098 3203 3473 2320 

Note that this loses information: the values 12 and 18 were originally "12 years old or younger" and "18 years old or older", respectively. I'm not sure that's what you want to do.
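Since the question mentions not wanting to repeat this for every variable, the same substitution could be wrapped in a small helper and applied to a hand-picked set of columns. This is only a sketch: the data frame `toy` is a made-up stand-in for `yrbs`, and `label_to_int` and `age_cols` are hypothetical names, not part of the foreign package. Labels that contain no digits at all would come out as NA, so the column list needs to be chosen with care.

```r
# Toy stand-in for yrbs$Q1; the real levels come from the .dta file.
age_levels <- c("12 years old or younger", "13 years old", "14 years old",
                "15 years old", "16 years old", "17 years old",
                "18 years old or older")
toy <- data.frame(
  Q1 = factor(c("14 years old", "15 years old", "18 years old or older"),
              levels = age_levels)
)

# Hypothetical helper: strip every non-digit from the labels, then convert.
label_to_int <- function(x) as.integer(gsub("[^0-9]", "", as.character(x)))

# Columns to convert, listed by hand because only some labels are numeric.
age_cols <- c("Q1")
toy[age_cols] <- lapply(toy[age_cols], label_to_int)

print(toy$Q1)
```

Extending `age_cols` with other column names applies the same conversion to each of them in one step.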

I can't seem to reproduce what you're describing in Stata. Opening the file in Stata shows that Stata has only two representations of this variable:

  • as labels
  • as integer values 1:7

To convince yourself of this, in Stata try typing the following

generate Q1n = Q1 +0

It does not seem that Stata actually stores the variable as 12:18 anywhere. It's possible that Stata truncated the labels when displaying them, which made it look as if this variable was stored as 12:18.
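The same two representations can be seen on the R side. The sketch below uses a toy factor built from the seven levels shown in the question (not the real file) to illustrate that the integer codes are just positions in the level list, which is why "14 years old" shows up as 3 when factor conversion is off.

```r
# Toy factor with the same seven levels as Q1 in the question.
age_levels <- c("12 years old or younger", "13 years old", "14 years old",
                "15 years old", "16 years old", "17 years old",
                "18 years old or older")
q1 <- factor(c("14 years old", "14 years old", "15 years old"),
             levels = age_levels)

codes  <- as.integer(q1)    # positions in age_levels, i.e. values in 1:7
labels <- as.character(q1)  # the value labels themselves

print(codes)
print(labels)
```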

I would bet that the only workable approach is the one demonstrated in MichaelChirico's answer.

I'd reiterate that this variable is not numeric: it is ordered categorical, since the 18 category is really >=18 and the 12 category is really <=12. This may or may not be an issue, but you should be aware that you're coercing an ordered categorical variable to a numeric one.
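One way to keep that caveat visible, sketched here on toy data rather than the real file, is to store the extracted number alongside a flag marking the open-ended categories. The names `res` and `open_ended` are made up for illustration, and the pattern assumes the open-ended labels contain the words "younger" or "older" as they do for Q1.

```r
# Toy ordered factor with the same seven levels as Q1 in the question.
age_levels <- c("12 years old or younger", "13 years old", "14 years old",
                "15 years old", "16 years old", "17 years old",
                "18 years old or older")
q1 <- factor(c("12 years old or younger", "14 years old",
               "16 years old", "18 years old or older"),
             levels = age_levels, ordered = TRUE)

labels <- as.character(q1)
res <- data.frame(
  age        = as.integer(gsub("[^0-9]", "", labels)),  # digits only
  open_ended = grepl("younger|older", labels)           # TRUE for <=12 and >=18
)

print(res)
# A mean restricted to the categories that are exact ages:
print(mean(res$age[!res$open_ended]))
```

Whether to drop, keep, or model the open-ended categories separately depends on the analysis; the flag just makes that choice explicit.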
