[英]Recoding single variable spread across several in R
I am working with survey data that has a question about race. 我正在处理有关种族问题的调查数据。 Each race category is its own variable. 每个种族类别都是自己的变量。 Here is what I want to do: 这是我想要做的:
p.race
. 创建一个新变量p.race
。 p.race
the value of one of the eight variables for race/ethnicity (below). 为p.race
分配种族/种族的八个变量之一的值(下图)。 p.race
the value "Two or more races" in such cases. 在这种情况下,确定一个人是否标记了两个或更多种族,并将p.race
指定为“两个或更多个种族”。 p.race
the value "Hispanic or Latino" when they indicated this ethnicity. 当p.race
表示这种种族时,将其分配p.race
“西班牙裔或拉丁裔”。 p.poc
, to indicate if they are a person of color (ie, not white, including Hispanic/Latino). 创建一个新变量p.poc
,以指示它们是否是有色人物(即非白色,包括西班牙裔/拉丁裔)。 This shall be 0 or 1. 这应该是0或1。 The eight race categories are white*, black*, Asian*, AIAN*, NHPI*, some other race*, two or more races*, and Hispanic; 八个种族分类是白色*,黑色*,亚洲*,AIAN *,NHPI *,其他一些种族*,两个或更多种族*和西班牙裔; where * denotes not Hispanic or Latino ethnicity. 其中*表示不是西班牙裔或拉丁裔。
Here is what I tried so far for parsing out "Two or more races": 这是我到目前为止解析“两场或更多场比赛”的原因:
p['p.race'] <- NA # create new variable for race
# list of variable names that store a string indicating the race
## e.g., `race_white` would be either blank or contain "White, European, Middle Eastern, or Caucasian"
race.list <- c('p.race_white', 'p.race_black', 'p.race_asian', 'p.race_aian', 'p.race_nhpi', 'p.race_other')
# iterate through each record
for ( n in 1:length(p) ) {
multiflag = 0
# iterate through the race list
for ( i in race.list ) {
# if it is not blank, +1 to multiflag
if ( p$i[n] != '' ) {
multiflag <- multiflag + 1
}
}
# if multiflag was flagged more than once, assign "Two or more races" to `race`
if ( multiflag > 1 ) {
p$p.race[n] <- 'Two or more races'
}
}
When executed, it returns this error: 执行时,它返回此错误:
> Error in if (p$i[n] != "") { : argument is of length zero
And here is my poc
variable coding with error below: 这是我的poc
变量编码,错误如下:
p['p.poc'] <- 0 # create a new variable for whether they are a person of color
for ( n in 1:length(p) ) {
if ( p$p.race_black[n] == 'Black, African-American, or African'
| p$p.race_asian[n] == 'Asian or Asian-American'
| p$p.race_aian[n] == 'American Indian or Alaskan Native'
| p$p.race_nhpi[n] == 'Native Hawaiian or other Pacific Islander'
| p$p.race_other[n] == 'Other (please specify)'
| p$p.hispanic[n] == 'Yes') {
p$p.poc[n] <- 1
}
}
> Error in if (p$p.race_black[n] == "Black, African-American, or African" | :
missing value where TRUE/FALSE needed
I don't really know where to start for assigning the new race
variable one of the eight race categories without making it a very long code. 我真的不知道从哪里开始为新的race
变量分配八个种族类别中的一个而不会使它成为一个很长的代码。
If it helps, below are the survey questions: 如果有帮助,以下是调查问题:
Q1. Q1。 Do you consider yourself of Hispanic, Latino, or Spanish origin? 你认为自己是西班牙裔,拉丁裔或西班牙裔吗?
Q2. Q2。 Which race do you identify with (check all that apply)? 你确定哪个种族(检查所有适用的)?
And here is the sample output (text truncated): 这是示例输出(文本截断):
> p[264:271]
#
# p.hispanic p.race_white p.race_black p.race_asian p.race_aian p.race_nhpi p.race_other
# 1 Yes White
# 2 No White
# 3 No Black
# 4 No White Asian
# 5 Yes Some other race
And here is a dput
output: 这里是一个dput
输出:
> dput(p[264:270])
structure(list(p.hispanic = structure(c(2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", "No", "Yes"
), class = "factor"), p.race_white = structure(c(2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L,
2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("",
"White, European, Middle Eastern, or Caucasian"), class = "factor"),
p.race_black = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"Black, African-American, or African"), class = "factor"),
p.race_asian = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L), .Label = c("",
"Asian or Asian-American"), class = "factor"), p.race_aian = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("", "American Indian or Alaskan Native"
), class = "factor"), p.race_nhpi = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
p.race_other = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"Other (please specify)"), class = "factor")), .Names = c("p.hispanic",
"p.race_white", "p.race_black", "p.race_asian", "p.race_aian",
"p.race_nhpi", "p.race_other"), class = "data.frame", row.names = c(NA,
-79L))
This is not very elegant, but I think it works. 这不是很优雅,但我认为它有效。 Using loops, especially nested loops, is not very "R" since they are slow but also have side effects like cluttering your workspace. 使用循环,尤其是嵌套循环,不是很“R”,因为它们很慢,但也有副作用,如混乱你的工作区。
and you might want to change how this treats p.poc
if race is unspecified because it defaults to 1 which may not be what you want. 如果未指定种族,你可能想要改变p.poc
处理p.poc
,因为它默认为1,这可能不是你想要的。
So here is one way: 所以这是一种方式:
tmp <- lapply(1:nrow(p), function(ii) {
## this checks for columns that aren't blank or NA, takes the colname
## and strips off the prefix
tmp <- gsub('p.race_', '', names(p)[which(p[ii, -1] != '' & !is.na(p[ii, -1])) + 1])
## some special cases for > 1 race and blanks and p.poc
tmp <- ifelse(length(tmp) > 1, 'Two or more', tmp)
tmp[is.na(tmp)] <- 'Not specified'
tmp <- ifelse(p[ii, 1] %in% 'Yes', 'Hispanic or Latino', tmp)
p.poc <- (!grepl('white', tmp)) * 1
return(list(p.race = tmp, p.poc = p.poc))
})
head(do.call(rbind, tmp), 20)
# p.race p.poc
# [1,] "white" 0
# [2,] "white" 0
# [3,] "white" 0
# [4,] "white" 0
# [5,] "white" 0
# [6,] "white" 0
# [7,] "white" 0
# [8,] "white" 0
# [9,] "asian" 1
# [10,] "white" 0
# [11,] "other" 1
# [12,] "white" 0
# [13,] "white" 0
# [14,] "white" 0
# [15,] "Hispanic or Latino" 1
# [16,] "white" 0
# [17,] "white" 0
# [18,] "white" 0
# [19,] "white" 0
# [20,] "white" 0
## and combine back to the data frame
p <- cbind(p, do.call(rbind, tmp))
data: 数据:
p <- structure(list(p.hispanic = structure(c(2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", "No", "Yes"
), class = "factor"), p.race_white = structure(c(2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L,
2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("",
"White, European, Middle Eastern, or Caucasian"), class = "factor"),
p.race_black = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"Black, African-American, or African"), class = "factor"),
p.race_asian = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L), .Label = c("",
"Asian or Asian-American"), class = "factor"), p.race_aian = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("", "American Indian or Alaskan Native"
), class = "factor"), p.race_nhpi = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
p.race_other = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"Other (please specify)"), class = "factor")), .Names = c("p.hispanic",
"p.race_white", "p.race_black", "p.race_asian", "p.race_aian",
"p.race_nhpi", "p.race_other"), class = "data.frame", row.names = c(NA,
-79L))
The way my bring work, this sort of task always seems easier if the data are in a long format instead of a wide format. 我的工作方式,如果数据是长格式而不是宽格式,这种任务总是更容易。 However, this means a unique ID per response is needed - in a case like this you can just assign an integer to each row. 但是,这意味着需要每个响应的唯一ID - 在这种情况下,您可以为每行分配一个整数。
library(tidyr)
library(dplyr)
# Add individual ID to each row
p = mutate(p, id = 1:n())
Once that is done, I would do a little work to make the p.hispanic
column look more like the other race columns, put the dataset in a long format, remove all NA
/blanks, then make the two new variables. 完成后,我会做一些工作,使p.hispanic
列看起来更像其他种族列,将数据集放在长格式中,删除所有NA
/空格,然后创建两个新变量。 Once the new variables are made, they can be joined to the original. 创建新变量后,可以将它们连接到原始变量。 I use package tidyr for reshaping and dplyr for manipulation. 我使用包tidyr进行重新整形,使用dplyr进行操作。
p %>%
mutate(p.hispanic = ifelse(p.hispanic == "No", NA, "Hispanic or Latino")) %>% # change p.hispanic column
gather(category, answer, p.hispanic:p.race_other, na.rm = TRUE) %>%
filter(answer != "") %>% # get rid of blanks (if were NA would have removed in "gather")
group_by(id) %>%
# Create new variable p.race and p.pop based on rules
mutate(p.race = ifelse(n_distinct(answer) > 1, "Two or more races", answer),
p.poc = as.integer(p.race == "White, European, Middle Eastern, or Caucasian")) %>%
slice(1) %>% # take only 1 record for the duplicate id's
select(-category, - answer) %>% # remove columns that aren't needed
left_join(p, ., by = "id") %>% # join new columns with original dataset
select(-id) # remove ID column if not wanted
Once you have this dataset, you could reset the levels of p.race
with factor
if you want the levels to look a certain way. 获得此数据集后,如果希望级别以某种方式p.race
,则可以使用factor
重置p.race
的级别。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.