重新编码单个变量分布在R中的几个

Question

I am working with survey data that has a question about race. 我正在处理有关种族问题的调查数据。 Each race category is its own variable. 每个种族类别都是自己的变量。 Here is what I want to do: 这是我想要做的：

Create a new variable, p.race . 创建一个新变量p.race 。
Assign p.race the value of one of the eight variables for race/ethnicity (below). 为p.race分配种族/种族的八个变量之一的值（下图）。
Determine whether an individual marked two or more races and assign p.race the value "Two or more races" in such cases. 在这种情况下，确定一个人是否标记了两个或更多种族，并将p.race指定为“两个或更多个种族”。
Assign p.race the value "Hispanic or Latino" when they indicated this ethnicity. 当p.race表示这种种族时，将其分配p.race “西班牙裔或拉丁裔”。
Create a new variable, p.poc , to indicate if they are a person of color (ie, not white, including Hispanic/Latino). 创建一个新变量p.poc ，以指示它们是否是有色人物（即非白色，包括西班牙裔/拉丁裔）。 This shall be 0 or 1. 这应该是0或1。

The eight race categories are white*, black*, Asian*, AIAN*, NHPI*, some other race*, two or more races*, and Hispanic; 八个种族分类是白色*，黑色*，亚洲*，AIAN *，NHPI *，其他一些种族*，两个或更多种族*和西班牙裔; where * denotes not Hispanic or Latino ethnicity. 其中*表示不是西班牙裔或拉丁裔。

Here is what I tried so far for parsing out "Two or more races": 这是我到目前为止解析“两场或更多场比赛”的原因：

p['p.race'] <- NA # create new variable for race

# list of variable names that store a string indicating the race
## e.g., `race_white` would be either blank or contain "White, European, Middle Eastern, or Caucasian"
race.list <- c('p.race_white', 'p.race_black', 'p.race_asian', 'p.race_aian', 'p.race_nhpi', 'p.race_other')

# iterate through each record
for ( n in 1:length(p) ) {
  multiflag = 0

  # iterate through the race list
  for ( i in race.list ) {

    # if it is not blank, +1 to multiflag
    if ( p$i[n] != '' ) {
      multiflag <- multiflag + 1
    }
  }

  # if multiflag was flagged more than once, assign "Two or more races" to `race`
  if ( multiflag > 1 ) {
    p$p.race[n] <- 'Two or more races'
  }
}

When executed, it returns this error: 执行时，它返回此错误：

> Error in if (p$i[n] != "") { : argument is of length zero

And here is my poc variable coding with error below: 这是我的poc变量编码，错误如下：

p['p.poc'] <- 0 # create a new variable for whether they are a person of color
for ( n in 1:length(p) ) {
  if ( p$p.race_black[n] == 'Black, African-American, or African'
       | p$p.race_asian[n] == 'Asian or Asian-American'
       | p$p.race_aian[n] == 'American Indian or Alaskan Native'
       | p$p.race_nhpi[n] == 'Native Hawaiian or other Pacific Islander'
       | p$p.race_other[n] == 'Other (please specify)'
       | p$p.hispanic[n] == 'Yes') {
    p$p.poc[n] <- 1
  }
}

> Error in if (p$p.race_black[n] == "Black, African-American, or African" |  : 
  missing value where TRUE/FALSE needed

I don't really know where to start for assigning the new race variable one of the eight race categories without making it a very long code. 我真的不知道从哪里开始为新的race变量分配八个种族类别中的一个而不会使它成为一个很长的代码。

If it helps, below are the survey questions: 如果有帮助，以下是调查问题：

Q1. Q1。 Do you consider yourself of Hispanic, Latino, or Spanish origin? 你认为自己是西班牙裔，拉丁裔或西班牙裔吗？

Yes 是
No 没有

Q2. Q2。 Which race do you identify with (check all that apply)? 你确定哪个种族（检查所有适用的）？

White, European, Middle Eastern, or Caucasian 白人，欧洲人，中东人或高加索人
Black, African-American, or African 黑人，非裔美国人或非洲人
Asian or Asian-American 亚洲人或亚裔美国人
American Indian or Alaskan Native 美洲印第安人或阿拉斯加原住民
Native Hawaiian or other Pacific Islander 夏威夷原住民或其他太平洋岛民
Other (please specify) 其他（请注明）

And here is the sample output (text truncated): 这是示例输出（文本截断）：

> p[264:271]
#    
#      p.hispanic  p.race_white p.race_black p.race_asian p.race_aian p.race_nhpi p.race_other
#   1  Yes         White
#   2  No          White
#   3  No                       Black
#   4  No          White                     Asian
#   5  Yes                                                                        Some other race

And here is a dput output: 这里是一个dput输出：

> dput(p[264:270])
structure(list(p.hispanic = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", "No", "Yes"
), class = "factor"), p.race_white = structure(c(2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 
2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("", 
"White, European, Middle Eastern, or Caucasian"), class = "factor"), 
    p.race_black = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
    1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "Black, African-American, or African"), class = "factor"), 
    p.race_asian = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L), .Label = c("", 
    "Asian or Asian-American"), class = "factor"), p.race_aian = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 
    1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L), .Label = c("", "American Indian or Alaskan Native"
    ), class = "factor"), p.race_nhpi = c(NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
    p.race_other = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 
    1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "Other (please specify)"), class = "factor")), .Names = c("p.hispanic", 
"p.race_white", "p.race_black", "p.race_asian", "p.race_aian", 
"p.race_nhpi", "p.race_other"), class = "data.frame", row.names = c(NA, 
-79L))

Answer 1

This is not very elegant, but I think it works. 这不是很优雅，但我认为它有效。 Using loops, especially nested loops, is not very "R" since they are slow but also have side effects like cluttering your workspace. 使用循环，尤其是嵌套循环，不是很“R”，因为它们很慢，但也有副作用，如混乱你的工作区。

and you might want to change how this treats p.poc if race is unspecified because it defaults to 1 which may not be what you want. 如果未指定种族，你可能想要改变p.poc处理p.poc ，因为它默认为1，这可能不是你想要的。

So here is one way: 所以这是一种方式：

tmp <- lapply(1:nrow(p), function(ii) {
  ## this checks for columns that aren't blank or NA, takes the colname
  ## and strips off the prefix
  tmp <- gsub('p.race_', '', names(p)[which(p[ii, -1] != '' & !is.na(p[ii, -1])) + 1])

  ## some special cases for > 1 race and blanks and p.poc
  tmp <- ifelse(length(tmp) > 1, 'Two or more', tmp)
  tmp[is.na(tmp)] <- 'Not specified'
  tmp <- ifelse(p[ii, 1] %in% 'Yes', 'Hispanic or Latino', tmp)
  p.poc <- (!grepl('white', tmp)) * 1

  return(list(p.race = tmp, p.poc = p.poc))
})

head(do.call(rbind, tmp), 20)

#   p.race               p.poc
# [1,] "white"               0    
# [2,] "white"               0    
# [3,] "white"               0    
# [4,] "white"               0    
# [5,] "white"               0    
# [6,] "white"               0    
# [7,] "white"               0    
# [8,] "white"               0    
# [9,] "asian"               1    
# [10,] "white"              0    
# [11,] "other"              1    
# [12,] "white"              0    
# [13,] "white"              0    
# [14,] "white"              0    
# [15,] "Hispanic or Latino" 1    
# [16,] "white"              0    
# [17,] "white"              0    
# [18,] "white"              0    
# [19,] "white"              0    
# [20,] "white"              0   

## and combine back to the data frame
p <- cbind(p, do.call(rbind, tmp))

data: 数据：

p <- structure(list(p.hispanic = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", "No", "Yes"
), class = "factor"), p.race_white = structure(c(2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 
2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("", 
"White, European, Middle Eastern, or Caucasian"), class = "factor"), 
p.race_black = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
"Black, African-American, or African"), class = "factor"), 
p.race_asian = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L), .Label = c("", 
"Asian or Asian-American"), class = "factor"), p.race_aian = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 
1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("", "American Indian or Alaskan Native"
), class = "factor"), p.race_nhpi = c(NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
p.race_other = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
"Other (please specify)"), class = "factor")), .Names = c("p.hispanic", 
"p.race_white", "p.race_black", "p.race_asian", "p.race_aian", 
"p.race_nhpi", "p.race_other"), class = "data.frame", row.names = c(NA, 
  -79L))

Answer 2

The way my bring work, this sort of task always seems easier if the data are in a long format instead of a wide format. 我的工作方式，如果数据是长格式而不是宽格式，这种任务总是更容易。 However, this means a unique ID per response is needed - in a case like this you can just assign an integer to each row. 但是，这意味着需要每个响应的唯一ID - 在这种情况下，您可以为每行分配一个整数。

library(tidyr)
library(dplyr)

# Add individual ID to each row
p = mutate(p, id = 1:n())

Once that is done, I would do a little work to make the p.hispanic column look more like the other race columns, put the dataset in a long format, remove all NA /blanks, then make the two new variables. 完成后，我会做一些工作，使p.hispanic列看起来更像其他种族列，将数据集放在长格式中，删除所有NA /空格，然后创建两个新变量。 Once the new variables are made, they can be joined to the original. 创建新变量后，可以将它们连接到原始变量。 I use package tidyr for reshaping and dplyr for manipulation. 我使用包tidyr进行重新整形，使用dplyr进行操作。

p %>%
    mutate(p.hispanic = ifelse(p.hispanic == "No", NA, "Hispanic or Latino")) %>% # change p.hispanic column
    gather(category, answer, p.hispanic:p.race_other, na.rm = TRUE) %>%
    filter(answer != "") %>% # get rid of blanks (if were NA would have removed in "gather")
    group_by(id) %>%
    # Create new variable p.race and p.pop based on rules
    mutate(p.race = ifelse(n_distinct(answer) > 1, "Two or more races", answer),
          p.poc = as.integer(p.race == "White, European, Middle Eastern, or Caucasian")) %>%
    slice(1) %>% # take only 1 record for the duplicate id's
    select(-category, - answer) %>% # remove columns that aren't needed
    left_join(p, ., by = "id") %>% # join new columns with original dataset
    select(-id) # remove ID column if not wanted

Once you have this dataset, you could reset the levels of p.race with factor if you want the levels to look a certain way. 获得此数据集后，如果希望级别以某种方式p.race ，则可以使用factor重置p.race的级别。

重新编码单个变量分布在R中的几个

问题描述

2 个解决方案

解决方案1
2 2014-10-15 04:28:12

解决方案2
1 2014-10-15 17:24:10

重新编码单个变量分布在R中的几个

问题描述

2 个解决方案

解决方案1 2 2014-10-15 04:28:12

解决方案2 1 2014-10-15 17:24:10

解决方案1
2 2014-10-15 04:28:12

解决方案2
1 2014-10-15 17:24:10