
Recode multiple columns in R

I cannot find an answer to this specific question. I would like to recode multiple character columns into numeric columns (about a hundred of them). But:

  • the columns will not always be in the same order (I recode the refreshed data every month).
  • the columns I want to recode are separated by columns that I do not wish to recode.
  • the dataset does not always include the same columns.

So I do not think I can use a range of column indexes. However, the columns I wish to recode all start with the same column name prefix. I would like to recode any "Yes" to 1, "No" to 0, and blanks to NA.

I could do this manually, one column at a time, with the code below:

    #Recode columns one at a time

    library(car)
    #skip ID column
    #Skip Date column
    df$Q1<-as.numeric(as.character(recode(df$Q1,"NA=NA; 'No'=0; 'Yes'=1; ''=NA")))
    df$Q2<-as.numeric(as.character(recode(df$Q2,"NA=NA; 'No'=0; 'Yes'=1; ''=NA")))
    #skip Q2.Explanation column
    #do the above for a hundred more columns...

But I would like to recode a hundred specific columns at the same time. These columns are also separated by columns I do not wish to recode.

My data is below. I am not sure what dput is:

    ID<-c(01,02,03,04,05)
    Q1<-c("Yes", NA,"", "No",NA)
    Q1.Explanation<-c (NA, NA,"","Respondent did not get the correct answer", NA)
    Q2<-c("No","Yes","Yes","", NA)
    Q2.Explanation <-c("The right answer was not proven", NA, NA, NA, NA)
    Q3<-c("", NA, "Yes", NA, NA)
    Mydata<-as.data.frame(cbind(ID,Q1,Q1.Explanation, Q2, Q2.Explanation,Q3))

If you know that the columns you want to change always have the same names, just in different locations in the table, you can use a regular expression on the column names to select them, then change the values in those columns with apply().

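library(car)  # provides recode()
# Match only the question columns (Q1, Q2, ...), not Q1.Explanation etc.
q_cols <- grep("^Q[0-9]+$", colnames(your_data))
your_data[, q_cols] <- as.data.frame(apply(your_data[, q_cols, drop = FALSE],
                               2,
                               function(x) recode(x, "NA = NA; 'No' = 0; 'Yes' = 1; '' = NA")))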

This should recode all of your question columns, regardless of their location in any given month.
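For comparison, here is a minimal base R sketch of the same idea that does not depend on the car package. It assumes the question columns are named "Q" followed by digits only, as in the sample data. The helper name recode_yes_no and the small your_data frame are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: map "No"/"Yes" to 0/1 through a named lookup vector.
# Anything else (NA, "", other text) has no matching name and becomes NA.
recode_yes_no <- function(x) {
  lookup <- c(No = 0, Yes = 1)
  unname(lookup[as.character(x)])
}

# Small illustrative data set modelled on the question's sample data
your_data <- data.frame(
  ID = 1:5,
  Q1 = c("Yes", NA, "", "No", NA),
  Q1.Explanation = c(NA, NA, "", "Respondent did not get the correct answer", NA),
  Q2 = c("No", "Yes", "Yes", "", NA),
  stringsAsFactors = FALSE
)

# Select columns by name pattern, so their position and order do not matter
q_cols <- grep("^Q[0-9]+$", names(your_data), value = TRUE)
your_data[q_cols] <- lapply(your_data[q_cols], recode_yes_no)

str(your_data)
```

Because the columns are selected by name, the ID and explanation columns are left alone, and a month with fewer or more question columns needs no code change.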

For data.table fans, I have another solution. It also has the advantage of recoding to factors instead of plain integers, so that the meaning of each numeric value is still displayed (improving the readability of your data):

library(data.table)

ID<-c(01,02,03,04,05)
Q1<-c("Yes", NA,"", "No",NA)
Q1.Explanation<-c (NA, NA,"","Respondent did not get the correct answer", NA)
Q2<-c("No","Yes","Yes","", NA)
Q2.Explanation <-c("The right answer was not proven", NA, NA, NA, NA)
Q3<-c("", NA, "Yes", NA, NA)
Mydata<-as.data.frame(cbind(ID,Q1,Q1.Explanation, Q2, Q2.Explanation,Q3))

Mydata

# The solution starts here... ----------------------------------------------

setDT(Mydata)     # convert data.frame into data.table

# the regular expression selects all column names starting with a "Q" followed by digits until the end
affected.cols <- colnames(Mydata)[grep("^Q\\d+$", colnames(Mydata))]

# convert the columns to factors; trailing square brackets are only added to print the output
Mydata[, (affected.cols) := lapply(.SD, factor, levels = c("No", "Yes")), .SDcols = affected.cols][]

str(Mydata)           # Columns are encoded as factors ("enumerated types") now, which is an integer internally that has a string label

# Proof: 1 = "No", 2 = "Yes"; because only these two levels were passed to factor(), all other values (mainly empty strings) were translated into NAs
as.numeric(Mydata$Q1)

Which results in:

> as.numeric(Mydata$Q1)
[1]  2 NA NA  1 NA


> Mydata
   ID  Q1                            Q1.Explanation  Q2                  Q2.Explanation  Q3
1:  1 Yes                                        NA  No The right answer was not proven  NA
2:  2  NA                                        NA Yes                              NA  NA
3:  3  NA                                           Yes                              NA Yes
4:  4  No Respondent did not get the correct answer  NA                              NA  NA
5:  5  NA                                        NA  NA                              NA  NA

Note that the numeric values come from the factor's level indexes, which always start at 1: "No" has level index 1 and "Yes" has level index 2. This differs from the 0 and 1 requested in the question, so subtract 1 from the result of as.numeric() if you need exactly that coding.
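Since factor level indexes start at 1, getting the 0/1 coding asked for in the question takes one extra step. The snippet below is a small sketch on a standalone vector (the name q1 is only for illustration), not part of the original answer:

```r
# Same values as column Q1 in the sample data
q1 <- factor(c("Yes", NA, "", "No", NA), levels = c("No", "Yes"))

codes <- as.integer(q1)        # level indexes: 2 NA NA 1 NA
zero_one <- codes - 1L         # shifted to the requested coding: 1 NA NA 0 NA
print(zero_one)
```

The same subtraction could be applied to every affected column after the conversion to factors, at the cost of losing the "No"/"Yes" labels.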
