简体   繁体   中英

Add new calculated column in a r dataframe based on another column

I have a dataframe in R, called "data" like this

c1      c2  c3
A1000   "x" 100 
A1200   "x" 200 
A3000   "y" 150 
A2000   "x" 250 
A3200   "t" 100 
A1000   "e" 250 
A1200   "w" 300

I need to create another column, lets say "c4", whith the category name based on following criteria:

Code        Name
-----------------------------
A10         "Activity 1"
A12         "Activity 2"
A20         "Activity 3"
other code  "Other activity"

Where "code" corresponds to the first 3 characters of column c1 in my data. I have the following code in R:

cat_x <- function(data_x){
  if(substr(data_x, star=1, stop=3) == "A10"){
    return("Activity 1")
  } else if(substr(data_x, star=1, stop=3) == "A12") {
    return("Activity 2")
  } else if(substr(data_x, star=1, stop=3) == "A20") {
    return("Activity 3")
  } else {
    return("Other activity")
  }
  
}

data["c4"] <- cat_x(data$c1)

However I get the following error: "the condition has length > 1 and only the first element will be used"

Please help me to solve this, using my function "cat_x".

Thanks in advance

Use sapply :

df$c4 <- sapply(df$c1, cat_x)

Your code is not vectorized, so it's not coded to deal with an entire vector at once. Instead it deals with one element at a time, which is how sapply will use it.


You could also use the library dplyr and case_when to code this like:

library(dplyr)

df %>% 
  mutate(c4 = case_when( 
    startsWith(c1, "A10") ~ "Activity 1",
    startsWith(c1, "A12") ~ "Activity 2",
    startsWith(c1, "A20") ~ "Activity 3",
    T ~ "Other Activity"))

Output

     c1 c2  c3             c4
1 A1000  x 100     Activity 1
2 A1200  x 200     Activity 2
3 A3000  y 150 Other Activity
4 A2000  x 250     Activity 3
5 A3200  t 100 Other Activity
6 A1000  e 250     Activity 1
7 A1200  w 300     Activity 2

There are definitely better solutions out there but this one is the closest to your own. You first have to create an empty vector of type character with the same length as the number of rows in your data frame named c4 . Then you iterate over the first column whose first three characters you would like to extract and then fill your c4 in every iteration with the right match.

cat_x <- function(data_x){
  c4 <- vector("character", length = nrow(data_x))
  
  for(i in 1:nrow(data_x)) {
    if(substr(data_x[i, 1], star = 1, stop = 3) == "A10"){
      c4[[i]] <- "Activity 1"
    } else if(substr(data_x[i, 1], star = 1, stop = 3) == "A12") {
      c4[[i]] <- "Activity 2"
    } else if(substr(data_x[i, 1], star = 1, stop = 3) == "A20") {
      c4[[i]] <- "Activity 3"
    } else {
      c4[[i]] <- "Other activity"
    }
  }
  cbind(data_x, c4)
}

cat_x(df)

     c1 c2  c3             c4
1 A1000  x 100     Activity 1
2 A1200  x 200     Activity 2
3 A3000  y 150 Other activity
4 A2000  x 250     Activity 3
5 A3200  t 100 Other activity
6 A1000  e 250     Activity 1
7 A1200  w 300     Activity 2

Data

df <- read.table(header = TRUE, text = "
                 c1      c2  c3
A1000   x 100 
A1200   x 200 
A3000   y 150 
A2000   x 250 
A3200   t 100 
A1000   e 250 
A1200   w 300")

This is a standard merge operation. First make your codes into a data frame and use dput to make them easily available:

data <- structure(list(c1 = c("A1000", "A1200", "A3000", "A2000", "A3200", 
    "A1000", "A1200"), c2 = c("x", "x", "y", "x", "t", "e", "w"), 
    c3 = c(100L, 200L, 150L, 250L, 100L, 250L, 300L)), class = "data.frame",
    row.names = c(NA, -7L))

codes <- structure(list(Code = c("A10", "A12", "A20"), Name = c("Activity 1", 
    "Activity 2", "Activity 3")), class = "data.frame", row.names = c(NA, -3L))

Now create a column in data that matches the code and merge:

data$Code <- substr(data$c1, 1, 3)
data.mrg <- merge(data, codes, all=TRUE)
#   Code    c1 c2  c3       Name
# 1  A10 A1000  x 100 Activity 1
# 2  A10 A1000  e 250 Activity 1
# 3  A12 A1200  x 200 Activity 2
# 4  A12 A1200  w 300 Activity 2
# 5  A20 A2000  x 250 Activity 3
# 6  A30 A3000  y 150       <NA>
# 7  A32 A3200  t 100       <NA>

If you want to remove the Code column and rename Name to c4 :

data.mrg <- data.mrg[, -1]      # Optional to get rid of first column
colnames(data.mrg)[4] <- "c4".  # Optional to change column name
data.mrg
#      c1 c2  c3         c4
# 1 A1000  x 100 Activity 1
# 2 A1000  e 250 Activity 1
# 3 A1200  x 200 Activity 2
# 4 A1200  w 300 Activity 2
# 5 A2000  x 250 Activity 3
# 6 A3000  y 150       <NA>
# 7 A3200  t 100       <NA>

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM