
Create new R dataframe column based on conditions in other columns

I have a data frame which has a date column, a column of integers (labelled value in the example below), and 12 other numeric columns, each corresponding to a month and labelled X1 (Jan) through X12 (Dec).

It looks something like:

date_var    value    X1       X2      X3     ...   X12
2016-01-01   100    1212     4161    9080    ...   383
2016-02-01   150    1212     4161    9080    ...   383
2016-03-01   150    1212     4161    9080    ...   383

What I'd like to do is create a new column, let's call it Z, which holds the number in the value column divided by the appropriate monthly value.

For example, in the table above Z for the 2016-01-01 entry would equal 100/1212, whereas the 2016-02-01 entry would instead divide by X2 for Feb and 2016-03-01 would have value divided by X3:

date_var    value    X1       X2      X3     ...   X12    Z
2016-01-01   100    1212     4161    9080    ...   383    0.0825
2016-02-01   150    1212     4161    9080    ...   383    0.0360
2016-03-01   150    1212     4161    9080    ...   383    0.0165

I've tried various approaches along the lines of dividing value by df[paste("X", month(df$date_var), sep = '')], but this returns a list rather than working element-wise, so it obviously isn't the correct approach.
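To see why that attempt returns a list, here is a minimal sketch. The three-row df is a cut-down copy of the example table (only X1 to X3 are included, which is an assumption for brevity), and base format() stands in for month() so that no package is needed. Indexing a data frame with a character vector selects whole columns, one per name, rather than one cell per row:

```r
# Cut-down copy of the example table (assumption: only X1..X3 are included)
df <- data.frame(
  date_var = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")),
  value = c(100, 150, 150),
  X1 = rep(1212, 3),
  X2 = rep(4161, 3),
  X3 = rep(9080, 3)
)

# Month number of each date, taken with base R instead of lubridate::month()
mon <- as.integer(format(df$date_var, "%m"))

# A character index picks entire columns, so the result is a 3 x 3
# data frame (i.e. a list of columns), not one number per row
picked <- df[paste0("X", mon)]
is.list(picked)
dim(picked)
```

What is needed instead is one (row, column) pair per row, which is what the answers below construct in different ways.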

Another good way uses the dplyr and tidyr packages. It takes the usual R approach of converting the data to long format (i.e. the same kind of information in the same column, here all of X1-X12) and then uses a filter condition to keep only the month value that matches the month of the date variable:

library(dplyr)
library(tidyr)
library(lubridate)

# test data frame (code from parksw3)
data <- tibble(  # tibble() replaces the deprecated data_frame()
  date_var = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")),
  value = c(100, 150, 150),
  X1 = rep(1212, 3),
  X2 = rep(4161, 3),
  X3 = rep(9080, 3),
  X12 = rep(383, 3)
) 

# calculate the resulting Z column
result <- data %>% 
  # gather all the month (X1-X12) values into long format 
  # with month_var and month_value as key/value pair
  gather(month_var, month_value, starts_with("X")) %>% 
  # only consider the month_value for the month_var that matches the date's month
  filter(month_var == paste0("X", month(date_var))) %>% 
  # calculate the derived quantity
  mutate(Z = value/month_value)

print(result)

##     date_var value month_var month_value          Z
##       <date> <dbl>     <chr>       <dbl>      <dbl>
## 1 2016-01-01   100        X1        1212 0.08250825
## 2 2016-02-01   150        X2        4161 0.03604903
## 3 2016-03-01   150        X3        9080 0.01651982

If you want, you can merge it back into your original data frame:

data_all <- left_join(data, select(result, date_var, Z), by = "date_var")

print(data_all)

##     date_var value    X1    X2    X3   X12          Z
##       <date> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
## 1 2016-01-01   100  1212  4161  9080   383 0.08250825
## 2 2016-02-01   150  1212  4161  9080   383 0.03604903
## 3 2016-03-01   150  1212  4161  9080   383 0.01651982
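In current versions of tidyr, gather() is superseded by pivot_longer(). As a hedged sketch of the same idea, assuming tidyr 1.0.0 or later is installed, the reshaping step could be written as follows. Base format() is used here in place of lubridate's month() purely to keep the example to two packages:

```r
library(dplyr)
library(tidyr)

# Same test data as above
data <- tibble(
  date_var = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")),
  value = c(100, 150, 150),
  X1 = rep(1212, 3),
  X2 = rep(4161, 3),
  X3 = rep(9080, 3),
  X12 = rep(383, 3)
)

result <- data %>%
  # stack the X* columns into a name column and a value column
  pivot_longer(starts_with("X"),
               names_to = "month_var",
               values_to = "month_value") %>%
  # keep only the row whose month column matches the date's month
  filter(month_var == paste0("X", as.integer(format(date_var, "%m")))) %>%
  # calculate the derived quantity
  mutate(Z = value / month_value)

print(result)
```

The filter and mutate steps are unchanged; only the reshaping call differs.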

Take a look at this post. I think there should be a simpler way, but here's what I did based on that post; both methods seem to work:

Data:

df <- data.frame(
    date_var = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")),
    value = c(100, 150, 150),
    X1 = rep(1212, 3),
    X2 = rep(4161, 3),
    X3 = rep(9080, 3),
    X12 = rep(383, 3)
)

Method 1:

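library(lubridate)  # for month()

# name of the month column each row should use, e.g. "X1" for January
m <- paste0("X", month(df$date_var))
# two-column index matrix: (row number, position of that row's month column)
sub <- cbind(seq_len(nrow(df)), match(m, names(df)))
# indexing a data frame with a matrix goes through as.matrix(), which gives
# character values here because of the Date column, hence as.numeric()
Z2 <- df$value/as.numeric(df[sub])
df2 <- cbind(df, Z2)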

Method 2:

Z3 <- sapply(rownames(df), function(x){
    with(df[x,],{
        m <- month(date_var)
        value/get(paste0("X", m))
    })
})
df3 <- cbind(df, Z3)

Result:

##     date_var value   X1   X2   X3 X12         Z3
## 1 2016-01-01   100 1212 4161 9080 383 0.08250825
## 2 2016-02-01   150 1212 4161 9080 383 0.03604903
## 3 2016-03-01   150 1212 4161 9080 383 0.01651982
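A possible base R variation on Method 1, offered as a sketch rather than as the approach from the linked post: if only the month columns are converted to a matrix, the matrix stays numeric and no as.numeric() round trip through character is needed. The names month_cols, month_mat, wanted and idx are made up for this illustration, and the regular expression assumes the month columns are named X followed by digits:

```r
df <- data.frame(
  date_var = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")),
  value = c(100, 150, 150),
  X1 = rep(1212, 3),
  X2 = rep(4161, 3),
  X3 = rep(9080, 3),
  X12 = rep(383, 3)
)

# month columns only, as a numeric matrix (assumed naming: X + digits)
month_cols <- grep("^X[0-9]+$", names(df), value = TRUE)
month_mat <- as.matrix(df[month_cols])

# name of the column each row needs, from the month of its date
wanted <- paste0("X", as.integer(format(df$date_var, "%m")))

# one (row, column) pair per row of df
idx <- cbind(seq_len(nrow(df)), match(wanted, month_cols))

df$Z <- df$value / month_mat[idx]
df
```

If a date falls in a month that has no column, match() returns NA for that row and the corresponding Z should come out as NA rather than raising an error.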

As an exploration into the trials of R indexing, here is a pseudo-tidyverse answer.

First let's generate some dummy data.

library(tidyverse)

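data <- tibble(
    date_var = seq(as.Date("2016-01-01"), by = "month", length.out = 12),
    value = ceiling(runif(12, 100, 200))
)

# attach 12 random monthly columns, named V1..V12, to every row
# (tibble() and as_tibble() replace the deprecated data_frame() and as_data_frame())
data %>%
    mutate(months = map(value, function(x){matrix(ceiling(runif(12, 50, 5000)), ncol = 12,
                                                  dimnames = list(NULL, paste0("V", 1:12)))}),
           months = map(months, as_tibble)) %>%
    unnest(months) %>%
    as.data.frame() ->
    sample.data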

head(sample.data)
#>     date_var value   V1   V2   V3   V4   V5   V6   V7   V8   V9  V10  V11  V12
#> 1 2016-01-01   147 2004 2456 3983 4464 2473 2824 2038 1354 3433   51  574 1381
#> 2 2016-02-01   170 2862 3579  543 1458 2472  826 3865  528  187  951 4732 1849
#> 3 2016-03-01   107 2860 1359 4366 1824  173 3541  624   76 4113  771  808 3457
#> 4 2016-04-01   115 1707 4015 3951 2774 2726 1789 2189 1903 1706  124 3679 1876
#> 5 2016-05-01   120 1058 4169 2594 4334  221  494 2032 1425 2525 3358  791 3691
#> 6 2016-06-01   191 4169  570 3245 1682 3811 4350 2344 4338 2258  779 1835 2480

Now that we have some sample data, we can use matrix indexing (a two-column matrix of row and column positions) to extract one value per row, choosing the column by the month. I'm assuming that the months are named V1 through V12 (as they are in my dataset).

library(lubridate)  # month(); load it explicitly in case library(tidyverse) does not attach it

sample.data %>%
    # seq_len(nrow(.)) gives one row index per row; seq_along(nrow(.)) would always be 1
    mutate(Z = .[cbind(seq_len(nrow(.)), match(sprintf("V%s", month(date_var)), names(.)))],
           Z = as.numeric(Z),
           Z = value / Z) ->
    result

head(result)
#>     date_var value   V1   V2   V3   V4   V5   V6   V7   V8   V9  V10  V11  V12          Z
#> 1 2016-01-01   147 2004 2456 3983 4464 2473 2824 2038 1354 3433   51  574 1381 0.07335329
#> 2 2016-02-01   170 2862 3579  543 1458 2472  826 3865  528  187  951 4732 1849 0.04749930
#> 3 2016-03-01   107 2860 1359 4366 1824  173 3541  624   76 4113  771  808 3457 0.02450756
#> 4 2016-04-01   115 1707 4015 3951 2774 2726 1789 2189 1903 1706  124 3679 1876 0.04145638
#> 5 2016-05-01   120 1058 4169 2594 4334  221  494 2032 1425 2525 3358  791 3691 0.54298643
#> 6 2016-06-01   191 4169  570 3245 1682 3811 4350 2344 4338 2258  779 1835 2480 0.04390805
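One detail matters when building the row index for this kind of matrix indexing: seq_len(nrow(x)) counts the rows, whereas seq_along(nrow(x)) always returns 1, because nrow(x) is a vector of length one. With the latter, cbind() recycles the single 1 and every lookup reads from the first row. A small sketch on a toy matrix (m, n and cols are illustrative names only):

```r
n <- 3L
seq_along(n)  # 1      (n is a vector of length one)
seq_len(n)    # 1 2 3

m <- matrix(1:9, nrow = 3)  # filled column by column
cols <- c(1L, 2L, 3L)

# row index recycled to 1, 1, 1: picks m[1,1], m[1,2], m[1,3]
first_row <- m[cbind(seq_along(n), cols)]
first_row     # 1 4 7

# one row index per row: picks m[1,1], m[2,2], m[3,3]
diagonal <- m[cbind(seq_len(n), cols)]
diagonal      # 1 5 9
```

A quick way to check a result for this problem is to divide value by Z for the second row and confirm that the quotient is a number from that row rather than from the first.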

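Not the most elegant way, but you can use a for loop (assuming the columns are laid out as in the question: the date first, then value, then X1 through X12):

# data: your data frame, with columns ordered date, value, X1, ..., X12
x = as.numeric(format(data[,1],"%m"))   # month number of each row
for (i in seq_len(nrow(data))){
    # the column for month m sits at position m + 2
    data[i,"Z"] = data[i,2]/data[i,x[i]+2]
}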
