Create new R dataframe column based on conditions in other columns

I have a data frame with a date column, a column of integers (labelled value in the example below), and 12 other numeric columns, one per month, labelled X1 (Jan) through X12 (Dec).

It looks something like this:

date_var    value    X1       X2      X3     ...   X12
2016-01-01   100    1212     4161    9080    ...   383
2016-02-01   150    1212     4161    9080    ...   383
2016-03-01   150    1212     4161    9080    ...   383

What I'd like to do is create a new column, let's call it Z, equal to the number in the value column divided by the monthly value that matches the row's date.

For example, in the table above, Z for the 2016-01-01 entry would equal 100/1212, whereas the 2016-02-01 entry would divide by X2 (Feb) and the 2016-03-01 entry would divide value by X3:

date_var    value    X1       X2      X3     ...   X12    Z
2016-01-01   100    1212     4161    9080    ...   383    0.0825
2016-02-01   150    1212     4161    9080    ...   383    0.0360
2016-03-01   150    1212     4161    9080    ...   383    0.0165

I've tried various approaches along the lines of dividing value by df[paste("X", month(df$date_var), sep = '')], but this returns a list rather than working element-wise, so it obviously isn't the correct approach.

Another good way, using the dplyr and tidyr packages, takes the usual R approach of converting your data to long format (i.e. the same kind of information in the same column, here all of X1-X12) and then uses a filter condition to keep only the month values that match the month of your date variable:

library(dplyr)
library(tidyr)
library(lubridate)

# test data frame (code from parksw3); tibble() replaces the deprecated data_frame()
data <- tibble(
  date_var = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")),
  value = c(100, 150, 150),
  X1 = rep(1212, 3),
  X2 = rep(4161, 3),
  X3 = rep(9080, 3),
  X12 = rep(383, 3)
) 

# calculate the resulting Z column
result <- data %>% 
  # gather all the month (X1-X12) values into long format 
  # with month_var and month_value as key/value pair
  gather(month_var, month_value, starts_with("X")) %>% 
  # only consider the month_value for the month_var that matches the date's month
  filter(month_var == paste0("X", month(date_var))) %>% 
  # calculate the derived quantity
  mutate(Z = value/month_value)

print(result)

##     date_var value month_var month_value          Z
##       <date> <dbl>     <chr>       <dbl>      <dbl>
## 1 2016-01-01   100        X1        1212 0.08250825
## 2 2016-02-01   150        X2        4161 0.03604903
## 3 2016-03-01   150        X3        9080 0.01651982

If you want, you can merge it back into your original data frame:

data_all <- left_join(data, select(result, date_var, Z), by = "date_var")

print(data_all)

##     date_var value    X1    X2    X3   X12          Z
##       <date> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
## 1 2016-01-01   100  1212  4161  9080   383 0.08250825
## 2 2016-02-01   150  1212  4161  9080   383 0.03604903
## 3 2016-03-01   150  1212  4161  9080   383 0.01651982
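
In current tidyr releases gather() is marked as superseded in favour of pivot_longer(). Below is a minimal sketch of the same reshape-and-filter idea with pivot_longer(), assuming tidyr 1.0.0 or later. The name result_long is only illustrative, and the month number is taken with base format() so that lubridate is not required:

```r
library(dplyr)
library(tidyr)

# same test data as above
data <- tibble(
  date_var = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")),
  value = c(100, 150, 150),
  X1 = rep(1212, 3),
  X2 = rep(4161, 3),
  X3 = rep(9080, 3),
  X12 = rep(383, 3)
)

result_long <- data %>%
  # one row per (date, month column) pair
  pivot_longer(starts_with("X"), names_to = "month_var", values_to = "month_value") %>%
  # keep only the month column that matches the month of the date
  filter(month_var == paste0("X", as.integer(format(date_var, "%m")))) %>%
  mutate(Z = value / month_value)

print(result_long)
```

Joining the result back on date_var, as in the left_join() call above, assumes that each date occurs only once in the data.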

Take a look at this post. I think there should be a simpler way, but here's what I did based on that post, and both methods seem to work:

Data:

df <- data.frame(
    date_var = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")),
    value = c(100, 150, 150),
    X1 = rep(1212, 3),
    X2 = rep(4161, 3),
    X3 = rep(9080, 3),
    X12 = rep(383, 3)
)

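Method 1:

library(lubridate)  # provides month()

# name of the month column that each row should be divided by
m <- paste0("X", month(df$date_var))
# two-column index matrix: (row number, position of the matching column)
idx <- cbind(seq_len(nrow(df)),
    match(m, names(df))
)
# indexing a data frame that contains a Date column by matrix returns character values, hence as.numeric()
Z2 <- df$value/as.numeric(df[idx])
df2 <- cbind(df, Z2)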

Method 2:

Z3 <- sapply(rownames(df), function(x){
    with(df[x,],{
        m <- month(date_var)
        value/get(paste0("X", m))
    })
})
df3 <- cbind(df, Z3)

Result:

##     date_var value   X1   X2   X3 X12         Z3
## 1 2016-01-01   100 1212 4161 9080 383 0.08250825
## 2 2016-02-01   150 1212 4161 9080 383 0.03604903
## 3 2016-03-01   150 1212 4161 9080 383 0.01651982
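
Method 1 needs as.numeric() because indexing the whole data frame by a matrix first coerces it to a matrix, and the Date column forces that matrix to character. A minimal base R sketch that avoids the round trip by indexing only the numeric month columns; df_demo, month_names, month_mat and wanted are illustrative names, and the month columns are assumed to be named X followed by the month number:

```r
df_demo <- data.frame(
    date_var = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")),
    value = c(100, 150, 150),
    X1 = rep(1212, 3),
    X2 = rep(4161, 3),
    X3 = rep(9080, 3),
    X12 = rep(383, 3)
)

# numeric matrix holding only the month columns
month_names <- grep("^X[0-9]+$", names(df_demo), value = TRUE)
month_mat <- as.matrix(df_demo[month_names])

# month column wanted for each row, taken from the date
wanted <- paste0("X", as.integer(format(df_demo$date_var, "%m")))

# one (row, column) pair per row; a missing month column gives NA
idx <- cbind(seq_len(nrow(df_demo)), match(wanted, month_names))
df_demo$Z <- df_demo$value / month_mat[idx]

print(df_demo)
```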

As an exploration into the trials of R indexing, here is a pseudo-tidyverse answer.

First, let's generate some dummy data.

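library(tidyverse)
library(lubridate)  # month(); not attached by library(tidyverse) before tidyverse 2.0.0

data <- tibble(
    date_var = seq(as.Date("2016-01-01"), by = "month", length.out = 12),
    value = ceiling(runif(12, 100, 200))
)

data %>%
    # one random 1 x 12 matrix per row, with explicit column names V1..V12
    mutate(months = map(value, function(x){matrix(ceiling(runif(12, 50, 5000)), ncol = 12,
                                                  dimnames = list(NULL, paste0("V", 1:12)))}),
           months = map(months, as_tibble)) %>%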
    unnest(months) %>%
    as.data.frame() ->
    sample.data

head(sample.data)
#>     date_var value   V1   V2   V3   V4   V5   V6   V7   V8   V9  V10  V11  V12
#> 1 2016-01-01   147 2004 2456 3983 4464 2473 2824 2038 1354 3433   51  574 1381
#> 2 2016-02-01   170 2862 3579  543 1458 2472  826 3865  528  187  951 4732 1849
#> 3 2016-03-01   107 2860 1359 4366 1824  173 3541  624   76 4113  771  808 3457
#> 4 2016-04-01   115 1707 4015 3951 2774 2726 1789 2189 1903 1706  124 3679 1876
#> 5 2016-05-01   120 1058 4169 2594 4334  221  494 2032 1425 2525 3358  791 3691
#> 6 2016-06-01   191 4169  570 3245 1682 3811 4350 2344 4338 2258  779 1835 2480

Now that we have some sample data, we can use dual indexing (a two-column matrix of row and column positions) to extract one value per row, based on the month. I'm assuming that the month columns are named V1 to V12 (as they are in my dataset).

sample.data %>%
    # seq_len(nrow(.)) gives one row index per row; seq_along(nrow(.)) would always be 1
    mutate(Z = .[cbind(seq_len(nrow(.)), match(sprintf("V%s", month(date_var)), names(.)))], 
           Z = as.numeric(Z),
           Z = value / Z) ->
    result

head(result)
#>     date_var value   V1   V2   V3   V4   V5   V6   V7   V8   V9  V10  V11  V12          Z
#> 1 2016-01-01   147 2004 2456 3983 4464 2473 2824 2038 1354 3433   51  574 1381 0.07335329
#> 2 2016-02-01   170 2862 3579  543 1458 2472  826 3865  528  187  951 4732 1849 0.04749930
#> 3 2016-03-01   107 2860 1359 4366 1824  173 3541  624   76 4113  771  808 3457 0.02450756
#> 4 2016-04-01   115 1707 4015 3951 2774 2726 1789 2189 1903 1706  124 3679 1876 0.04145638
#> 5 2016-05-01   120 1058 4169 2594 4334  221  494 2032 1425 2525 3358  791 3691 0.54298643
#> 6 2016-06-01   191 4169  570 3245 1682 3811 4350 2344 4338 2258  779 1835 2480 0.04390805
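
The row index in the index matrix has to run over every row. A small sketch on a toy data frame (d, cols and m are illustrative names) shows the difference between seq_len(nrow(d)) and seq_along(nrow(d)): the latter is always 1, so cbind() recycles it and every lookup reads from the first row.

```r
d <- data.frame(V1 = c(10, 20, 30), V2 = c(40, 50, 60))
cols <- c(1L, 2L, 1L)   # column wanted for rows 1, 2 and 3
m <- as.matrix(d)

seq_along(nrow(d))      # 1
seq_len(nrow(d))        # 1 2 3

wrong <- m[cbind(seq_along(nrow(d)), cols)]  # pairs (1,1), (1,2), (1,1)
right <- m[cbind(seq_len(nrow(d)), cols)]    # pairs (1,1), (2,2), (3,1)

print(wrong)  # 10 40 10
print(right)  # 10 50 30
```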

Not the most elegant way, but you can use a for loop (assuming this is the layout of the data):

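# data: your data frame, with columns ordered date_var, value, X1, ..., X12
x = as.numeric(format(data[,1],"%m"))  # month number of each row
for (i in seq_len(nrow(data))){
  # month m lives in column m + 2 because date_var and value come first
  data[i,"Z"] = data[i,2]/data[i,x[i]+2]
}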
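
The loop above depends on column positions, so it only works when all twelve month columns are present and in order. A hedged variant that looks each column up by name instead, assuming the columns are called date_var, value and X1 to X12 as in the question; month_num and month_col are illustrative names, and the December row is added only to show that a gap between X3 and X12 does not matter:

```r
data <- data.frame(
  date_var = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01", "2016-12-01")),
  value = c(100, 150, 150, 383),
  X1 = rep(1212, 4),
  X2 = rep(4161, 4),
  X3 = rep(9080, 4),
  X12 = rep(383, 4)
)

month_num <- as.integer(format(data$date_var, "%m"))
data$Z <- NA_real_
for (i in seq_len(nrow(data))) {
  # name of the month column for this row, e.g. "X12" for December
  month_col <- paste0("X", month_num[i])
  data[i, "Z"] <- data[i, "value"] / data[i, month_col]
}

print(data)
```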
