Create new R dataframe column based on conditions in other columns
I have a dataframe which has a date column, a column of ints (labelled value in the example below), and 12 other numeric columns, each corresponding to a month and labelled X1 (Jan) through X12 (Dec).
It looks something like:
date_var value X1 X2 X3 ... X12
2016-01-01 100 1212 4161 9080 ... 383
2016-02-01 150 1212 4161 9080 ... 383
2016-03-01 150 1212 4161 9080 ... 383
What I'd like to do is create a new column, let's call it Z, which corresponds to the number in the value column divided by the appropriate monthly value.
For example, in the table above, Z for the 2016-01-01 entry would equal 100/1212, whereas the 2016-02-01 entry would instead divide by X2 for Feb, and 2016-03-01 would have value divided by X3:
date_var value X1 X2 X3 ... X12 Z
2016-01-01 100 1212 4161 9080 ... 383 0.0825
2016-02-01 150 1212 4161 9080 ... 383 0.0360
2016-03-01 150 1212 4161 9080 ... 383 0.0165
I've tried various approaches along the lines of dividing value by df[paste("X", month(df$date_var), sep = '')], although this returned a list rather than working element-wise, so it obviously isn't the correct approach.
Another good way, using the dplyr and tidyr packages, takes the usual R approach of converting your data to long format (i.e. the same kind of information in the same column, here all of X1-X12) and then uses a filter condition to keep only the month value that matches the month in your date variable:
library(dplyr)
library(tidyr)
library(lubridate)
# test data frame (code from parksw3)
# tibble() replaces the deprecated data_frame()
data <- tibble(
  date_var = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")),
  value = c(100, 150, 150),
  X1 = rep(1212, 3),
  X2 = rep(4161, 3),
  X3 = rep(9080, 3),
  X12 = rep(383, 3)
)
# calculate the resulting Z column
result <- data %>%
  # gather all the month (X1-X12) values into long format
  # with month_var and month_value as key/value pair
  gather(month_var, month_value, starts_with("X")) %>%
  # only consider the month_value for the month_var that matches the date's month
  filter(month_var == paste0("X", month(date_var))) %>%
  # calculate the derived quantity
  mutate(Z = value / month_value)
print(result)
## date_var value month_var month_value Z
## <date> <dbl> <chr> <dbl> <dbl>
## 1 2016-01-01 100 X1 1212 0.08250825
## 2 2016-02-01 150 X2 4161 0.03604903
## 3 2016-03-01 150 X3 9080 0.01651982
If you want, you can merge it back into your original data frame:
data_all <- left_join(data, select(result, date_var, Z), by = "date_var")
print(data_all)
## date_var value X1 X2 X3 X12 Z
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2016-01-01 100 1212 4161 9080 383 0.08250825
## 2 2016-02-01 150 1212 4161 9080 383 0.03604903
## 3 2016-03-01 150 1212 4161 9080 383 0.01651982
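gather() still works, but tidyr documents it as superseded by pivot_longer(). The following is a minimal sketch of the same long-format approach, assuming tidyr 1.0.0 or later. The column names month_num and month_value are illustrative choices, and base format() stands in for lubridate::month() so that only dplyr and tidyr are needed:

```r
library(dplyr)
library(tidyr)

# Same test data as in the answer above
data <- tibble(
  date_var = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")),
  value = c(100, 150, 150),
  X1 = rep(1212, 3),
  X2 = rep(4161, 3),
  X3 = rep(9080, 3),
  X12 = rep(383, 3)
)

result <- data %>%
  # pivot_longer() plays the role of gather(); names_prefix strips the leading "X"
  pivot_longer(starts_with("X"),
               names_to = "month_num",
               names_prefix = "X",
               values_to = "month_value") %>%
  # keep only the row whose month column matches the month of the date
  filter(as.integer(month_num) == as.integer(format(date_var, "%m"))) %>%
  # calculate the derived quantity
  mutate(Z = value / month_value)

print(result)
```

Stripping the prefix up front means the filter compares integers rather than pasted strings, which avoids surprises if the month columns are ever zero-padded.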
Take a look at this post. I think there should be a simpler way, but here's what I did based on that post, and both methods seem to work:
Data:
library(lubridate)  # for month(), used by both methods below
df <- data.frame(
  date_var = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")),
  value = c(100, 150, 150),
  X1 = rep(1212, 3),
  X2 = rep(4161, 3),
  X3 = rep(9080, 3),
  X12 = rep(383, 3)
)
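Method 1:
# name of the month column to use for each row
m <- paste0("X", month(df$date_var))
# two-column matrix of (row, column) positions
sub <- cbind(seq_len(nrow(df)),
             match(m, names(df)))
# indexing a data frame by a matrix coerces it to a character matrix
# (because of the date column), hence as.numeric()
Z2 <- df$value / as.numeric(df[sub])
df2 <- cbind(df, Z2)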
Method 2:
Z3 <- sapply(rownames(df), function(x){
  with(df[x,],{
    m <- month(date_var)
    value/get(paste0("X", m))
  })
})
df3 <- cbind(df, Z3)
Result:
## date_var value X1 X2 X3 X12 Z3
## 1 2016-01-01 100 1212 4161 9080 383 0.08250825
## 2 2016-02-01 150 1212 4161 9080 383 0.03604903
## 3 2016-03-01 150 1212 4161 9080 383 0.01651982
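A variation on the matrix-indexing idea, as a sketch rather than a definitive version: if only the month columns are converted to a matrix, the values stay numeric and no as.numeric() round trip is needed. It assumes the month columns are named X followed by the month number. The helper names mon, month_cols, month_mat and idx are made up for this example, and base format() is used so that no extra package is required:

```r
df <- data.frame(
  date_var = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")),
  value = c(100, 150, 150),
  X1 = rep(1212, 3),
  X2 = rep(4161, 3),
  X3 = rep(9080, 3),
  X12 = rep(383, 3)
)

# month number (1-12) of each row, using base R only
mon <- as.integer(format(df$date_var, "%m"))

# numeric matrix holding just the month columns
month_cols <- grep("^X[0-9]+$", names(df), value = TRUE)
month_mat <- as.matrix(df[month_cols])

# one (row, column) pair per row of df
idx <- cbind(seq_len(nrow(df)), match(paste0("X", mon), month_cols))

# element-wise division by the matching month value
df$Z <- df$value / month_mat[idx]
print(df)
```

Matching against month_cols rather than names(df) keeps the column positions relative to month_mat, so the result does not depend on where the month columns sit in the data frame.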
As an exploration into the trials of R indexing, here is a pseudo-tidyverse answer.
First let's generate some dummy data.
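library(tidyverse)
library(lubridate)  # for month(); library(tidyverse) attaches it only from tidyverse 2.0.0 on
data <- tibble(
  date_var = seq(as.Date("2016-01-01"), by = "month", length.out = 12),
  value = ceiling(runif(12, 100, 200))
)
data %>%
  # one random 1 x 12 matrix per row, with the columns named V1..V12 explicitly
  mutate(months = map(value, function(x){matrix(ceiling(runif(12, 50, 5000)), ncol = 12,
                                                dimnames = list(NULL, paste0("V", 1:12)))}),
         months = map(months, as_tibble)) %>%
  unnest(months) %>%
  as.data.frame() ->
  sample.data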
head(sample.data)
#> date_var value V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
#> 1 2016-01-01 147 2004 2456 3983 4464 2473 2824 2038 1354 3433 51 574 1381
#> 2 2016-02-01 170 2862 3579 543 1458 2472 826 3865 528 187 951 4732 1849
#> 3 2016-03-01 107 2860 1359 4366 1824 173 3541 624 76 4113 771 808 3457
#> 4 2016-04-01 115 1707 4015 3951 2774 2726 1789 2189 1903 1706 124 3679 1876
#> 5 2016-05-01 120 1058 4169 2594 4334 221 494 2032 1425 2525 3358 791 3691
#> 6 2016-06-01 191 4169 570 3245 1682 3811 4350 2344 4338 2258 779 1835 2480
Now that we have some sample data, we can use dual (row, column) indexing to extract, for each row, the value from the column that matches the month. I'm assuming that the month columns are named V1 to V12 (as they are in my dataset).
sample.data %>%
  # seq_len(nrow(.)) gives one row index per row; seq_along(nrow(.)) is just 1,
  # which would take every month value from the first row
  mutate(Z = .[cbind(seq_len(nrow(.)), match(sprintf("V%s", month(date_var)), names(.)))],
         Z = as.numeric(Z),
         Z = value / Z) ->
  result
head(result)
#> date_var value V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 Z
#> 1 2016-01-01 147 2004 2456 3983 4464 2473 2824 2038 1354 3433 51 574 1381 0.07335329
#> 2 2016-02-01 170 2862 3579 543 1458 2472 826 3865 528 187 951 4732 1849 0.04749930
#> 3 2016-03-01 107 2860 1359 4366 1824 173 3541 624 76 4113 771 808 3457 0.02450756
#> 4 2016-04-01 115 1707 4015 3951 2774 2726 1789 2189 1903 1706 124 3679 1876 0.04145638
#> 5 2016-05-01 120 1058 4169 2594 4334 221 494 2032 1425 2525 3358 791 3691 0.54298643
#> 6 2016-06-01 191 4169 570 3245 1682 3811 4350 2344 4338 2258 779 1835 2480 0.04390805
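To see what the two-column index does in isolation, here is a minimal sketch on a small made-up matrix (the names m, idx, picked, bad_idx and bad are purely illustrative). It also shows why the row index should be built with seq_len(nrow(.)): seq_along(nrow(.)) has length one, so cbind() recycles it and every lookup lands in row 1.

```r
# Hypothetical 3 x 3 numeric matrix, for illustration only
m <- matrix(1:9, nrow = 3, dimnames = list(NULL, c("V1", "V2", "V3")))

# Pick V1 from row 1, V2 from row 2, V3 from row 3
idx <- cbind(seq_len(nrow(m)), match(c("V1", "V2", "V3"), colnames(m)))
picked <- m[idx]
print(picked)   # 1 5 9

# Pitfall: seq_along(nrow(m)) is just 1, so the row index is recycled
bad_idx <- cbind(seq_along(nrow(m)), match(c("V1", "V2", "V3"), colnames(m)))
bad <- m[bad_idx]
print(bad)      # 1 4 7, i.e. everything taken from row 1
```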
Not the most elegant way, but you can use a for loop (assuming this is the layout of the data):
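data <- yourData  # placeholder: date in column 1, value in column 2, X1..X12 in columns 3 to 14
x <- as.numeric(format(data[, 1], "%m"))  # month number of each row
for (i in seq_len(nrow(data))) {
  # month m is stored in column m + 2
  data[i, "Z"] <- data[i, 2] / data[i, x[i] + 2]
}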