[英]Discounting values based on previous appearance in r
I have the following (sample) dataset:我有以下(示例)数据集:
df <- data.table(
firm = rep(c("A","B"),each=20),
year = c(1994,1994,1994,1994,1994,1994,1994,1994,1994,1994,2000,2000,2000,2002,2002,2002,2003,2003,2003,2003,1975,1975,1975,1975,1975,1975,1976,1976,1977,1977,1977,1977,1977,1977,1977,1977,1977,1978,1978,1978),
patent_number=c(5505081,5505081,5606110,5606110,5837890,5837890,5837890,5837890,5837890,5837890,6725912,6725912,6725912,6748800,6748800,6748800,7153136,6997049,6997049,6997049,4026555,4026555,4026555,4026555,4026555,4026555,4155095,4155095,4137556,4137556,4137556,4137556,4137556,4137556,4137556,4137556,4137556,4253157,4253157,4253157),
class=c("73","73","73","73","73","73","73","73","73","73","165","165","165","73","73","73","434","73","73","73","463","463","463","463","463","463","348","348","361","361","361","361","361","361","361","361","361","707","707","707"),
sub_class=c("147","12.07","12.08","147","116.03","184","185","186","198","210","144",
"147","140","147","E29.272","E29.309","59","147","E29.272","E29.309","3","168",
"473","960","552","31","701","593","91.2","72","58","59","111",
"222","93.05","117","709","104.1","93.25","552"),
sub_new_I = c(0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1)
)
> df
firm year patent_number class sub_class sub_new_I
1: A 1994 5505081 73 147 0
2: A 1994 5505081 73 12.07 0
3: A 1994 5606110 73 12.08 0
4: A 1994 5606110 73 147 0
5: A 1994 5837890 73 116.03 0
6: A 1994 5837890 73 184 0
7: A 1994 5837890 73 185 0
8: A 1994 5837890 73 186 0
9: A 1994 5837890 73 198 0
10: A 1994 5837890 73 210 0
11: A 2000 6725912 165 144 0
12: A 2000 6725912 165 147 1
13: A 2000 6725912 165 140 0
14: A 2002 6748800 73 147 1
15: A 2002 6748800 73 E29.272 0
16: A 2002 6748800 73 E29.309 0
17: A 2003 7153136 434 59 0
18: A 2003 6997049 73 147 1
19: A 2003 6997049 73 E29.272 1
20: A 2003 6997049 73 E29.309 1
21: B 1975 4026555 463 3 0
22: B 1975 4026555 463 168 0
23: B 1975 4026555 463 473 0
24: B 1975 4026555 463 960 0
25: B 1975 4026555 463 552 0
26: B 1975 4026555 463 31 0
27: B 1976 4155095 348 701 0
28: B 1976 4155095 348 593 0
29: B 1977 4137556 361 91.2 0
30: B 1977 4137556 361 72 0
31: B 1977 4137556 361 58 0
32: B 1977 4137556 361 59 0
33: B 1977 4137556 361 111 0
34: B 1977 4137556 361 222 0
35: B 1977 4137556 361 93.05 0
36: B 1977 4137556 361 117 0
37: B 1977 4137556 361 709 0
38: B 1978 4253157 707 104.1 0
39: B 1978 4253157 707 93.25 0
40: B 1978 4253157 707 552 1
The column sub_new_I
indicates whether a firm has previously filed a patent in that specific subclass (by a binary indicator). sub_new_I
列表示公司之前是否曾在该特定子类中提交过专利(通过二元指标)。
My goal is to apply a time discount to decrease the value of more distant history.我的目标是应用时间折扣来降低更远历史的价值。 The logic is that a subclass that appears 10 years ago is different from the one that was filed 2 years ago.
逻辑是 10 年前出现的子类与 2 年前提交的子类不同。 To do this, I want to rely on an exponential discount with a power less than 1, explained as follows: For example, Firm A has filed a patent (with
patent_number
= 6725912) in 2000 which includes subclass 147. The same subclass appears in year 1994 as well ( patent_number
=5606110).为此,我想依靠幂小于 1 的指数折扣,解释如下:例如,公司 A 在 2000 年申请了专利(专利号 =
patent_number
),其中包括子类 147。相同的子类出现在1994 年也是如此(专利号 = patent_number
)。 The indicator ( sub_class_I
) is therefore equal to 1 in year 2000. But applying the time discount, I want the indicator to be exp(-( (2000-1994)/5 ) )=0.301.因此,指标 (
sub_class_I
) 在 2000 年等于 1。但是应用时间折扣,我希望指标为 exp(-( (2000-1994)/5 ) )=0.301。 Essentially, the discount factor is given by this formula:本质上,折扣因子由以下公式给出:
Here, the constant for knowledge loss is equal to 5. (This means that the denominator is equal to 5 in all calculations).这里,知识损失的常数等于 5。(这意味着分母在所有计算中都等于 5)。
The desired output is, therefore:因此,所需的 output 是:
> df
firm year patent_number class sub_class sub_new_I discounted_I
1: A 1994 5505081 73 147 0 0.0000
2: A 1994 5505081 73 12.07 0 0.0000
3: A 1994 5606110 73 12.08 0 0.0000
4: A 1994 5606110 73 147 0 0.0000
5: A 1994 5837890 73 116.03 0 0.0000
6: A 1994 5837890 73 184 0 0.0000
7: A 1994 5837890 73 185 0 0.0000
8: A 1994 5837890 73 186 0 0.0000
9: A 1994 5837890 73 198 0 0.0000
10: A 1994 5837890 73 210 0 0.0000
11: A 2000 6725912 165 144 0 0.0000
12: A 2000 6725912 165 147 1 0.3012
13: A 2000 6725912 165 140 0 0.0000
14: A 2002 6748800 73 147 1 0.6703
15: A 2002 6748800 73 E29.272 0 0.0000
16: A 2002 6748800 73 E29.309 0 0.0000
17: A 2003 7153136 434 59 0 0.0000
18: A 2003 6997049 73 147 1 0.8187
19: A 2003 6997049 73 E29.272 1 0.8187
20: A 2003 6997049 73 E29.309 1 0.8187
21: B 1975 4026555 463 3 0 0.0000
22: B 1975 4026555 463 168 0 0.0000
23: B 1975 4026555 463 473 0 0.0000
24: B 1975 4026555 463 960 0 0.0000
25: B 1975 4026555 463 552 0 0.0000
26: B 1975 4026555 463 31 0 0.0000
27: B 1976 4155095 348 701 0 0.0000
28: B 1976 4155095 348 593 0 0.0000
29: B 1977 4137556 361 91.2 0 0.0000
30: B 1977 4137556 361 72 0 0.0000
31: B 1977 4137556 361 58 0 0.0000
32: B 1977 4137556 361 59 0 0.0000
33: B 1977 4137556 361 111 0 0.0000
34: B 1977 4137556 361 222 0 0.0000
35: B 1977 4137556 361 93.05 0 0.0000
36: B 1977 4137556 361 117 0 0.0000
37: B 1977 4137556 361 709 0 0.0000
38: B 1978 4253157 707 104.1 0 0.0000
39: B 1978 4253157 707 93.25 0 0.0000
40: B 1978 4253157 707 552 1 0.5488
My dataset is big, so a data.table
solution greatly appreciated.我的数据集很大,因此非常感谢
data.table
解决方案。
Thanks in advance for any help.提前感谢您的帮助。
Create a row index column on the dataset with .I
, subset the data for thos 'sub_class' where 'sub_new_I' is 1, get the unique
rows by 'firm', 'year', 'subclass', then grouped by 'firm', 'sub_class', create the discounted_I column by taking the lag
of negative exp
of difference between the next 'year' ( lead
) and the 'year' divided by 5 ('newdf').使用
.I
在数据集上创建一个行索引列,将 'sub_new_I' 为 1 的 thos 'sub_class' 的数据子集,按 'firm'、'year'、'subclass' 获取unique
行,然后按 'firm' 分组, 'sub_class',通过将下一个'year' ( lead
) 和'year' 之间的差的负exp
的lag
除以 5 ('newdf') 来创建 discounted_I 列。 Then, do a join with the original data on
'rn'然后,与“rn”
on
的原始数据进行连接
df[, rn := .I]
newdf <- unique(df[df[, sub_class %in% sub_class[sub_new_I == 1]]],
by = c('firm', 'year', 'sub_class'))[, discounted_I :=
shift(exp(-(shift(year, type = 'lead') -
year)/5)), .(firm, sub_class)]
df[newdf, discounted_I := discounted_I, on = .(rn)]
-output -输出
firm year patent_number class sub_class sub_new_I rn discounted_I
1: A 1994 5505081 73 147 0 1 NA
2: A 1994 5505081 73 12.07 0 2 NA
3: A 1994 5606110 73 12.08 0 3 NA
4: A 1994 5606110 73 147 0 4 NA
5: A 1994 5837890 73 116.03 0 5 NA
6: A 1994 5837890 73 184 0 6 NA
7: A 1994 5837890 73 185 0 7 NA
8: A 1994 5837890 73 186 0 8 NA
9: A 1994 5837890 73 198 0 9 NA
10: A 1994 5837890 73 210 0 10 NA
11: A 2000 6725912 165 144 0 11 NA
12: A 2000 6725912 165 147 1 12 0.3011942
13: A 2000 6725912 165 140 0 13 NA
14: A 2002 6748800 73 147 1 14 0.6703200
15: A 2002 6748800 73 E29.272 0 15 NA
16: A 2002 6748800 73 E29.309 0 16 NA
17: A 2003 7153136 434 59 0 17 NA
18: A 2003 6997049 73 147 1 18 0.8187308
19: A 2003 6997049 73 E29.272 1 19 0.8187308
20: A 2003 6997049 73 E29.309 1 20 0.8187308
21: B 1975 4026555 463 3 0 21 NA
22: B 1975 4026555 463 168 0 22 NA
23: B 1975 4026555 463 473 0 23 NA
24: B 1975 4026555 463 960 0 24 NA
25: B 1975 4026555 463 552 0 25 NA
26: B 1975 4026555 463 31 0 26 NA
27: B 1976 4155095 348 701 0 27 NA
28: B 1976 4155095 348 593 0 28 NA
29: B 1977 4137556 361 91.2 0 29 NA
30: B 1977 4137556 361 72 0 30 NA
31: B 1977 4137556 361 58 0 31 NA
32: B 1977 4137556 361 59 1 32 NA
33: B 1977 4137556 361 111 0 33 NA
34: B 1977 4137556 361 222 0 34 NA
35: B 1977 4137556 361 93.05 0 35 NA
36: B 1977 4137556 361 117 0 36 NA
37: B 1977 4137556 361 709 0 37 NA
38: B 1978 4253157 707 104.1 0 38 NA
39: B 1978 4253157 707 93.25 0 39 NA
40: B 1978 4253157 707 552 1 40 0.5488116
An alternative simple idea.另一种简单的想法。
discounter_I
with fifelse
fifelse
计算discounter_I
diff.year
diff.year
df[,diff.year := year - lag(year),by=.(firm,sub_class)]
df[, discounted_I := fifelse(sub_new_I == 1,exp(-diff.year/5),0),
by=.(firm,sub_class)]
df[,diff.year:=NULL]
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.