简体   繁体   English

基于之前出现在 r 中的折扣值

[英]Discounting values based on previous appearance in r

I have the following (sample) dataset:我有以下(示例)数据集:

df <- data.table(
  firm = rep(c("A","B"),each=20),
  year = c(1994,1994,1994,1994,1994,1994,1994,1994,1994,1994,2000,2000,2000,2002,2002,2002,2003,2003,2003,2003,1975,1975,1975,1975,1975,1975,1976,1976,1977,1977,1977,1977,1977,1977,1977,1977,1977,1978,1978,1978),
  patent_number=c(5505081,5505081,5606110,5606110,5837890,5837890,5837890,5837890,5837890,5837890,6725912,6725912,6725912,6748800,6748800,6748800,7153136,6997049,6997049,6997049,4026555,4026555,4026555,4026555,4026555,4026555,4155095,4155095,4137556,4137556,4137556,4137556,4137556,4137556,4137556,4137556,4137556,4253157,4253157,4253157),
  class=c("73","73","73","73","73","73","73","73","73","73","165","165","165","73","73","73","434","73","73","73","463","463","463","463","463","463","348","348","361","361","361","361","361","361","361","361","361","707","707","707"),
  sub_class=c("147","12.07","12.08","147","116.03","184","185","186","198","210","144",
                             "147","140","147","E29.272","E29.309","59","147","E29.272","E29.309","3","168",
                             "473","960","552","31","701","593","91.2","72","58","59","111",
                             "222","93.05","117","709","104.1","93.25","552"),
  sub_new_I = c(0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1)
  )
> df
   firm year patent_number class sub_class sub_new_I
 1:    A 1994       5505081    73       147         0
 2:    A 1994       5505081    73     12.07         0
 3:    A 1994       5606110    73     12.08         0
 4:    A 1994       5606110    73       147         0
 5:    A 1994       5837890    73    116.03         0
 6:    A 1994       5837890    73       184         0
 7:    A 1994       5837890    73       185         0
 8:    A 1994       5837890    73       186         0
 9:    A 1994       5837890    73       198         0
10:    A 1994       5837890    73       210         0
11:    A 2000       6725912   165       144         0
12:    A 2000       6725912   165       147         1
13:    A 2000       6725912   165       140         0
14:    A 2002       6748800    73       147         1
15:    A 2002       6748800    73   E29.272         0
16:    A 2002       6748800    73   E29.309         0
17:    A 2003       7153136   434        59         0
18:    A 2003       6997049    73       147         1
19:    A 2003       6997049    73   E29.272         1
20:    A 2003       6997049    73   E29.309         1
21:    B 1975       4026555   463         3         0
22:    B 1975       4026555   463       168         0
23:    B 1975       4026555   463       473         0
24:    B 1975       4026555   463       960         0
25:    B 1975       4026555   463       552         0
26:    B 1975       4026555   463        31         0
27:    B 1976       4155095   348       701         0
28:    B 1976       4155095   348       593         0
29:    B 1977       4137556   361      91.2         0
30:    B 1977       4137556   361        72         0
31:    B 1977       4137556   361        58         0
32:    B 1977       4137556   361        59         0
33:    B 1977       4137556   361       111         0
34:    B 1977       4137556   361       222         0
35:    B 1977       4137556   361     93.05         0
36:    B 1977       4137556   361       117         0
37:    B 1977       4137556   361       709         0
38:    B 1978       4253157   707     104.1         0
39:    B 1978       4253157   707     93.25         0
40:    B 1978       4253157   707       552         1

The column sub_new_I indicates whether a firm has previously filed a patent in that specific subclass (by a binary indicator). sub_new_I列表示公司之前是否曾在该特定子类中提交过专利(通过二元指标)。

My goal is to apply a time discount to decrease the value of more distant history.我的目标是应用时间折扣来降低更远历史的价值。 The logic is that a subclass that appears 10 years ago is different from the one that was filed 2 years ago.逻辑是 10 年前出现的子类与 2 年前提交的子类不同。 To do this, I want to rely on an exponential discount with a power less than 1, explained as follows: For example, Firm A has filed a patent (with patent_number = 6725912) in 2000 which includes subclass 147. The same subclass appears in year 1994 as well ( patent_number =5606110).为此,我想依靠幂小于 1 的指数折扣,解释如下:例如,公司 A 在 2000 年申请了专利(专利号 = patent_number ),其中包括子类 147。相同的子类出现在1994 年也是如此(专利号 = patent_number )。 The indicator ( sub_class_I ) is therefore equal to 1 in year 2000. But applying the time discount, I want the indicator to be exp(-( (2000-1994)/5 ) )=0.301.因此,指标 ( sub_class_I ) 在 2000 年等于 1。但是应用时间折扣,我希望指标为 exp(-( (2000-1994)/5 ) )=0.301。 Essentially, the discount factor is given by this formula:本质上,折扣因子由以下公式给出:

公式

Here, the constant for knowledge loss is equal to 5. (This means that the denominator is equal to 5 in all calculations).这里,知识损失的常数等于 5。(这意味着分母在所有计算中都等于 5)。

The desired output is, therefore:因此,所需的 output 是:

> df
    firm year patent_number class sub_class sub_new_I discounted_I
 1:    A 1994       5505081    73       147         0       0.0000
 2:    A 1994       5505081    73     12.07         0       0.0000
 3:    A 1994       5606110    73     12.08         0       0.0000
 4:    A 1994       5606110    73       147         0       0.0000
 5:    A 1994       5837890    73    116.03         0       0.0000
 6:    A 1994       5837890    73       184         0       0.0000
 7:    A 1994       5837890    73       185         0       0.0000
 8:    A 1994       5837890    73       186         0       0.0000
 9:    A 1994       5837890    73       198         0       0.0000
10:    A 1994       5837890    73       210         0       0.0000
11:    A 2000       6725912   165       144         0       0.0000
12:    A 2000       6725912   165       147         1       0.3012
13:    A 2000       6725912   165       140         0       0.0000
14:    A 2002       6748800    73       147         1       0.6703
15:    A 2002       6748800    73   E29.272         0       0.0000
16:    A 2002       6748800    73   E29.309         0       0.0000
17:    A 2003       7153136   434        59         0       0.0000
18:    A 2003       6997049    73       147         1       0.8187
19:    A 2003       6997049    73   E29.272         1       0.8187
20:    A 2003       6997049    73   E29.309         1       0.8187
21:    B 1975       4026555   463         3         0       0.0000
22:    B 1975       4026555   463       168         0       0.0000
23:    B 1975       4026555   463       473         0       0.0000
24:    B 1975       4026555   463       960         0       0.0000
25:    B 1975       4026555   463       552         0       0.0000
26:    B 1975       4026555   463        31         0       0.0000
27:    B 1976       4155095   348       701         0       0.0000
28:    B 1976       4155095   348       593         0       0.0000
29:    B 1977       4137556   361      91.2         0       0.0000
30:    B 1977       4137556   361        72         0       0.0000
31:    B 1977       4137556   361        58         0       0.0000
32:    B 1977       4137556   361        59         0       0.0000
33:    B 1977       4137556   361       111         0       0.0000
34:    B 1977       4137556   361       222         0       0.0000
35:    B 1977       4137556   361     93.05         0       0.0000
36:    B 1977       4137556   361       117         0       0.0000
37:    B 1977       4137556   361       709         0       0.0000
38:    B 1978       4253157   707     104.1         0       0.0000
39:    B 1978       4253157   707     93.25         0       0.0000
40:    B 1978       4253157   707       552         1       0.5488

My dataset is big, so a data.table solution greatly appreciated.我的数据集很大,因此非常感谢data.table解决方案。

Thanks in advance for any help.提前感谢您的帮助。

Create a row index column on the dataset with .I , subset the data for thos 'sub_class' where 'sub_new_I' is 1, get the unique rows by 'firm', 'year', 'subclass', then grouped by 'firm', 'sub_class', create the discounted_I column by taking the lag of negative exp of difference between the next 'year' ( lead ) and the 'year' divided by 5 ('newdf').使用.I在数据集上创建一个行索引列,将 'sub_new_I' 为 1 的 thos 'sub_class' 的数据子集,按 'firm'、'year'、'subclass' 获取unique行,然后按 'firm' 分组, 'sub_class',通过将下一个'year' ( lead ) 和'year' 之间的差的负explag除以 5 ('newdf') 来创建 discounted_I 列。 Then, do a join with the original data on 'rn'然后,与“rn” on的原始数据进行连接

df[, rn := .I]
newdf <- unique(df[df[, sub_class %in% sub_class[sub_new_I == 1]]],
     by = c('firm', 'year', 'sub_class'))[,  discounted_I := 
         shift(exp(-(shift(year, type = 'lead') - 
           year)/5)), .(firm, sub_class)]
df[newdf, discounted_I := discounted_I, on = .(rn)]

-output -输出

   firm year patent_number class sub_class sub_new_I rn discounted_I
 1:    A 1994       5505081    73       147         0  1           NA
 2:    A 1994       5505081    73     12.07         0  2           NA
 3:    A 1994       5606110    73     12.08         0  3           NA
 4:    A 1994       5606110    73       147         0  4           NA
 5:    A 1994       5837890    73    116.03         0  5           NA
 6:    A 1994       5837890    73       184         0  6           NA
 7:    A 1994       5837890    73       185         0  7           NA
 8:    A 1994       5837890    73       186         0  8           NA
 9:    A 1994       5837890    73       198         0  9           NA
10:    A 1994       5837890    73       210         0 10           NA
11:    A 2000       6725912   165       144         0 11           NA
12:    A 2000       6725912   165       147         1 12    0.3011942
13:    A 2000       6725912   165       140         0 13           NA
14:    A 2002       6748800    73       147         1 14    0.6703200
15:    A 2002       6748800    73   E29.272         0 15           NA
16:    A 2002       6748800    73   E29.309         0 16           NA
17:    A 2003       7153136   434        59         0 17           NA
18:    A 2003       6997049    73       147         1 18    0.8187308
19:    A 2003       6997049    73   E29.272         1 19    0.8187308
20:    A 2003       6997049    73   E29.309         1 20    0.8187308
21:    B 1975       4026555   463         3         0 21           NA
22:    B 1975       4026555   463       168         0 22           NA
23:    B 1975       4026555   463       473         0 23           NA
24:    B 1975       4026555   463       960         0 24           NA
25:    B 1975       4026555   463       552         0 25           NA
26:    B 1975       4026555   463        31         0 26           NA
27:    B 1976       4155095   348       701         0 27           NA
28:    B 1976       4155095   348       593         0 28           NA
29:    B 1977       4137556   361      91.2         0 29           NA
30:    B 1977       4137556   361        72         0 30           NA
31:    B 1977       4137556   361        58         0 31           NA
32:    B 1977       4137556   361        59         1 32           NA
33:    B 1977       4137556   361       111         0 33           NA
34:    B 1977       4137556   361       222         0 34           NA
35:    B 1977       4137556   361     93.05         0 35           NA
36:    B 1977       4137556   361       117         0 36           NA
37:    B 1977       4137556   361       709         0 37           NA
38:    B 1978       4253157   707     104.1         0 38           NA
39:    B 1978       4253157   707     93.25         0 39           NA
40:    B 1978       4253157   707       552         1 40    0.5488116

An alternative simple idea.另一种简单的想法。

  1. group by firm and sub_class to claculate the year difference with previous appearance按公司和子类分组以计算与以前外观的年份差异
  2. calculate discounter_I with fifelsefifelse计算discounter_I
  3. delete temporary diff.year删除临时diff.year
df[,diff.year := year - lag(year),by=.(firm,sub_class)]
df[, discounted_I := fifelse(sub_new_I == 1,exp(-diff.year/5),0),
   by=.(firm,sub_class)]
df[,diff.year:=NULL]

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM