Matching on a variable, according to dates, and calculating ratios

I have a dataframe, let's call it df1, that looks something like this:

month            product_key          price
201408           00020e32-a64715      75
201408           00020e32-a64715      75
201408           000340b8-bacac8      20
201408           000458f1-fdb6ae      45
201408           00083ebb-e9c17f      250
201408           00207e67-15a59f      480
201408           002777d7-50bec1      12
201408           002777d7-50bec1      12
201409           00020e32-a64715      75
201409           000340b8-bacac8      20
201409           00083ebb-e9c17f      250
201409           00207e67-15a59f      480
201409           00207e67-15a59f      480
201409           00207e67-15a59f      480
201410           00083ebb-e9c17f      250
201410           00207e67-15a59f      480
201410           00207e67-15a59f      480
201410           0020baff-9730f0      39.99
201411           00083ebb-e9c17f      250
201411           00207e67-15a59f      480
201412           00083ebb-e9c17f      250
201501           00083ebb-e9c17f      200
201501           0020baff-9730f0      29.99

There are other variables in the dataset but we don't need them for this purpose. My dataset is monthly data and ranges from mid 2014 to late 2015. For each month there are hundreds of products, and the same product can appear multiple times within a month.

What I want to do is identify products that appear in both, say, August and September and remove the products that don't appear in both months. Then I want to calculate an average of prices, of the remaining products, for each month. Then I want to divide the average September price by the average August price. In my dataframe this calculated figure would be the September index (August is defaulted to 1 as this is where the dataset begins).

Then I would like to do the same for all the following months, so I would like to identify products that appear in both September and October, removing products that don't appear in both months, and calculate the average price (of the remaining products) for each month. Then I want to divide the average October price by the average September price (which will be different from the previously calculated September average price, as the products that appear in both September and October will most likely differ from the products that appear in both August and September). This calculated figure would be the October index. So I want to do this for all of the following months (October & November, November & December, December & January, January & February... and so on).

My resulting dataframe would ideally look something like this (using arbitrary numbers as the index):

month        index
201408       1
201409       1.0005      
201410       1.0152
201411       0.9997
201412       0.9551
201501       0.8985
201502       0.9754
201503       1.0045
201504       1.1520
201505       1.0148
201506       1.0452
201507       0.9945
201508       0.9751
201509       1.0004
201510       1.0415

When I have attempted this I end up matching products over the entire dataset and not over 2 consecutive months. I can do this by breaking the dataset down into separate datasets for each month, but this seems long and tedious. I am sure there is a quicker way to do this?

You can use the code below to create a test dataset:

month <- c("201408", "201408", "201408", "201408", "201408", "201408", "201408", "201408", "201409", "201409", "201409", "201409", "201409", "201409", "201410", "201410", "201410", "201410", "201411", "201411", "201412", "201501", "201501")
product_key <- c("00020e32-a64715", "00020e32-a64715", "000340b8-bacac8", "000458f1-fdb6ae", "00083ebb-e9c17f", "00083ebb-e9c17f", "002777d7-50bec1", "002777d7-50bec1", "00020e32-a64715", "000340b8-bacac8", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "00207e67-15a59f", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "0020baff-9730f0", "00083ebb-e9c17f", "00207e67-15a59f", "00083ebb-e9c17f", "00083ebb-e9c17f", "0020baff-9730f0")
price <- c("75", "75", "20", "45", "250", "480", "12", "12", "75", "20", "250", "480", "480", "480", "250", "480", "480", "39.99", "250", "480", "250", "200", "29.99")
df1 <- data.frame(month, product_key, price)

To give an example of how I want this to work, here is what I did to create the index for August and September.

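library(dplyr)

# average price per product in August
# (price is stored as text in the test data, so convert it before averaging)
DF1Aug <- df1 %>%
  filter(month == "201408") %>%
  group_by(product_key) %>%
  summarize(aveprice = mean(as.numeric(as.character(price))))

# average price per product in September
DF1Sept <- df1 %>%
  filter(month == "201409") %>%
  group_by(product_key) %>%
  summarize(aveprice = mean(as.numeric(as.character(price))))

# the inner join keeps only products present in both months;
# the index is the ratio of the two monthly averages
SeptPriceIndex <- merge(DF1Aug, DF1Sept, by = "product_key", suffixes = c("_Aug", "_Sept")) %>%
  mutate(AugAvgPrice = mean(aveprice_Aug),
         SeptAvgPrice = mean(aveprice_Sept),
         priceIndex = SeptAvgPrice / AugAvgPrice)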

However, this is obviously a tedious process to repeat for the roughly 20 months I have in the dataframe (and I need to do this on more than one dataframe), so I would like to find a way to automate it.

Something like the following could work using dplyr and tidyr (updated):

library(dplyr)
library(tidyr)

df1 %>%
  # ensure data is sorted so that months are sequential by product key:
  arrange(product_key, month) %>% 
  # ensure every product month combo exists:
  complete(product_key, month) %>%  
  # create a sequential id within each product:
  group_by(product_key) %>% 
  mutate(grp_seq = row_number()) %>% 
  # remove product / month pairs without a price:
  filter(!is.na(price)) %>%
  # remove product keys that appear in only one month:
  filter(n_distinct(month) > 1) %>% 
  # remove non-consecutive product / month pairs:
  filter(lead(grp_seq) - 1 == grp_seq | lag(grp_seq) + 1 == grp_seq) %>% 
  # summarize the average price by month:
  group_by(month) %>% 
  summarize(avg_price = mean(as.numeric(price))) %>%
  # calculate the price index:
  mutate(index_price = avg_price / lag(avg_price)) 

# A tibble: 6 x 3
  month  avg_price index_price
  <chr>      <dbl>       <dbl>
1 201408      180.      NA    
2 201409      298.       1.65 
3 201410      403.       1.36 
4 201411      365.       0.905
5 201412      250.       0.685
6 201501      200.       0.800  
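
For comparison, the question's own August/September steps can be generalised to every pair of adjacent months. The following is only a minimal base R sketch under stated assumptions: the data frame `toy`, the helper `pair_index()` and the result `index` are hypothetical names, the toy prices are made up, and the code assumes that every calendar month occurs in the data so that adjacent entries of `months_sorted` really are consecutive months.

```r
# made-up toy data (not the question's data); price is numeric here
toy <- data.frame(
  month       = c("201408", "201408", "201408", "201408",
                  "201409", "201409", "201409",
                  "201410", "201410"),
  product_key = c("A", "A", "B", "C",
                  "A", "B", "D",
                  "B", "D"),
  price       = c(10, 10, 20, 5,
                  12, 20, 100,
                  30, 100),
  stringsAsFactors = FALSE
)

# average price per product and month, as in the question's Aug/Sept code
avg <- aggregate(price ~ product_key + month, data = toy, FUN = mean)

# "YYYYMM" strings sort chronologically
months_sorted <- sort(unique(avg$month))

# ratio of average prices over the products present in both months
pair_index <- function(prev, curr) {
  a <- avg[avg$month == prev, ]
  b <- avg[avg$month == curr, ]
  both <- intersect(a$product_key, b$product_key)
  mean(b$price[b$product_key %in% both]) /
    mean(a$price[a$product_key %in% both])
}

# the first month has no predecessor and is set to 1
index <- data.frame(
  month = months_sorted,
  index = c(1, unname(mapply(pair_index,
                             head(months_sorted, -1),
                             tail(months_sorted, -1))))
)
index
```

Each index value compares a month only with the month immediately before it, so the September average used for the October index is recomputed over the products shared by September and October.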

The OP wants to get the price index for two subsequent months by computing an average of all recorded prices across all recurrent products and by dividing the average monthly prices.

It might be that this is what the OP intends, but I am not convinced that this is the correct approach:

  1. According to the OP, there can be the same product multiple times within the month. So, if one product has more recorded prices than other products, it will have a greater impact on the average monthly price and hence on the price index.
  2. Products with higher prices will dominate the average monthly price. So, price changes of cheaper products will be less visible in the price index.

Example

Here is a made-up example to explain what I mean. Let's assume we have two products. Product A is expensive and has two recorded prices in April, but there is no price change in May. Product B is cheap but its price has doubled in May. So, my expectation is that the price index will reflect this increase.

library(data.table)
example <- fread(
  "month   product_key price
  201704   A           90
  201704   A           110
  201704   B           1
  201705   A           100
  201705   B           2")

# OP's approach
example[, .(avg_price = mean(price)), by = month][
  , price_index := avg_price / shift(avg_price)][]
    month avg_price price_index
1: 201704        67          NA
2: 201705        51    0.761194

So, according to the OP's approach the price index has dropped.

I believe the correct approach is

  1. to compute the average monthly price for each product
  2. to compute the price index for each product in subsequent months
  3. to compute the average price index across products for each month

(I apologize for using data.table syntax, which I am more acquainted with. I have tried to use dplyr syntax but it took me too much time.)

# compute average monthly price for each product
tmp1 <- example[, .(avg_price = mean(price)), keyby = .(product_key, month)]
tmp1
   product_key  month avg_price
1:           A 201704       100
2:           A 201705       100
3:           B 201704         1
4:           B 201705         2
# compute price index for each product
tmp2 <- tmp1[, price_index := avg_price / shift(avg_price), by = product_key][]
tmp2
   product_key  month avg_price price_index
1:           A 201704       100          NA
2:           A 201705       100           1
3:           B 201704         1          NA
4:           B 201705         2           2
# compute average price index
tmp2[, .(avg_price_index = mean(price_index, na.rm = TRUE)), by = month]
    month avg_price_index
1: 201704             NaN
2: 201705             1.5

Now, the price index shows an increase, in line with my expectations (which might not be the OP's).
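
For readers who prefer dplyr, the same three steps might be written roughly as follows. This is a sketch rather than a tested equivalent of the data.table code: `example_df` and `index_by_month` are names chosen here for illustration, and the use of `lag()` is only safe because this two-month example has no missing months.

```r
library(dplyr)

# the same made-up prices as in the data.table example
example_df <- data.frame(
  month       = c(201704, 201704, 201704, 201705, 201705),
  product_key = c("A", "A", "B", "A", "B"),
  price       = c(90, 110, 1, 100, 2)
)

index_by_month <- example_df %>%
  # step 1: average monthly price for each product
  group_by(product_key, month) %>%
  summarise(avg_price = mean(price), .groups = "drop") %>%
  # step 2: price index for each product relative to its previous row
  arrange(product_key, month) %>%
  group_by(product_key) %>%
  mutate(price_index = avg_price / lag(avg_price)) %>%
  # step 3: average the per-product indices within each month
  group_by(month) %>%
  summarise(avg_price_index = mean(price_index, na.rm = TRUE))

index_by_month
```

The first month has no predecessor, so its average index is NaN, as in the data.table output.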

Compute price index for several months

The OP has requested to compute the price index for several months, but only for products which appear in subsequent months. This can be solved by a self join with shifted months.

Note that a simple lag() or shift() is dangerous here because it relies on row order and will fail if months are missing. Therefore, date arithmetic is used to find the correct subsequent month.

The self join approach has the additional benefit that only recurrent products are considered. If a product_key has no match in the subsequent month, avg_price will be NA. Those entries will be dropped when calculating the average price index.

library(data.table)
library(magrittr)
DF2 <- setDT(DF1)[
  # convert price from factor to numeric
  , price := price %>% as.character() %>% as.numeric()][
    # convert character month to Date
    , month := month %>% lubridate::ymd(truncated = 1L)][
      # compute average monthly price for each product
      , .(avg_price = mean(price)), keyby = .(product_key, month)]

# self join with subsequent month 
DF2[DF2[, .(product_key, month = month + months(1), avg_price)],
    on = .(product_key, month)][
      # compute price index for each product
      , price_index := avg_price / i.avg_price][
        # compute average price index
        , .(avg_price_index = mean(price_index, na.rm = TRUE)), by = month]
        month avg_price_index
1: 2014-09-01       0.8949772
2: 2014-10-01       1.0000000
3: 2014-11-01       1.0000000
4: 2014-12-01       1.0000000
5: 2015-01-01       0.8000000
6: 2015-02-01             NaN
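
The caveat about lag() and shift() can be made concrete on a small toy case. The base R sketch below is illustrative only: the data frame `prices`, the helper `next_month()` and the other object names are assumptions introduced here, and the prices are made up. Product B has no price in 201502, so a row-based lag compares its 201503 price with 201501, whereas a join on the shifted month drops that pair.

```r
# made-up per-product monthly averages; product B is missing in 201502
prices <- data.frame(
  product_key = c("A", "A", "A", "B", "B"),
  month       = c("201501", "201502", "201503", "201501", "201503"),
  avg_price   = c(100, 110, 121, 10, 20),
  stringsAsFactors = FALSE
)

# hypothetical helper: "YYYYMM" -> the following month as "YYYYMM"
next_month <- function(yyyymm) {
  n    <- as.integer(yyyymm)
  year <- n %/% 100L
  mon  <- n %% 100L
  as.character(ifelse(mon == 12L, (year + 1L) * 100L + 1L, n + 1L))
}

# naive approach: previous row within each product, regardless of gaps
prices <- prices[order(prices$product_key, prices$month), ]
naive_prev  <- ave(prices$avg_price, prices$product_key,
                   FUN = function(x) c(NA, head(x, -1)))
naive_index <- prices$avg_price / naive_prev   # B in 201503 gets 20 / 10

# join approach: relabel each price with the month it should be compared to
previous <- transform(prices, month = next_month(month))
names(previous)[names(previous) == "avg_price"] <- "prev_price"

# the inner join keeps only products observed in both consecutive months
paired <- merge(prices, previous, by = c("product_key", "month"))
paired$price_index <- paired$avg_price / paired$prev_price

joined_index <- aggregate(price_index ~ month, data = paired, FUN = mean)
joined_index
```

With the join, only product A contributes to 201502 and 201503, because B never appears in two consecutive months.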

Data

As provided by the OP

month <- c("201408", "201408", "201408", "201408", "201408", "201408", "201408", "201408", "201409", "201409", "201409", "201409", "201409", "201409", "201410", "201410", "201410", "201410", "201411", "201411", "201412", "201501", "201501")
product_key <- c("00020e32-a64715", "00020e32-a64715", "000340b8-bacac8", "000458f1-fdb6ae", "00083ebb-e9c17f", "00083ebb-e9c17f", "002777d7-50bec1", "002777d7-50bec1", "00020e32-a64715", "000340b8-bacac8", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "00207e67-15a59f", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "0020baff-9730f0", "00083ebb-e9c17f", "00207e67-15a59f", "00083ebb-e9c17f", "00083ebb-e9c17f", "0020baff-9730f0")
price <- c("75", "75", "20", "45", "250", "480", "12", "12", "75", "20", "250", "480", "480", "480", "250", "480", "480", "39.99", "250", "480", "250", "200", "29.99")
DF1 <- data.frame(month, product_key, price)

Note that all columns are factors (under R versions before 4.0.0, where data.frame() defaults to stringsAsFactors = TRUE).
