R - Extracting more than one row from a data frame based on an indicator in another column

I have a question about extracting multiple values from a data.frame in R based on an indicator.

I have a data.frame (df) that looks like this:

 ROW        COMPANY       PRICE      DATE          EVENT
  1         APPLE         1.50       Jan02           0
  2         APPLE         1.70       Feb02           1
  3         APPLE         1.65       Mar02           0
  4         APPLE         1.20       Apr02           0
  5         APPLE         1.30       May02           0
  6         APPLE         1.14       Jun02           0
  7         APPLE         1.10       Jul02           0
     .         .           .           .             .
     .         .           .           .             .
  349,997   MICROSOFT     0.80       Sep16           0
  349,998   MICROSOFT     0.65       Oct16           0
  349,999   MICROSOFT     1.10       Nov16           1
  350,000   MICROSOFT     0.90       Dec16           0

As you can see, I have a large data.frame containing various companies with their stock prices on given dates. Additionally, I have an event column (with only 0 and 1 as values). A value of 1 indicates that a specific event (e.g. a shareholder meeting) occurred on the given date. Out of the 350,000 rows, 2,500 are events (that is, the EVENT column has 2,500 ones and 347,500 zeros).

Now my goal is to analyze stock prices around specific events (e.g. the prices 10 months before and 15 months after the event). Here is how I proceeded and where I am currently stuck.

First, I have to split my data.frame by company, because I need to get NAs whenever I am outside of my observation period (2002-2016). E.g. if Apple has an event in Nov16 and I need the price 2 months after that, I should get an NA (because it is outside of my observation period), but in the unsplit data.frame I would get the price of the next company from Jan02.

# One data.frame per company (named by_company to avoid masking base::list)
by_company <- split(df, f = df$COMPANY)
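To make the reason for splitting concrete, here is a minimal sketch on made-up toy data (the values and the names `leak`, `safe`, and `event_row` are illustrative assumptions, not part of the real data set). It shows the same one-step offset leaking into the next company in the unsplit data, and returning NA after the split.

```r
# Toy data: two companies, three months each (values are made up).
df <- data.frame(
  COMPANY = rep(c("APPLE", "MICROSOFT"), each = 3),
  PRICE   = c(1.50, 1.70, 1.65, 0.80, 0.65, 1.10),
  EVENT   = c(0, 0, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Unsplit: one row after APPLE's event (row 3) is MICROSOFT's first price.
leak <- df$PRICE[3 + 1]

# Split: each element holds a single company, so the same offset
# runs past the end of the vector and yields NA.
by_company <- split(df, f = df$COMPANY)
apple      <- by_company[["APPLE"]]
event_row  <- which(apple$EVENT == 1)
safe       <- apple$PRICE[event_row + 1]

print(leak)  # price taken from the wrong company
print(safe)  # NA
```

Note that this only covers offsets past the end of a vector. Indices of zero or below do not return NA in R (a zero index is dropped and negative indices exclude elements), so offsets before the first observation would need to be set to NA explicitly.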

Now the part where I am stuck: I need to extract the 10 prices before and the 15 prices after an event day for each company.

The output I am trying to create would look like this (note: "?" = these values exist but are not shown in the example df above):

     Event 1 (Apple)              Event 2500   (Microsoft)
-10      NA               ...         ?
 -9      NA               ...         ?
  .      .
  0     1.70              ...        1.10
  .      .
+15      ?                ...         NA

Sorry, it is really hard to properly explain my problem without going too much into detail, but I hope I could make it clear to some degree.

Thanks for the help :)

This can be accomplished with the dplyr and tidyr packages, although it is a bit involved. Here is the gist on a much smaller dataset:

library(dplyr)
library(tidyr)
df <- readr::read_csv("COMPANY,PRICE,DATE,EVENT
APPLE,1.50,2002/01/01,0
APPLE,1.70,2002/02/01,1
APPLE,1.65,2002/03/01,0
APPLE,1.20,2002/04/01,0
MICROSOFT,2.50,2002/01/01,0
MICROSOFT,2.70,2002/02/01,0
MICROSOFT,2.65,2002/02/01,1
MICROSOFT,2.20,2002/03/01,0")
df
# A tibble: 8 x 4
COMPANY PRICE       DATE EVENT
<chr> <dbl>     <date> <int>
1     APPLE  1.50 2002-01-01     0
2     APPLE  1.70 2002-02-01     1
3     APPLE  1.65 2002-03-01     0
4     APPLE  1.20 2002-04-01     0
5 MICROSOFT  2.50 2002-01-01     0
6 MICROSOFT  2.70 2002-02-01     0
7 MICROSOFT  2.65 2002-02-01     1
8 MICROSOFT  2.20 2002-03-01     0

First, we need to construct some lags and leads. You will have to add more columns here if you want more pre/post-event periods.

with_lags <- df %>% 
  group_by(COMPANY) %>% 
  mutate(
    lag_01    = lag(PRICE,  n = 1, order_by = DATE)
    , lag_02  = lag(PRICE,  n = 2, order_by = DATE)
    , lag_00  = lag(PRICE,  n = 0, order_by = DATE)
    , lead_01 = lead(PRICE, n = 1, order_by = DATE)
    , lead_02 = lead(PRICE, n = 2, order_by = DATE)
  )
with_lags
# A tibble: 8 x 9
# Groups:   COMPANY [2]
COMPANY PRICE       DATE EVENT lag_01 lag_02 lag_00 lead_01 lead_02
<chr> <dbl>     <date> <int>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
1     APPLE  1.50 2002-01-01     0     NA     NA   1.50    1.70    1.65
2     APPLE  1.70 2002-02-01     1   1.50     NA   1.70    1.65    1.20
3     APPLE  1.65 2002-03-01     0   1.70    1.5   1.65    1.20      NA
4     APPLE  1.20 2002-04-01     0   1.65    1.7   1.20      NA      NA
5 MICROSOFT  2.50 2002-01-01     0     NA     NA   2.50    2.70    2.65
6 MICROSOFT  2.70 2002-02-01     0   2.50     NA   2.70    2.65    2.20
7 MICROSOFT  2.65 2002-02-01     1   2.70    2.5   2.65    2.20      NA
8 MICROSOFT  2.20 2002-03-01     0   2.65    2.7   2.20      NA      NA
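Writing out one lag or lead column per period gets tedious for the -10 to +15 window in the question. As a rough sketch of an alternative (not the answer's own method), the columns can be generated from a vector of offsets in base R. The helper `shift_by()` and the names `offsets`, `shifted`, and `events` are hypothetical, and the sketch assumes the rows are already sorted by date within each company.

```r
# Hypothetical helper: shift a vector by k positions.
# k < 0 looks back (lag), k > 0 looks ahead (lead);
# positions that fall outside the vector become NA.
shift_by <- function(x, k) {
  idx <- seq_along(x) + k
  idx[idx < 1 | idx > length(x)] <- NA
  x[idx]
}

# Same prices as the small example, assumed sorted by date per company.
df <- data.frame(
  COMPANY = rep(c("APPLE", "MICROSOFT"), each = 4),
  PRICE   = c(1.50, 1.70, 1.65, 1.20, 2.50, 2.70, 2.65, 2.20),
  EVENT   = c(0, 1, 0, 0, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)

offsets <- -2:2  # use -10:15 for the window described in the question

# One matrix column per offset; ave() applies the shift within each company.
shifted <- sapply(offsets, function(k) {
  ave(df$PRICE, df$COMPANY, FUN = function(p) shift_by(p, k))
})
colnames(shifted) <- as.character(offsets)

# Keep only the event rows and label them by company.
events <- shifted[df$EVENT == 1, , drop = FALSE]
rownames(events) <- df$COMPANY[df$EVENT == 1]

# Transposed, periods run down the rows and events across the columns.
print(t(events))
```

Changing the window then only means editing `offsets`. If a company has more than one event, the row labels would need something extra (such as the event date) to stay unique.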

Now we just keep the rows where EVENT is 1 and reshape the data back into long form. Note that you would have to edit the line that calls the gather() function to reflect the list of lag/lead columns you constructed above:

long_form <- with_lags %>%
  filter(EVENT == 1) %>% 
  select(-PRICE, -EVENT, -DATE) %>% 
  gather(period, price, lag_01:lead_02) %>% 
  separate(period, c("lag_or_lead", "lag_order")) %>% 
  mutate(
    lag_order = ifelse(lag_or_lead == "lag", 
                       -1 * as.numeric(lag_order),
                       as.numeric(lag_order)) 
  ) %>% 
  select(-lag_or_lead) %>% 
  arrange(COMPANY, lag_order)
long_form
# A tibble: 10 x 3
# Groups:   COMPANY [2]
COMPANY lag_order price
<chr>     <dbl> <dbl>
1      APPLE        -2    NA
2      APPLE        -1  1.50
3      APPLE         0  1.70
4      APPLE         1  1.65
5      APPLE         2  1.20
6  MICROSOFT        -2  2.50
7  MICROSOFT        -1  2.70
8  MICROSOFT         0  2.65
9  MICROSOFT         1  2.20
10 MICROSOFT         2    NA

If you need this in wide form, you can then use spread() from the tidyr package to move companies into columns.

I may be shot down for suggesting (shock horror) a loop to do this in base R, but IMHO code that is simple to understand and edit is often preferable to more concise but less comprehensible programming. With only 2500 events, I think it should be more than quick enough. It would be interesting if you could compare the speed of the solutions on your real data.
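As a small sketch of that last step, the long-form result can be rebuilt by hand as a plain data.frame and passed to spread(). The name `wide_form` is an assumption for illustration, and the values are copied from the long-form output shown above.

```r
library(tidyr)

# Long-form result from the small example, rebuilt as a plain data.frame.
long_form <- data.frame(
  COMPANY   = rep(c("APPLE", "MICROSOFT"), each = 5),
  lag_order = rep(-2:2, times = 2),
  price     = c(NA, 1.50, 1.70, 1.65, 1.20,
                2.50, 2.70, 2.65, 2.20, NA),
  stringsAsFactors = FALSE
)

# One row per period, one column per company.
wide_form <- spread(long_form, key = COMPANY, value = price)
print(wide_form)
```

If `long_form` is still grouped by COMPANY, as in the pipeline above, calling dplyr::ungroup() first may be needed. With several events per company, the rows would also need an event identifier column, since spread() requires each combination of remaining columns and key to be unique.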

set.seed(0)
SP <- data.frame(Company = c(rep_len("Apple", 50), 
                             rep_len("Microsoft", 50)),
                 Price = round(runif(100, 1, 2), 2),
                 Date = rep(seq.Date(from = as.Date("2002-01-01"), 
                                   length.out = 50, by = "month"),
                                    2),
                 Event = rbinom(100, 1, 0.05),
                 stringsAsFactors = FALSE)

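# Row indices of all events
Event <- which(SP$Event %in% 1)
# One row per period relative to the event (-10 ... +15)
resultFrame <- data.frame(Period = (-10):15)
for (i in Event){
  Stock <- SP$Company[i]
  eventTime <- format(SP$Date[i], "%b-%Y")
  # Candidate rows: 10 before to 15 after the event row
  stockWin <- (i - 10):(i + 15)
  # Rows outside the data frame become NA
  stockWin[stockWin <= 0 | stockWin > nrow(SP)] <- NA
  # Rows belonging to a different company become NA
  stockWin[!(SP$Company[stockWin] %in% Stock)] <- NA
  priceWin <- SP[stockWin, "Price"]
  eventName <- paste("Event", eventTime, Stock, sep=".")
  # Append the price window as a new, named column
  resultFrame <- data.frame(resultFrame, priceWin)
  names(resultFrame)[ncol(resultFrame)] <- eventName
}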
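The two masking lines inside the loop do most of the work, so here is a tiny trace of them for a single event. This is a sketch on made-up vectors (`company`, `price`, `win`, and `price_win` are illustrative names) with a shortened window of 3 rows before and 2 rows after.

```r
# Made-up data: three rows for company "A", then three for "B".
company <- c("A", "A", "A", "B", "B", "B")
price   <- c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3)

i <- 3                    # event on the last row of company "A"
win <- (i - 3):(i + 2)    # candidate rows: 0 1 2 3 4 5

# Row 0 does not exist, so it becomes NA.
win[win <= 0 | win > length(price)] <- NA

# Rows 4 and 5 belong to company "B", so they become NA as well.
# (company[NA] is NA, and NA %in% "A" is FALSE, so existing NAs stay NA.)
win[!(company[win] %in% company[i])] <- NA

# Indexing with NA returns NA, which pads the price window.
price_win <- price[win]
print(price_win)
```

Because the company check masks rows from other companies, the loop can work on the unsplit data frame and still produce NAs at the edges of each company's observation period.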

