简体   繁体   中英

R - Extracting more than one row from a data frame based on an indicator in another column

I have a question about the extraction of multiple values from a data.frame in R based on an indicator

I have a data.frame that looks like this (df)

 ROW        COMPANY       PRICE      DATE          EVENT
  1         APPLE         1.50       Jan02           0
  2         APPLE         1.70       Feb02           1
  3         APPLE         1.65       Mar02           0
  4         APPLE         1.20       Apr02           0
  5         APPLE         1.30       May02           0
  6         APPLE         1.14       Jun02           0
  7         APPLE         1.10       Jul02           0
     .         .           .           .             .
     .         .           .           .             .
  349.997   MICROSOFT     0.80       Sep16           0
  349.998   MICROSOFT     0.65       Oct16           0
  349.999   MICROSOFT     1.10       Nov16           1
  350.000   MICROSOFT     0.90       Dez16           0

As you can see, i have a large data.frame containing various companies with their stock prices on given dates. Additionally i have an event column (only 0 and 1 as values). The Value 1 indicates that at the given date a specific event occured (eg shareholder meeting). Out of the 350.000 rows i have 2.500 events (that means Column Event has 2.500 ones and 347.500 zeros).

Now my goal is to analyze stock prices around specific events (eg analyze the stock prices 10 months before and 15 months after the event). Now to how i proceeded and where i am currently stuck.

First i have to split my data.frame based on my companies, because i need to get NAs if iam outside of my obervation period (2002-2016). eg if apple has an event in nov16 and i need to get the price 2 months after that, i should get an NA (because it is outside of my observation period), but in the unsplited data.frame i would get the price of the next companie from Jan02.

list<-split(df, f=df$COMPANY)

Now the part where i am stuck. i need to extract the 10 prices before and 15 prices after a event day for each company

The output i am trying to create would look like (Note: "?" = these values exist but they are not shown in the example df above)

     Event 1 (Apple)              Event 2500   (Microsoft)
-10      NA               ...         ?
 -9      NA               ...         ?
  .      .
  0     1.70              ...        1.10
  .      .
+15      ?                ...         NA

Sorry it is really hard to proper explain my problem without going to much into detail, but i hope that i could it made clear so some degree.

Thanks for the help :)

This can be accomplished with dplyr and tidyr packages, although it is a bit involved. Here is a gist on a much smaller dataset:

library(dplyr)
library(tidyr)
df <- readr::read_csv("COMPANY,PRICE,DATE,EVENT
APPLE,1.50,2002/01/01,0
APPLE,1.70,2002/02/01,1
APPLE,1.65,2002/03/01,0
APPLE,1.20,2002/04/01,0
MICROSOFT,2.50,2002/01/01,0
MICROSOFT,2.70,2002/02/01,0
MICROSOFT,2.65,2002/02/01,1
MICROSOFT,2.20,2002/03/01,0")
df
# A tibble: 8 x 4
COMPANY PRICE       DATE EVENT
<chr> <dbl>     <date> <int>
1     APPLE  1.50 2002-01-01     0
2     APPLE  1.70 2002-02-01     1
3     APPLE  1.65 2002-03-01     0
4     APPLE  1.20 2002-04-01     0
5 MICROSOFT  2.50 2002-01-01     0
6 MICROSOFT  2.70 2002-02-01     0
7 MICROSOFT  2.65 2002-02-01     1
8 MICROSOFT  2.20 2002-03-01     0

First, we need to construct some lags and leads. You will have to add more columns here if you want more pre/post event days.

with_lags <- df %>% 
  group_by(COMPANY) %>% 
  mutate(
    lag_01    = lag(PRICE,  n = 1, order_by = DATE)
    , lag_02  = lag(PRICE,  n = 2, order_by = DATE)
    , lag_00  = lag(PRICE,  n = 0, order_by = DATE)
    , lead_01 = lead(PRICE, n = 1, order_by = DATE)
    , lead_02 = lead(PRICE, n = 2, order_by = DATE)
  )
with_lags
# A tibble: 8 x 9
# Groups:   COMPANY [2]
COMPANY PRICE       DATE EVENT lag_01 lag_02 lag_00 lead_01 lead_02
<chr> <dbl>     <date> <int>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
1     APPLE  1.50 2002-01-01     0     NA     NA   1.50    1.70    1.65
2     APPLE  1.70 2002-02-01     1   1.50     NA   1.70    1.65    1.20
3     APPLE  1.65 2002-03-01     0   1.70    1.5   1.65    1.20      NA
4     APPLE  1.20 2002-04-01     0   1.65    1.7   1.20      NA      NA
5 MICROSOFT  2.50 2002-01-01     0     NA     NA   2.50    2.70    2.65
6 MICROSOFT  2.70 2002-02-01     0   2.50     NA   2.70    2.65    2.20
7 MICROSOFT  2.65 2002-02-01     1   2.70    2.5   2.65    2.20      NA
8 MICROSOFT  2.20 2002-03-01     0   2.65    2.7   2.20      NA      NA

Now we just keep rows where EVENT is 1, and reshuffle the data back into the long form. Note that you would have to edit the line that calls gather() function to reflect the list of lag/lead columns you constructed above:

long_form <- with_lags %>%
  filter(EVENT == 1) %>% 
  select(-PRICE, -EVENT, -DATE) %>% 
  gather(period, price, lag_01:lead_02) %>% 
  separate(period, c("lag_or_lead", "lag_order")) %>% 
  mutate(
    lag_order = ifelse(lag_or_lead == "lag", 
                       -1 * as.numeric(lag_order),
                       as.numeric(lag_order)) 
  ) %>% 
  select(-lag_or_lead) %>% 
  arrange(COMPANY, lag_order)
long_form
# A tibble: 10 x 3
# Groups:   COMPANY [2]
COMPANY lag_order price
<chr>     <dbl> <dbl>
1      APPLE        -2    NA
2      APPLE        -1  1.50
3      APPLE         0  1.70
4      APPLE         1  1.65
5      APPLE         2  1.20
6  MICROSOFT        -2  2.50
7  MICROSOFT        -1  2.70
8  MICROSOFT         0  2.65
9  MICROSOFT         1  2.20
10 MICROSOFT         2    NA

If you need this in wide form, you can then use spread() from tidyr package to move companies into columns.

I may be shot down for suggesting (shock horror) a loop to do this in base R, but IMHO code that is simple to understand and edit is often a preferable option to more concise but less comprehensible programming. With only 2500 events, I think it should be more than quick enough. It would be interesting if you could compare the speed of solutions with your real data?

set.seed(0)
SP <- data.frame(Company = c(rep_len("Apple", 50), 
                             rep_len("Microsoft", 50)),
                 Price = round(runif(100, 1, 2), 2),
                 Date = rep(seq.Date(from = as.Date("2002-01-01"), 
                                   length.out = 50, by = "month"),
                                    2),
                 Event = rbinom(100, 1, 0.05),
                 stringsAsFactors = FALSE)

Event <- which(SP$Event %in% 1)
resultFrame <- data.frame(Period = (-10):15)
for (i in Event){
  Stock <- SP$Company[i]
  eventTime <- format(SP$Date[i], "%b-%Y")
  stockWin <- (i - 10):(i + 15)
  stockWin[stockWin <= 0 | stockWin > nrow(SP)] <- NA
  stockWin[!(SP$Company[stockWin] %in% Stock)] <- NA
  priceWin <- SP[stockWin, "Price"]
  eventName <- paste("Event", eventTime, Stock, sep=".")
  resultFrame <- data.frame(resultFrame, priceWin)
  names(resultFrame)[ncol(resultFrame)] <- eventName
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM