
R - Extract multiple rows from column 1 if certain value appears in column 2

I have a question about the extraction of multiple values from a data.frame in R and putting them into a new data.frame.

I have a data.frame (df) that looks like this:

PRICE     EVENT
1.50        0
1.70        0
1.65        0
1.20        1
0.90        0
1.70        0
1.55        0 
  .         .
  .         .
1.10        0
1.20        0
1.14        1
0.90        0

My actual data.frame has these two columns and over 300,000 rows. The column called EVENT only has the values 0 or 1 (the value 1 is a proxy indicating that a certain event occurred).

First step of my research: analyze the price when the event occurs. This first step is an easy one. I did it with

vector<-df[df$EVENT==1, "PRICE"]

Now vector contains all the prices for the event days (here: 1.20 and 1.14).

The second step of my research is where it gets interesting:

Now I want not only the prices for the event day, but also the prices for x days before and after the event day, collected in a matrix.

For example: I want the prices from two days before the event to one day after the event (including the event day).

Then the new data.frame I am trying to create would look like this:

    Event 1               Event n
-2   1.70        ...        1.10
-1   1.65        ...        1.20
 0   1.20        ...        1.14
+1   0.90        ...        0.90

Please keep in mind that the 4-day span [-2:1] is only an example. In my actual research I have to cover a 91-day span [-30:60].

Thanks for the help :)

We can create a matrix that contains the relevant row numbers, and then use that as a mask to arrive at your expected output:

event_rows <- which(df$EVENT==1)
mask <- sapply(event_rows, function(x) (x-2):(x+1))
apply(mask, 2, function(x) df$PRICE[x])
#     [,1] [,2]
#[1,] 1.70 1.10
#[2,] 1.65 1.20
#[3,] 1.20 1.14
#[4,] 0.90 0.90

Data

df <- structure(list(PRICE = c(1.5, 1.7, 1.65, 1.2, 0.9, 1.7, 1.55, 
1.1, 1.2, 1.14, 0.9), EVENT = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 
0L, 1L, 0L)), .Names = c("PRICE", "EVENT"), class = "data.frame", row.names = c(NA, 
-11L))
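
For the 91-day span [-30:60] mentioned in the question, the same index-matrix idea can be wrapped in a small function. The following is only a sketch under a few assumptions: extract_windows() and its before/after arguments are hypothetical names made up for illustration, and windows that run past the first or last row are padded with NA rather than dropped.

```r
# Example data from the answer above
df <- data.frame(
  PRICE = c(1.5, 1.7, 1.65, 1.2, 0.9, 1.7, 1.55, 1.1, 1.2, 1.14, 0.9),
  EVENT = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L))

# extract_windows() is an illustrative helper, not part of any package
extract_windows <- function(price, event, before = 2, after = 1) {
  offsets <- seq(-before, after)
  event_rows <- which(event == 1)
  # one row per offset, one column per event; each cell is a row index into price
  idx <- outer(offsets, event_rows, "+")
  # indices that fall outside the data become NA, so the lookup returns NA there
  idx[idx < 1 | idx > length(price)] <- NA
  matrix(price[idx], nrow = length(offsets),
         dimnames = list(as.character(offsets),
                         sprintf("event_%d", seq_along(event_rows))))
}

windows <- extract_windows(df$PRICE, df$EVENT, before = 2, after = 1)
print(windows)
```

On the real data the call would presumably be extract_windows(df$PRICE, df$EVENT, before = 30, after = 60). Because outer() builds all indices in one vectorized step, this should stay fast with 300,000 rows.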

For the sake of completeness, here's a base R solution:

# example data
set.seed(123)
df <- data.frame(price = rnorm(100), event = rbinom(100, 1, 0.05))

# create a vector of unique event positions with additional 2 positions before and 1 ahead
offset <- unique(as.vector(sapply(which(df$event == 1), function(x) c((x-2):(x+1)))))

# subset data    
df[offset[offset > 0 & offset <= nrow(df)], ]

         price event
1  -0.56047565     0
2  -0.23017749     1
3   1.55870831     0
20 -0.47279141     0
21 -1.06782371     0
22 -0.21797491     1
23 -1.02600445     0
46 -1.12310858     0
47 -0.40288484     0
48 -0.46665535     1
49  0.77996512     1
50 -0.08336907     0
62 -0.50232345     0
63 -0.33320738     0
64 -1.01857538     1
65 -1.07179123     0
75 -0.68800862     0
76  1.02557137     0
77 -0.28477301     1
78 -1.22071771     0
95  1.36065245     0
96 -0.60025959     0
97  2.18733299     1
98  1.53261063     0

Edit: I didn't see the expected output at first, see @mtoto's answer for that.

What I would do is extend the base data frame with the lags and then select by rows. Using the tidyverse it would be something like this. (I strongly recommend using the tidyverse rather than base R, but that is up to you.)

library(tidyverse)

# generate example data frame

df <- data.frame(price = rnorm(100), event = rbinom(100, 1, 0.5))

# generate a vector from one to the desired number of lags.
# we map this vector with a function that returns the lagged
# values of the price, then we bind the results by columns
lags <- map(1:3, function(x){lag(df$price, n = x)}) %>%
    reduce(cbind) %>% as.data.frame %>% 
    set_names(paste('priceLag', 1:3, sep = ''))

# bind lags to original data frame, select events == 1
out <- cbind(df, lags) %>% filter(event == 1)

library('tidyverse')

df <- data.frame(
  price = seq_len(20),
  event = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0))
df
#    price event
# 1      1     0
# 2      2     0
# 3      3     0
# 4      4     0
# 5      5     1
# 6      6     0
# 7      7     0
# 8      8     0
# 9      9     0
# 10    10     0
# 11    11     0
# 12    12     1
# 13    13     0
# 14    14     0
# 15    15     0
# 16    16     1
# 17    17     1
# 18    18     0
# 19    19     0
# 20    20     0

You can use lag and lead to get the offset values. Then use a combination of gather and spread to flip the data frame to the desired shape.

df %>%
  mutate(
    `-2` = lag(price, 2),
    `-1` = lag(price),
    `0` = price,
    `+1` = lead(price)) %>%
  select(-price) %>%
  filter(event == 1) %>%
  mutate(event = paste0('event_', seq_along(event))) %>%
  gather('offset', 'value', -event) %>%
  spread(event, value) %>%
  arrange(as.numeric(offset))
#   offset event_1 event_2 event_3 event_4
# 1     -2       3      10      14      15
# 2     -1       4      11      15      16
# 3      0       5      12      16      17
# 4     +1       6      13      17      18
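
Writing one mutate() line per offset gets unwieldy for a span like [-30:60], so the shifted columns could also be generated in a loop. This is a rough sketch rather than a tested recipe for the real data: shift_price() and offsets are hypothetical names introduced here, and the result is a plain matrix instead of the gathered and spread data frame shown above.

```r
library(dplyr)

df <- data.frame(
  price = seq_len(20),
  event = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0))

offsets <- -2:1  # for the actual analysis this would be -30:60

# shift_price() is an illustrative helper: negative offsets look back with
# lag(), positive offsets look ahead with lead(), zero returns x unchanged
shift_price <- function(x, k) {
  if (k == 0) return(x)
  if (k < 0) dplyr::lag(x, n = -k) else dplyr::lead(x, n = k)
}

# one shifted copy of the price column per offset, bound as columns
shifted <- do.call(cbind, lapply(offsets, function(k) shift_price(df$price, k)))
colnames(shifted) <- offsets

# keep only the event rows, then transpose: offsets as rows, events as columns
windows <- t(shifted[df$event == 1, , drop = FALSE])
colnames(windows) <- paste0("event_", seq_len(ncol(windows)))
print(windows)
```

Offsets that reach past the start or end of the data come back as NA, because that is what lag() and lead() return by default.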
