Subsetting by row value in R

Firstly, apologies if this has been answered elsewhere. I have searched all over but cannot find the answer. My problem may be due to the way I am searching, so I decided to post on StackOverflow, where I can present my problem with examples.

I have five-minute OHLC data (DIA_5.csv), to which I have added a day-of-year column using lubridate:

library(lubridate)
DIA_5[,6] <- yday(DIA_5[,1])

The data then looks like this:

    Date                Open        High        Low         Close       DOY 
1   2015-09-21 09:30:00 164.6700    164.7100    164.3700    164.5300    264
2   2015-09-21 09:35:00 164.5300    164.9000    164.5300    164.6400    264
3   2015-09-21 09:40:00 164.6600    164.8900    164.6000    164.8900    264
4   2015-09-21 09:45:00 164.9100    165.0900    164.9100    164.9736    264
5   2015-09-21 09:50:00 164.9399    165.0980    164.8200    164.8200    264

What I wanted to do was create a new data frame whose first column holds the individual day-of-year numbers, and then fill it by subsetting the original OHLC data frame by day of year. The aim is that, in the new data frame, I can extract the maximum of all the highs on day x into a column, and likewise for the other variables. The closest I could get was the following code; however, it returns all the OHLC columns (the first row of each day), and I cannot find a way to change it so that only the day-of-year number is brought across to the new data frame.

DF<-DIA_5[match(unique(DIA_5[,6]), DIA_5[,6]),]

  row.names DATE    OPEN    HIGH    LOW CLOSE   DOY
1   1   2015-09-21 09:30:00 164.67  164.7100    164.370 164.5300    264
2   79  2015-09-22 09:30:00 162.62  162.9600    162.620 162.7544    265
3   157 2015-09-23 09:30:00 163.26  163.3800    162.980 163.1400    266
4   235 2015-09-24 09:30:00 161.12  161.3700    161.060 161.2300    267
5   313 2015-09-25 09:30:00 163.81  163.9100    163.570 163.5800    268

Despite getting more data than needed from the above code, I decided to try subsetting anyway. For each row, I wanted to use the DOY value (e.g. 264) as a filter on the main OHLC data frame and then extract the highest value in the High column. Using

DF[,6] <- max(subset(DIA_5[,3], yday(DIA_5[,1]) == DF[,6] ))

gave me

Warning message:
In yday(DIA_5[, 1]) == DF[, 6] :
longer object length is not a multiple of shorter object length

It did write a column to the data frame, but with the same value repeated in every row:

row.names   DATE    OPEN    HIGH    LOW CLOSE   DOY
1   1   2015-09-21 09:30:00 164.67  164.7100    164.370 164.5300    179.02
2   79  2015-09-22 09:30:00 162.62  162.9600    162.620 162.7544    179.02
3   157 2015-09-23 09:30:00 163.26  163.3800    162.980 163.1400    179.02
4   235 2015-09-24 09:30:00 161.12  161.3700    161.060 161.2300    179.02
5   313 2015-09-25 09:30:00 163.81  163.9100    163.570 163.5800    179.02
6   391 2015-09-28 09:30:00 162.04  162.0600    161.660 161.7100    179.02

I tried using my subset syntax to pull the max High value for a single DOY number, and that seems to work fine:

h <- max(subset(DIA_5[,3], yday(DIA_5[,1]) == DF[1,6] ))

But I just cannot work out how to do this so that it creates a new column holding the maximum of the High column for each day of year.

Any help with this would be very much appreciated.
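
For reference, the warning arises because `yday(DIA_5[,1]) == DF[,6]` compares a long vector (one entry per bar) with a shorter one (one entry per day). R recycles the shorter vector, so the comparison does not test each row against one particular day, and `max()` then returns a single number that is assigned to every row. A minimal base R sketch of a per-day version follows; the small `DIA_5` built here is a hypothetical stand-in for the real file, and the column names `High`, `DOY` and `MaxHigh` are assumptions for illustration.

```r
# Hypothetical stand-in for DIA_5 with only the columns needed here.
DIA_5 <- data.frame(
  Date = as.POSIXct(c("2015-09-21 09:30:00", "2015-09-21 09:35:00",
                      "2015-09-22 09:30:00", "2015-09-22 09:35:00",
                      "2015-09-23 09:30:00"), tz = "UTC"),
  High = c(164.71, 164.90, 162.96, 163.10, 163.38)
)

# Day of year in base R; POSIXlt's yday is 0-based, so add 1 to match lubridate::yday().
DIA_5$DOY <- as.POSIXlt(DIA_5$Date)$yday + 1

days <- unique(DIA_5$DOY)

# One max() call per day: each comparison is against a single value, so nothing is recycled.
max_high <- sapply(days, function(d) max(DIA_5$High[DIA_5$DOY == d]))

DF <- data.frame(DOY = days, MaxHigh = max_high)
print(DF)
```

`tapply(DIA_5$High, DIA_5$DOY, max)` should give the same per-day maxima as a named vector.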

You can use dplyr.

I created some fake data, which looks like this, and stored it in df:

    Date     Open    High    Low    Close DOY
1 2015-09-21 164.6700 164.710 164.37 164.5300 264
2 2015-09-21 164.5300 164.900 164.53 164.6400 264
3 2015-09-21 164.6600 164.890 164.60 164.8900 264
4 2015-09-22 164.9100 165.090 164.91 164.9736 265
5 2015-09-22 164.9399 165.098 164.82 164.8200 265
6 2015-09-22 162.6200 162.960 162.62 162.7544 265
7 2015-09-23 163.2600 163.380 162.98 163.1400 266
8 2015-09-23 161.1200 161.370 161.06 161.2300 266
9 2015-09-23 163.8100 163.910 163.57 163.5800 266

library(dplyr)
x <- df %>% 
  group_by(DOY) %>% 
  filter(High == max(High)) %>% 
  as.data.frame()
x
        Date     Open    High    Low  Close DOY
1 2015-09-21 164.5300 164.900 164.53 164.64 264
2 2015-09-22 164.9399 165.098 164.82 164.82 265
3 2015-09-23 163.8100 163.910 163.57 163.58 266
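
Note that `filter(High == max(High))` keeps whole rows, and returns more than one row per day if the maximum is tied. If the goal is one summary row per day (max of the highs, "and so on with the other variables"), `summarise()` may be a closer fit. The following is only a sketch under assumptions: the toy `df` repeats the fake values above without the Date column, the output column names (`DayOpen`, `DayHigh`, `DayLow`, `DayClose`) are made up for illustration, and rows are assumed to be in time order within each day.

```r
library(dplyr)

# Toy data copied from the fake df above (Date column omitted for brevity).
df <- data.frame(
  Open  = c(164.67, 164.53, 164.66, 164.91, 164.9399, 162.62, 163.26, 161.12, 163.81),
  High  = c(164.71, 164.90, 164.89, 165.09, 165.098, 162.96, 163.38, 161.37, 163.91),
  Low   = c(164.37, 164.53, 164.60, 164.91, 164.82, 162.62, 162.98, 161.06, 163.57),
  Close = c(164.53, 164.64, 164.89, 164.9736, 164.82, 162.7544, 163.14, 161.23, 163.58),
  DOY   = rep(264:266, each = 3)
)

# Collapse each day to a single row of summary values.
daily <- df %>%
  group_by(DOY) %>%
  summarise(
    DayOpen  = first(Open),   # open of the first bar of the day
    DayHigh  = max(High),     # highest high of the day
    DayLow   = min(Low),      # lowest low of the day
    DayClose = last(Close)    # close of the last bar of the day
  ) %>%
  as.data.frame()

print(daily)
```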

aggregate is a fine 'one-liner' for this:

#simulate some time series and place in data.frame
set.seed(1)
d = data.frame(replicate(5,cumsum(rnorm(2000))))
d$doy = sort(sample(1:364,2000,replace=TRUE))
print(d[d$doy==1,])

          X1          X2         X3         X4         X5 doy
1 -0.6264538 -0.88614959 -1.1346302 -0.6188271  0.2637034   1
2 -0.4428105 -2.80840448 -0.3700731 -1.7282490 -0.5657484   1
3 -1.2784391 -1.18870374  0.2006371 -3.8985843 -2.0273832   1
4  0.3168417 -0.66943383 -1.1510569 -3.9298873 -0.3433930   1
5  0.6463495 -0.72528376 -3.1809423 -4.1902858 -1.8877173   1
6 -0.1741189 -0.02886615 -2.5904637 -3.6558553 -2.0786045   1
7  0.3133101  0.02464952 -4.0035337 -4.2152947 -1.0623928   1
8  1.0516348 -1.28563397 -2.3931921 -2.6069245 -0.5152666   1
9  1.6274162 -3.40870003 -0.5527496 -2.0502848  0.2398875   1

#aggregate data by DOY and compute some statistics for each column
maxPerDOY.df = aggregate(d[1:5],list(doy=d$doy),max)
print(head(maxPerDOY.df,3))

  doy       X1          X2         X3         X4        X5
1   1 1.627416  0.02464952  0.2006371 -0.6188271 0.2637034
2   2 3.223652 -2.76920768  0.8155484 -1.8646623 2.1378466
3   3 3.216576 -3.39431265 -0.8062283 -0.6656144 2.9014736
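
Applied to data shaped like the question's, the same idea can be written with the formula interface of `aggregate`. This is a sketch rather than a definitive recipe: the toy `df` and its column names (`High`, `Low`, `DOY`) are assumptions that mirror the question's layout. Because each `aggregate` call applies one function, a different statistic per column takes one call each, joined with `merge`.

```r
# Toy OHLC-like data; values and column names are illustrative assumptions.
df <- data.frame(
  High = c(164.71, 164.90, 164.89, 165.09, 165.098, 162.96, 163.38, 161.37, 163.91),
  Low  = c(164.37, 164.53, 164.60, 164.91, 164.82, 162.62, 162.98, 161.06, 163.57),
  DOY  = rep(264:266, each = 3)
)

# Formula interface: "High grouped by DOY", reduced with max.
max_high <- aggregate(High ~ DOY, data = df, FUN = max)

# Same pattern with a different function for the lows.
min_low <- aggregate(Low ~ DOY, data = df, FUN = min)

# Combine the per-day results into one data frame keyed on DOY.
daily <- merge(max_high, min_low, by = "DOY")
print(daily)
```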

Using the advice given by Teja K, I managed to code all the subsetting that was required for my project. dplyr is an excellent package and designed exactly for this. The syntax is also incredibly easy for noobs like me. Thanks for all the help, guys.
