Searching Two Columns in a Data Frame in R
I have a question about searching for values in R. It is similar to a question posted yesterday (Searching a vector/data table backwards in R), except that I think my problem is a bit more complicated, and that question does the opposite of what I want. Since I'm very new to R, I'm not sure how to solve this problem.
I have a data frame similar to the one given below, and I wish to find an index before my current one where the Times column differs from my current time and the Midquote column is not NA.
Index Times | Midquote
-----------------------------
1 10:30:45.58 | 5.319
2 10:30:45.93 | 5.323
3 10:30:45.104 | 5.325
4 10:30:45.127 | 5.322
5 10:30:45.188 | 5.325
6 10:30:45.188 | NA
7 10:30:45.212 | NA
8 10:30:45.231 | 5.321
9 10:30:45.231 | 5.321
If we start at the bottom of the data frame and take this to be the 'current' time, we are at index 9, which has a Times value of 10:30:45.231 and a Midquote value of 5.321. Looking for the first earlier index where the time differs from my current time, we find index 7, which has a time of 10:30:45.212 (index 8 has the same time as index 9). But at index 7 the Midquote value is NA, so I have to keep searching. Index 6 also has a different time (10:30:45.188), but its Midquote value is NA as well. Moving up again to index 5, the Times column again differs from my current time (10:30:45.188) and the Midquote value is 5.325.
Therefore, since the time at index 5 is 10:30:45.188 (different from my current time of 10:30:45.231) and the Midquote value at index 5 is not NA, I wish to obtain the output '5', since it is the index that fulfills both criteria.
My question is, is there a good way of doing this? I am sorry if this is an easy question; I am very new to R and I don't know much about working with data frames.
EDIT: I would also prefer to do this without adding another column to the data frame (as is done in the top answer of the question I linked above), if that is possible.
Working with dates is tough, especially with fractional seconds. If you could convert the times to doubles, they would be easier to work with. Assuming your 'Times' are in order, you could use this:
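library(magrittr)
# Assumes Times is sorted and numeric; character times compare alphabetically, not chronologically
# Row 9 is the 'current' row; keep earlier times with a non-NA Midquote and take the last one
which(df$Times < df$Times[9] & !is.na(df$Midquote)) %>% max()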
which gives a vector of the 'Index' values where 'Times' is less than the time in row 9 AND 'Midquote' is not NA. The %>% pipe sends that vector to max(), which gives the highest value. This is pretty inelegant, but it will get the job done.
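The suggestion to convert the times to doubles could be sketched as follows. This is only an illustration: the helper name to_seconds is hypothetical, and it assumes that the part after the last dot is a millisecond count without zero-padding (so .58 means 58 ms), which is how the sample data appears to be ordered. If the fraction is really a decimal fraction of a second, the conversion would need to change.

```r
# Hypothetical helper: convert "HH:MM:SS.mmm" strings to seconds since midnight.
# Assumption: the last field is an unpadded millisecond count (".58" = 58 ms).
to_seconds <- function(x) {
  parts <- strsplit(x, "[:.]")
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] * 3600 + p[2] * 60 + p[3] + p[4] / 1000
  }, numeric(1))
}

times <- c("10:30:45.58", "10:30:45.93", "10:30:45.104", "10:30:45.127",
           "10:30:45.188", "10:30:45.188", "10:30:45.212", "10:30:45.231",
           "10:30:45.231")
midquote <- c(5.319, 5.323, 5.325, 5.322, 5.325, NA, NA, 5.321, 5.321)

secs <- to_seconds(times)
current <- 9

# Last row with an earlier time and a non-NA Midquote
result <- max(which(secs < secs[current] & !is.na(midquote)))
result
```

With the numeric values the < comparison is chronological, whereas comparing the raw strings would be alphabetical.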
If I understood correctly, please check whether this is the output you are expecting.
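ind <- function(t, df) {
  # Remember the current row; t then walks upwards from the row above it
  ind <- t
  while (t > 1) {
    t <- t - 1
    # Accept the first row with a different time and a non-NA Midquote
    if ((df$Times[t] != df$Times[ind]) && (!is.na(df$Midquote[t]))) {
      return(t)
    }
  }
  NULL  # no earlier row satisfies both conditions
}
# Apply to every row of df, starting from the last one
sapply(nrow(df):1, FUN = ind, df)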
#[[1]]
#[1] 5
#[[2]]
#[1] 5
#[[3]]
#[1] 5
#[[4]]
#[1] 4
#[[5]]
#[1] 4
#[[6]]
#[1] 3
#[[7]]
#[1] 2
#[[8]]
#[1] 1
#[[9]]
#NULL
The output series gives the matching index for each row of your data.frame, starting from the last row.
Explanation: ind holds the row number of the current row, while t takes values from ind-1 down to 1. df is the entire data.frame. The while loop checks whether df$Times[t] and df$Midquote[t] satisfy the required conditions. If they do, the function returns the index t; otherwise the loop continues until it reaches the first row.
Without using sapply, for a particular current row:
ind(9,df)
[1] 5
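Because the loop version returns NULL when no earlier row qualifies, sapply falls back to a list. A possible variant, shown here only as a sketch, returns NA instead so that the result is a plain integer vector. The function name prev_index is hypothetical, and the data frame is rebuilt from the question's table with Midquote stored as numbers.

```r
df <- data.frame(
  Times = c("10:30:45.58", "10:30:45.93", "10:30:45.104", "10:30:45.127",
            "10:30:45.188", "10:30:45.188", "10:30:45.212", "10:30:45.231",
            "10:30:45.231"),
  Midquote = c(5.319, 5.323, 5.325, 5.322, 5.325, NA, NA, 5.321, 5.321),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last row before `current` with a different time
# and a non-NA Midquote, or NA if there is none.
prev_index <- function(current, df) {
  earlier <- seq_len(current - 1)
  hits <- earlier[df$Times[earlier] != df$Times[current] &
                    !is.na(df$Midquote[earlier])]
  if (length(hits) == 0) NA_integer_ else max(hits)
}

# One result per row, from the last row to the first
res <- vapply(nrow(df):1, prev_index, integer(1), df = df)
res
```

This adds no column to the data frame and only tests the times for equality, so it does not depend on how the time strings sort.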
data.table solution, one line:
library(data.table)
dt <- data.table(
  Index = 1:9,
  Times = c('10:30:45.58', '10:30:45.93', '10:30:45.104', '10:30:45.127', '10:30:45.188',
            '10:30:45.188', '10:30:45.212', '10:30:45.231', '10:30:45.231'),
  Midquote = c('5.319', '5.323', '5.325', '5.322', '5.325', NA, NA, '5.321', '5.321')
)
# Keep rows whose time differs from the last row's and whose Midquote is not NA, then take the largest Index
dt[Times != Times[.N] & !is.na(Midquote), max(Index)]
[1] 5
EDIT
To remove the Index column you have (at least) two options:
dt2 <- data.table(Times = c( '10:30:45.58', '10:30:45.93','10:30:45.104','10:30:45.127','10:30:45.188','10:30:45.188','10:30:45.212','10:30:45.231','10:30:45.231' ),
Midquote = c('5.319','5.323','5.325','5.322','5.325',NA,NA,'5.321','5.321'))
# Option 1 - create an id column on the fly (unfortunately data.table recalculates .I after evaluating the "where" clause, so you need to save it first)
dt2[, cbind(.SD, id=.I)][ Times != Times[.N] & !is.na(Midquote), max(id) ]
# Option 2 - simply check the last position of where your condition is met
dt2[, max(which(Times != Times[.N] & !is.na(Midquote))) ]
NB: You can't use nrow, because you could have, say, the 1st, 2nd, and 4th records matching your condition; nrow would then give you 3, which is wrong because the 3rd row does not match.
EDIT 2 (option 3 is not correct)
dt3 <- data.table(Times = c( '10:30:45.58', '10:30:45.93','10:30:45.104','10:30:45.127','10:30:45.188','10:30:45.188','10:30:45.212','10:30:45.231','10:30:45.231' ),
Midquote = c('5.319','5.323', NA,'5.322','5.325', NA, NA,'5.321','5.321'))
# Option 1 - create an id column on the fly (unfortunately data.table recalculates .I after evaluating the "where" clause, so you need to save it first)
dt3[, cbind(.SD, id=.I)][ Times != Times[.N] & !is.na(Midquote), max(id) ]
[1] 5
# Option 2 - simply check the last position of where your condition is met
dt3[, max(which(Times != Times[.N] & !is.na(Midquote))) ]
[1] 5
# Option 3 - good luck with this
nrow(dt3[Times != Times[.N] & !is.na(Midquote)])
[1] 4
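One caveat with max(which(...)): if no row satisfies both conditions, which() returns an empty vector and max() then returns -Inf with a warning. A small guard, sketched here in base R with a hypothetical helper named last_match, could return NA in that case. The vectors below repeat the dt3 data, and the comparison is against the last row, as in the options above.

```r
# Hypothetical helper: position of the last row whose time differs from the
# final row's time and whose Midquote is not NA; NA if no row qualifies.
last_match <- function(times, midquote) {
  idx <- which(times != times[length(times)] & !is.na(midquote))
  if (length(idx) > 0) idx[length(idx)] else NA_integer_
}

times <- c("10:30:45.58", "10:30:45.93", "10:30:45.104", "10:30:45.127",
           "10:30:45.188", "10:30:45.188", "10:30:45.212", "10:30:45.231",
           "10:30:45.231")
midquote <- c("5.319", "5.323", NA, "5.322", "5.325", NA, NA, "5.321", "5.321")

found <- last_match(times, midquote)

# Edge case: every row shares the same time, so nothing qualifies
none <- last_match(c("10:30:45.231", "10:30:45.231"), c("5.321", "5.321"))
```

which() returns positions in increasing order, so taking the last element is equivalent to max() whenever at least one row matches.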