Searching Two Columns in a Data Frame in R

I have a question about searching for values in R. It is similar to a question posted yesterday (Searching a vector/data table backwards in R), except that I think my problem is a bit more complicated (and that question asks for the opposite of what I want to do). Since I'm very new to R, I'm not sure how to solve it.

I have a data frame similar to the one given below. Starting from my current row, I wish to find the index of the closest previous row where the Times column differs from my current time and the Midquote column is not NA.

 Index          Times        |    Midquote
 -----------------------------------------
   1         10:30:45.58     |    5.319
   2         10:30:45.93     |    5.323
   3         10:30:45.104    |    5.325
   4         10:30:45.127    |    5.322
   5         10:30:45.188    |    5.325
   6         10:30:45.188    |    NA
   7         10:30:45.212    |    NA
   8         10:30:45.231    |    5.321
   9         10:30:45.231    |    5.321

Suppose we start at the bottom of the data frame and take that row as the 'current' time: index 9, with a Times value of 10:30:45.231 and a Midquote value of 5.321. Moving up, the first index whose time differs from the current time is index 7, with a time of 10:30:45.212 (index 8 has the same time as index 9). But at index 7 the Midquote value is NA, so I have to keep searching. Index 6 also has a different time (10:30:45.188), but its Midquote value is NA as well. Moving up again to index 5, the Times value again differs from my current time (10:30:45.188 again) and the Midquote value is 5.325.

Therefore, since the time at index 5 is 10:30:45.188 (different from my current time of 10:30:45.231) and the Midquote value at index 5 is not NA, I wish to obtain the output '5', since it is the index that fulfills both criteria.

My question is: is there a good way of doing this? I am sorry if this is an easy question; I am very new to R and don't know much about working with data frames...

EDIT: Preferably, I would also like to do this without adding another column to the data frame (as is done in the top answer of the question linked above), if that is possible.

Working with dates is tough, especially with fractional seconds. If you could convert the times to doubles, they would be easier to work with. Assuming your 'Times' are in order, you could use this:

library(magrittr)
# Assumes df$Times is numeric (or otherwise orderable) and sorted ascending
which(df$Times < df$Times[9] & !is.na(df$Midquote)) %>% max()

which() gives a vector of the row indices where 'Times' is less than the value in row 9 AND 'Midquote' is not NA. The %>% pipes that vector to max(), which returns the highest index. This is pretty inelegant, but it will get the job done.
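The suggestion to convert the times to doubles can be sketched as follows. This is only an illustration: to_seconds is a hypothetical helper, not part of the answer, and it assumes the digits after the dot are a millisecond count (so ".58" means 58 ms), which is how the example data appears to be ordered. The sample data frame is rebuilt from the question.

```r
# Hypothetical helper: convert "HH:MM:SS.mmm" strings to seconds since midnight.
# Assumption: the part after the dot is a millisecond count (".58" = 58 ms).
to_seconds <- function(x) {
  parts <- strsplit(x, "[:.]")  # split on ':' or '.'
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] * 3600 + p[2] * 60 + p[3] + p[4] / 1000
  }, numeric(1))
}

# Example data from the question
df <- data.frame(
  Times = c("10:30:45.58", "10:30:45.93", "10:30:45.104", "10:30:45.127",
            "10:30:45.188", "10:30:45.188", "10:30:45.212", "10:30:45.231",
            "10:30:45.231"),
  Midquote = c(5.319, 5.323, 5.325, 5.322, 5.325, NA, NA, 5.321, 5.321),
  stringsAsFactors = FALSE
)

secs <- to_seconds(df$Times)

# Last row that is strictly earlier than row 9 and has a non-NA Midquote
result <- max(which(secs < secs[9] & !is.na(df$Midquote)))
result
# [1] 5
```

Comparing the raw character strings with < would not give this ordering (for example, "10:30:45.58" sorts after "10:30:45.231" as text), which is why the conversion matters here.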

If I understood correctly, please check whether this is the output you are expecting.

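# Return the nearest earlier row whose Times differs from row t
# and whose Midquote is not NA (implicitly returns NULL if none exists).
ind <- function(t, df) {
    ind <- t  # remember the current row (local variable, shadows the function name)
    while (t > 1) {
        t <- t - 1
        if ((df$Times[t] != df$Times[ind]) && (!is.na(df$Midquote[t]))) {
            return(t)
        }
    }
}
# Apply to every row, from the last row up to the first
sapply(nrow(df):1, FUN = ind, df)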

#[[1]]
#[1] 5

#[[2]]
#[1] 5

#[[3]]
#[1] 5

#[[4]]
#[1] 4

#[[5]]
#[1] 4

#[[6]]
#[1] 3

#[[7]]
#[1] 2

#[[8]]
#[1] 1

#[[9]]
#NULL

The output series gives the matching index for each row of your data.frame, starting from the last row.

Explanation: ind holds the row number of the current row, while t takes values from ind-1 down to 1. df is the entire data.frame. The while loop checks whether df$Times[t] and df$Midquote[t] satisfy the required conditions. If they do, the function returns the index t; otherwise the loop continues until it reaches the first row.

Without using sapply, for a particular current row:

 ind(9,df)
 [1] 5
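As a variation on the loop above, the same search can be written without an explicit while loop. This is a minimal sketch, not the answerer's code: prev_valid is a hypothetical name, and it returns NA instead of NULL when no earlier row qualifies, so the results fit in a plain integer vector rather than a list.

```r
# Example data from the question (Times kept as character strings)
df <- data.frame(
  Times = c("10:30:45.58", "10:30:45.93", "10:30:45.104", "10:30:45.127",
            "10:30:45.188", "10:30:45.188", "10:30:45.212", "10:30:45.231",
            "10:30:45.231"),
  Midquote = c(5.319, 5.323, 5.325, 5.322, 5.325, NA, NA, 5.321, 5.321),
  stringsAsFactors = FALSE
)

# Hypothetical helper: nearest earlier row with a different time and a
# non-NA Midquote, or NA_integer_ if there is none.
prev_valid <- function(t, df) {
  earlier <- seq_len(t - 1)  # rows above the current one (empty for t = 1)
  hits <- which(df$Times[earlier] != df$Times[t] & !is.na(df$Midquote[earlier]))
  if (length(hits) == 0) NA_integer_ else max(hits)
}

# One result per row, from the last row up to the first
res <- vapply(nrow(df):1, prev_valid, integer(1), df = df)
res
# [1]  5  5  5  4  4  3  2  1 NA
```

Only equality of the Times values is tested, so this works on the character column directly and does not add a column to the data frame.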

data.table solution, 1 line.

library(data.table)

dt <- data.table(Index = 1:9,
                 Times = c( '10:30:45.58', '10:30:45.93','10:30:45.104','10:30:45.127','10:30:45.188','10:30:45.188','10:30:45.212','10:30:45.231','10:30:45.231' ),
                 Midquote = c('5.319','5.323','5.325','5.322','5.325',NA,NA,'5.321','5.321')
                )

dt[ Times != Times[.N] & !is.na(Midquote), max(Index) ]
# [1] 5

EDIT

To remove the Index column, you have (at least) two options:

dt2 <- data.table(Times = c( '10:30:45.58', '10:30:45.93','10:30:45.104','10:30:45.127','10:30:45.188','10:30:45.188','10:30:45.212','10:30:45.231','10:30:45.231' ),
                  Midquote = c('5.319','5.323','5.325','5.322','5.325',NA,NA,'5.321','5.321'))


# Option 1 - create an id column on the fly (unfortunately data.table recalculates .I after evaluating the "where" clause, so you need to save it first)
dt2[, cbind(.SD, id=.I)][ Times != Times[.N] & !is.na(Midquote), max(id) ]

# Option 2 - simply check the last position of where your condition is met
dt2[, max(which(Times != Times[.N] & !is.na(Midquote))) ]

NB: You can't use nrow, because you could have, say, the 1st, 2nd, and 4th records matching your condition; nrow would give you 3, which is wrong because the 3rd row does not match.

EDIT 2 (option 3 is not correct)

dt3 <- data.table(Times = c( '10:30:45.58', '10:30:45.93','10:30:45.104','10:30:45.127','10:30:45.188','10:30:45.188','10:30:45.212','10:30:45.231','10:30:45.231' ),
                  Midquote = c('5.319','5.323', NA,'5.322','5.325', NA, NA,'5.321','5.321'))


# Option 1 - create an id column on the fly (unfortunately data.table recalculates .I after evaluating the "where" clause, so you need to save it first)
dt3[, cbind(.SD, id=.I)][ Times != Times[.N] & !is.na(Midquote), max(id) ]
[1] 5

# Option 2 - simply check the last position of where your condition is met
dt3[, max(which(Times != Times[.N] & !is.na(Midquote))) ]
[1] 5

# Option 3 - good luck with this
nrow(dt3[Times != Times[.N] & !is.na(Midquote)])
[1] 4
