R - Compare values in two columns in different rows
I have a data frame df, shown below, with two columns: a departure city and an arrival city. Each pair of consecutive rows stores an outbound flight and its return flight.
Departure Arrival
1 A B
2 B A
3 F G
4 G F
5 U V
6 V U
7 K L
8 K L
There is some inconsistency in the data: the same flight is sometimes repeated, as in the last two rows.
For every two rows, how can I compare the departure city of the first row with the arrival city of the second row, and keep only the pairs where they are equal? The dataset is very large, so a for loop is not an option.
Thank you in advance.
Here is a method that compares the pairs of rows, using head and tail to line them up.
# find Departures that match the Arrival in the next row
sames <- which(head(dat$Departure, -1) == tail(dat$Arrival, -1))
# keep pairs of rows that match, maintaining order with `sort`
dat[sort(unique(c(sames, (sames + 1)))),]
Departure Arrival
1 A B
2 B A
3 F G
4 G F
5 U V
6 V U
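One thing to watch: the head/tail comparison checks every pair of adjacent rows, including row 2 against row 3, which belong to different trips. A minimal sketch, assuming trips always occupy rows (1,2), (3,4) and so on, restricts the matches to odd positions. The data frame ex and the name starts are made up for illustration.

```r
# Hypothetical data: rows 1-2 are a repeated flight, rows 3-4 a valid round trip
ex <- data.frame(Departure = c("K", "K", "L", "K"),
                 Arrival   = c("L", "L", "K", "L"),
                 stringsAsFactors = FALSE)

sames <- which(head(ex$Departure, -1) == tail(ex$Arrival, -1))
# sames also contains 2, because Departure[2] == Arrival[3] across two trips

# keep only matches that start a pair (odd row numbers)
starts <- sames[sames %% 2 == 1]
res <- ex[sort(c(starts, starts + 1)), ]
```

Without the filter, row 2 of the repeated flight would be kept alongside rows 3 and 4; with it, only the valid round trip remains.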
Note that the two variables have to be character vectors, not factors. You can coerce them with as.character if necessary.
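For instance, columns often arrive as factors when the data was built with data.frame() or read.csv() before R 4.0.0, where stringsAsFactors defaulted to TRUE. A small sketch of the coercion, using a made-up data frame dat_f:

```r
# Example data with factor columns (assumed for illustration)
dat_f <- data.frame(Departure = factor(c("A", "B")),
                    Arrival   = factor(c("B", "A")))

# convert every factor column to character, leaving other columns alone
is_fac <- vapply(dat_f, is.factor, logical(1))
dat_f[is_fac] <- lapply(dat_f[is_fac], as.character)
```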
Data
dat <-
structure(list(Departure = c("A", "B", "F", "G", "U", "V", "K",
"K"), Arrival = c("B", "A", "G", "F", "V", "U", "L", "L")), .Names = c("Departure",
"Arrival"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8"))
So you just want unique flight paths? There are a number of ways to do this. I think the fastest would be data.table, something like:
library(data.table)
df <- as.data.table(df)
uniqueDf <- unique(df)
You can also use the duplicated function, something like
df <- df[!duplicated(df), ]
should do nicely.
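Note that on the sample data, removing duplicates drops only the second copy of the repeated flight, so one K L row remains. A quick check in base R, rebuilding the sample data under the name df:

```r
df <- data.frame(Departure = c("A", "B", "F", "G", "U", "V", "K", "K"),
                 Arrival   = c("B", "A", "G", "F", "V", "U", "L", "L"),
                 stringsAsFactors = FALSE)

dedup <- df[!duplicated(df), ]
nrow(dedup)   # 7: row 8 is dropped, row 7 (K L) is kept
```

If the goal is to discard the whole inconsistent pair, a pairwise comparison is closer to what the question asks for.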
You could also do it this way:
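# compare the Arrival of each odd row with the Departure of the following even row
right <- rep(df[c(TRUE, FALSE), "Arrival"] == df[c(FALSE, TRUE), "Departure"], each = 2)
df[right, ]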
This returns:
Departure Arrival
1 A B
2 B A
3 F G
4 G F
5 U V
6 V U
If it suits you, try this solution:
df[duplicated(paste0(df$Departure,df$Arrival))==F,]
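One caveat with pasting the two columns together without a separator: different city pairs can collapse to the same string. A sketch with made-up city codes shows the problem and one way around it, calling duplicated on the two columns directly.

```r
# Hypothetical city codes chosen so that the pasted strings collide
x <- data.frame(Departure = c("AB", "A"),
                Arrival   = c("C", "BC"),
                stringsAsFactors = FALSE)

# both rows paste to "ABC", so the second is flagged as a duplicate
by_paste <- duplicated(paste0(x$Departure, x$Arrival))

# comparing the columns separately does not confuse the two rows
by_cols <- duplicated(x[, c("Departure", "Arrival")])

x[!by_cols, ]   # both rows are kept
```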
This answer doesn't look for unique records; it specifically checks whether a row is a duplicate of the row before it.
Adding a new column that is 1 if the row repeats the previous one:
df$test <- 0  # initialise the column first; row 1 has no previous row to compare with
for (i in seq_len(nrow(df))[-1]) df$test[i] <- ifelse(df$Departure[i] == df$Departure[i-1] & df$Arrival[i] == df$Arrival[i-1], 1, 0)
Loops can be slow though, so here is a vectorised version using shift from data.table:
library(data.table)
df$test2 = ifelse(df$Departure == shift(df$Departure) & df$Arrival == shift(df$Arrival), 1,0)
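If data.table is not available, the same one-row lag can be sketched in base R by prepending NA and dropping the last element. The helper name lag1 is made up here; the resulting flag can then be used to drop the repeated rows.

```r
df <- data.frame(Departure = c("A", "B", "F", "G", "U", "V", "K", "K"),
                 Arrival   = c("B", "A", "G", "F", "V", "U", "L", "L"),
                 stringsAsFactors = FALSE)

# hypothetical helper: shift a vector down by one position, padding with NA
lag1 <- function(x) c(NA, head(x, -1))

# TRUE where a row repeats the row directly above it
rep_prev <- df$Departure == lag1(df$Departure) & df$Arrival == lag1(df$Arrival)
rep_prev[is.na(rep_prev)] <- FALSE   # the first row has nothing above it

df[!rep_prev, ]   # drops row 8 only
```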