Filtering unique rows based on reversed string values in two columns in R

Question

I am trying to filter my dataset to get rid of duplicated rows. However, I want to filter on two different columns whose values are identical when taken in reverse order (Origin-Destination data). Here is an example of the data:

data2<-matrix(NA, nrow = 7, ncol=5)  
colnames(data2)<-c("City.Pair", "Origin.City", "Destination.City", "Total.Passengers", "Total.Revenue")
data2[,1] <- c("LIS-BRU","LIS-LHR","LAD-LIS", "LIS-LAD", "FAO-MAN", "MAN-FAO","LIS-ORY")
data2[,2]<- c("LISBON", "LISBON", "LUANDA", "LISBON", "FARO", "MANCHESTER", "LISBON")
data2[,3] <- c("BRUSSELS","LONDON", "LISBON", "LUANDA", "MANCHESTER", "FARO", "PARIS" )
data2[,4] <- c(100, 5000, 200, 200, 4000, 4000, 4000)
data2[,5] <- c(100.66, 5000.25, 200.75, 200.75, 4000.10, 4000.10, 4000.05)
data2<-data.frame(data2)


  City.Pair Origin.City Destination.City Total.Passengers Total.Revenue
1   LIS-BRU      LISBON         BRUSSELS              100        100.66
2   LIS-LHR      LISBON           LONDON             5000       5000.25
3   LAD-LIS      LUANDA           LISBON              200        200.75
4   LIS-LAD      LISBON           LUANDA              200        200.75
5   FAO-MAN        FARO       MANCHESTER             4000        4000.1
6   MAN-FAO  MANCHESTER             FARO             4000        4000.1
7   LIS-ORY      LISBON            PARIS             4000       4000.05

I used the dplyr library and distinct, which works fine on the number of passengers and the revenue, as in the code below:

library(dplyr)
data4 <- distinct(data2, Total.Passengers, Total.Revenue)

However, my real dataset has millions of rows, and sometimes the number of passengers for the same city pair is not exactly the same (it differs in the decimals). I still need to filter the data and keep only one record, so that I don't count the passengers and the revenue twice.

So I am looking for a function that will let me filter based on the Origin and the Destination, or on City.Pair.

As part of my trials, I tried the anti_join function, joining the dataset against a copy of itself, but it keeps all the rows. I also tried union but got the same result.

data3<- data2
data5<- anti_join(data2, data3, by=c("Origin.City" = "Destination.City", "Destination.City" = "Origin.City"))

My desired output is as follows:

  City.Pair Origin.City Destination.City Total.Passengers Total.Revenue
1   LIS-BRU      LISBON         BRUSSELS              100        100.66
2   LIS-LHR      LISBON           LONDON             5000       5000.25
3   LAD-LIS      LUANDA           LISBON              200        200.75
4   FAO-MAN        FARO       MANCHESTER             4000        4000.1
5   LIS-ORY      LISBON            PARIS             4000       4000.05

What would be the best function for this task? Or what can I correct in my current code?

Thanks!

EDIT

How can I change the code to include another condition in the filtering? Let's say the rows are also coded, and I want to subset/filter based on that column as well.

Here is the new data frame:

data2<-matrix(NA, nrow = 10, ncol=6)  
colnames(data2)<-c("City.Pair", "Origin.City", "Destination.City", "Total.Passengers", "Total.Revenue", "Code")
data2[,1] <- c("LIS-BRU","LIS-LHR","LAD-LIS", "LIS-LAD", "FAO-MAN", "MAN-FAO","LIS-ORY","LAD-LIS", "LAD-LIS", "LIS-LAD")
data2[,2]<- c("LISBON", "LISBON", "LUANDA", "LISBON", "FARO", "MANCHESTER", "LISBON","LUANDA", "LUANDA", "LISBON")
data2[,3] <- c("BRUSSELS","LONDON", "LISBON", "LUANDA", "MANCHESTER", "FARO", "PARIS","LISBON", "LISBON", "LUANDA")
data2[,4] <- c(100, 5000, 200, 200, 4000, 4000, 4000, 20, 40, 40)
data2[,5] <- c(100.66, 5000.25, 200.75, 200.75, 4000.10, 4000.10, 4000.05, 20.5, 40.8, 40.8)
data2[,6] <- c("F", "G","F", "F", "A", "A", "P", "H", "I", "I")
data2<-data.frame(data2)
data2

   City.Pair Origin.City Destination.City Total.Passengers Total.Revenue Code
1    LIS-BRU      LISBON         BRUSSELS              100        100.66    F
2    LIS-LHR      LISBON           LONDON             5000       5000.25    G
3    LAD-LIS      LUANDA           LISBON              200        200.75    F
4    LIS-LAD      LISBON           LUANDA              200        200.75    F
5    FAO-MAN        FARO       MANCHESTER             4000        4000.1    A
6    MAN-FAO  MANCHESTER             FARO             4000        4000.1    A
7    LIS-ORY      LISBON            PARIS             4000       4000.05    P
8    LAD-LIS      LUANDA           LISBON               20          20.5    H
9    LAD-LIS      LUANDA           LISBON               40          40.8    I
10   LIS-LAD      LISBON           LUANDA               40          40.8    I

So the desired output should be as follows:

  City.Pair Origin.City Destination.City Total.Passengers Total.Revenue Code
1   LIS-BRU      LISBON         BRUSSELS              100        100.66    F
2   LIS-LHR      LISBON           LONDON             5000       5000.25    G
3   LAD-LIS      LUANDA           LISBON              200        200.75    F
5   FAO-MAN        FARO       MANCHESTER             4000       4000.10    A
7   LIS-ORY      LISBON            PARIS             4000       4000.05    P
8   LAD-LIS      LUANDA           LISBON               20         20.50    H
9   LAD-LIS      LUANDA           LISBON               40         40.80    I

I have run multiple trials but can't get the filter to work on two columns at the same time. Here is my code:

dat1<- 
  data2 %>%
  group_by(Code, City.Pair, Origin.City, Destination.City) %>%
  filter(Origin.City!=Destination.City & Destination.City!=Origin.City) %>%
  summarise(Passengers=sum(Total.Passengers), 
          Revenue=sum(Total.Revenue))

Answer 1

We can split 'City.Pair' by '-', sort the elements within each pair, paste them back together to give a vector, check for duplicates ('i1'), and use the logical vector to subset the rows of 'data2'.

i1 <- !duplicated(apply(sapply(strsplit(as.character(data2$City.Pair), "-"), 
                sort), 2, paste, collapse="-"))
data2[i1,]
#    City.Pair Origin.City Destination.City Total.Passengers Total.Revenue
#1   LIS-BRU      LISBON         BRUSSELS              100        100.66
#2   LIS-LHR      LISBON           LONDON             5000       5000.25
#3   LAD-LIS      LUANDA           LISBON              200        200.75
#5   FAO-MAN        FARO       MANCHESTER             4000        4000.1
#7   LIS-ORY      LISBON            PARIS             4000       4000.05
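
The one-liner above packs several steps together. As a rough sketch of the same idea, here it is unpacked on a small vector copied from the question's data. The names city_pair, parts, key and keep are illustrative and not part of the original answer, and vapply is used here in place of the sapply/apply combination:

```r
# City pairs copied from the question's first example
city_pair <- c("LIS-BRU", "LIS-LHR", "LAD-LIS", "LIS-LAD",
               "FAO-MAN", "MAN-FAO", "LIS-ORY")

# 1. Split each pair on "-" into its two airport codes
parts <- strsplit(city_pair, "-", fixed = TRUE)

# 2. Sort the two codes within each pair and paste them back together,
#    so "LAD-LIS" and "LIS-LAD" end up with the same key
key <- vapply(parts, function(p) paste(sort(p), collapse = "-"), character(1))

# 3. Keep only the first occurrence of each direction-independent key
keep <- !duplicated(key)
city_pair[keep]
```

The logical vector keep plays the same role as 'i1' and can be used to subset the rows of a data frame in the same way.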

Or using separate with pmin/pmax:

library(dplyr)
library(tidyr)
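separate(data2, City.Pair, into = c("City", "City2"), remove = FALSE) %>% 
         # paste both ends into one key; duplicated() would treat a second argument as 'incomparables'
         filter(!duplicated(paste(pmin(City, City2), pmax(City, City2)))) %>%
         select(-City, -City2)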
#  City.Pair Origin.City Destination.City Total.Passengers Total.Revenue
#1   LIS-BRU      LISBON         BRUSSELS              100        100.66
#2   LIS-LHR      LISBON           LONDON             5000       5000.25
#3   LAD-LIS      LUANDA           LISBON              200        200.75
#4   FAO-MAN        FARO       MANCHESTER             4000        4000.1
#5   LIS-ORY      LISBON            PARIS             4000       4000.05
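
Neither approach above addresses the EDIT in the question, where rows should only count as duplicates when they also share the same Code. One possible sketch, assuming the duplicate key can simply be extended with that column, is shown below in base R. It rebuilds the EDIT example with numeric columns for convenience, and the helper names city_a, city_b, keep and result are illustrative:

```r
# Rebuild the EDIT example as a data frame (values copied from the question)
data2 <- data.frame(
  City.Pair = c("LIS-BRU", "LIS-LHR", "LAD-LIS", "LIS-LAD", "FAO-MAN",
                "MAN-FAO", "LIS-ORY", "LAD-LIS", "LAD-LIS", "LIS-LAD"),
  Origin.City = c("LISBON", "LISBON", "LUANDA", "LISBON", "FARO",
                  "MANCHESTER", "LISBON", "LUANDA", "LUANDA", "LISBON"),
  Destination.City = c("BRUSSELS", "LONDON", "LISBON", "LUANDA", "MANCHESTER",
                       "FARO", "PARIS", "LISBON", "LISBON", "LUANDA"),
  Total.Passengers = c(100, 5000, 200, 200, 4000, 4000, 4000, 20, 40, 40),
  Total.Revenue = c(100.66, 5000.25, 200.75, 200.75, 4000.10, 4000.10,
                    4000.05, 20.5, 40.8, 40.8),
  Code = c("F", "G", "F", "F", "A", "A", "P", "H", "I", "I"),
  stringsAsFactors = FALSE
)

# Direction-independent endpoints: pmin/pmax pick the alphabetically
# first and last city of each row, regardless of which one is the origin
city_a <- pmin(data2$Origin.City, data2$Destination.City)
city_b <- pmax(data2$Origin.City, data2$Destination.City)

# duplicated() on a data frame compares whole rows, so a row is dropped
# only if both endpoints and the Code match an earlier row
keep <- !duplicated(data.frame(city_a, city_b, code = data2$Code))
result <- data2[keep, ]
result
```

On this example the rows kept are 1, 2, 3, 5, 7, 8 and 9, which matches the desired output in the EDIT. Because the key ignores Total.Passengers and Total.Revenue, small decimal differences between the two directions do not prevent a match.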

Answer 1: accepted, 2016-07-10 14:06:19
