
R: How to remove duplicates from a list of lists

I have a list of lists, each containing the following two variables:

> dist_sub[[1]]$zip
 [1] 901 902 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928
[26] 929 930 931 933 934 935 936 937 938 939 940 955 961 962 963 965 966 968 969 970 975 981

> dist_sub[[1]]$hu
 [1]  4990    NA   168 13224    NA  3805    NA  6096  3884  4065    NA 16538    NA 12348 10850    NA
[17]  9322 17728    NA 13969 24971  5413 47317  7893    NA    NA    NA    NA    NA   140    NA     4
[33]    NA    NA    NA    NA    NA 13394  8939    NA  3848  7894  2228 17775    NA    NA    NA



> dist_sub[[2]]$zip
 [1] 921 934 952 956 957 958 959 960 961 962 965 966 968 969 970 971

> dist_sub[[2]]$hu
 [1] 17728   140  4169 32550 18275    NA 22445     0 13394  8939  3848  7894  2228 17775    NA 12895

Is there a way to remove duplicates so that a zip code appearing in more than one list is kept in only one of them, according to a specific criterion?

Example: zip code 00921 is present in both lists above. I'd like to keep it only in the list with the lowest sum of hu (housing units). In this case, I would like to keep zip code 00921 in the 2nd list only, since the sum of hu is 162,280 in list 2 versus 256,803 in list 1.
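
For reference, the two totals can be reproduced from the printed hu values, provided the NA entries are skipped. The names hu1 and hu2 below are only illustrative stand-ins for dist_sub[[1]]$hu and dist_sub[[2]]$hu:

```r
# hu values copied from the printed output above (hu1 and hu2 are illustrative names)
hu1 <- c(4990, NA, 168, 13224, NA, 3805, NA, 6096, 3884, 4065, NA, 16538, NA, 12348, 10850, NA,
         9322, 17728, NA, 13969, 24971, 5413, 47317, 7893, NA, NA, NA, NA, NA, 140, NA, 4,
         NA, NA, NA, NA, NA, 13394, 8939, NA, 3848, 7894, 2228, 17775, NA, NA, NA)
hu2 <- c(17728, 140, 4169, 32550, 18275, NA, 22445, 0, 13394, 8939, 3848, 7894, 2228, 17775, NA, 12895)

# na.rm = TRUE skips the NA entries; without it both sums would be NA
sum(hu1, na.rm = TRUE)  # 256803
sum(hu2, na.rm = TRUE)  # 162280
```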

Any help is very much appreciated.

Here is a simulated dataset for your problem so that others can use it too.

dist_sub <- list(list("zip"=1:10,
                      "hu"=rnorm(10)),
                list("zip"=8:12,
                      "hu"=rnorm(5)),
                list("zip"=c(1, 3, 11, 7),
                      "hu"=rnorm(4))
                )
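
As a quick check before removing anything, it may help to list which zips occur in more than one set. This is only a sketch: it rebuilds the simulated data so that it runs on its own, and the names all_zips and dup_zips are made up for illustration.

```r
# Same simulated structure as above (hu values are random and do not affect this check)
dist_sub <- list(list("zip" = 1:10, "hu" = rnorm(10)),
                 list("zip" = 8:12, "hu" = rnorm(5)),
                 list("zip" = c(1, 3, 11, 7), "hu" = rnorm(4)))

# Pool the zips of every set into one vector
all_zips <- unlist(lapply(dist_sub, function(s) s[["zip"]]))

# Zips that appear in more than one set
dup_zips <- sort(unique(all_zips[duplicated(all_zips)]))
dup_zips  # 1 3 7 8 9 10 11
```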

Here's a solution that I was able to come up with. I realized that loops were really the cleaner way to do this:

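do.this <- function (x) {
  # With fewer than two sets there is nothing to compare
  if (length(x) < 2) return(x)
  for (k in 1:(length(x) - 1)) {
    for (l in (k + 1):length(x)) {
      # TRUE for zips in set k that do not appear in the later set l.
      # A logical mask is used because x[-which(...)] drops every
      # element when which() finds no match.
      keep <- !(x[[k]][["zip"]] %in% x[[l]][["zip"]])
      x[[k]][["zip"]] <- x[[k]][["zip"]][keep]
      x[[k]][["hu"]] <- x[[k]][["hu"]][keep]
    }
  }
  return(x)
}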

The idea is really simple: for each set of zips, we remove the elements that are repeated in any set after it. We do this only up to the penultimate set, because by then the last set has no zips in common with any set before it.
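
One R detail matters when removing elements by position in a loop like this: negative indexing with an empty result from which() selects nothing, so a set that shares no zips with a later set would be emptied instead of left alone. A minimal sketch with made-up values (zips, to_remove and keep are illustrative names):

```r
zips <- c(901, 902, 906)

# No overlap with the other set, so which() returns an empty index vector
to_remove <- which(zips %in% c(999))

# Negative indexing with an empty vector selects nothing at all
length(zips[-to_remove])  # 0, not 3

# A logical mask keeps every element when nothing matches
keep <- !(zips %in% c(999))
length(zips[keep])        # 3
```

Indexing with a logical mask avoids this edge case without needing a separate check for an empty match.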

The criterion you have, i.e. lowest sum of hu, can be easily implemented using the function above. What you need to do is reorder the list dist_sub by sum of hu, like so:

sum_hu <- sapply(dist_sub, function (k) sum(k[["hu"]], na.rm=TRUE))
dist_sub <- dist_sub[order(sum_hu, decreasing=TRUE)]

Now dist_sub is sorted by sum_hu in decreasing order, which means that for each set, the sets that come before it have a sum_hu at least as large. Therefore, if the sets at positions i and j (i < j) have a zip a in common, then a should be removed from the i-th set. That is exactly what calling do.this(dist_sub) on the reordered list does. Do you think that makes sense?
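
To make the ordering argument concrete, here is a small end-to-end sketch on invented, deterministic data. The list sets and the helper drop_later_dups are hypothetical names for this illustration only; the helper follows the same idea as the loop above but is not the exact function from this answer.

```r
# Hypothetical data: zips 2 and 3 are shared between sets
sets <- list(
  list(zip = c(1, 2, 3), hu = c(10, 20, 30)),  # sum of hu = 60
  list(zip = c(2, 3, 4), hu = c(20, 30, 5)),   # sum of hu = 55
  list(zip = c(3, 5),    hu = c(30, NA))       # sum of hu = 30 (NA skipped)
)

# Put the sets with the largest total hu first
sum_hu <- sapply(sets, function(s) sum(s[["hu"]], na.rm = TRUE))
sets <- sets[order(sum_hu, decreasing = TRUE)]

# Hypothetical helper: drop from each set any zip found in a later set
drop_later_dups <- function(x) {
  if (length(x) < 2) return(x)
  for (k in seq_len(length(x) - 1)) {
    later <- unlist(lapply(x[(k + 1):length(x)], function(s) s[["zip"]]))
    keep <- !(x[[k]][["zip"]] %in% later)
    x[[k]][["zip"]] <- x[[k]][["zip"]][keep]
    x[[k]][["hu"]] <- x[[k]][["hu"]][keep]
  }
  x
}

result <- drop_later_dups(sets)

# Each zip now survives only in the set with the lowest original sum of hu
result[[1]][["zip"]]  # 1
result[[2]][["zip"]]  # 2 4
result[[3]][["zip"]]  # 3 5
```

Note that the ordering uses the sums computed before any removal; the totals of the trimmed sets are not recomputed.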

PS: I've called the function do.this because I usually like writing generic solutions, while this was a very specific question, albeit an interesting one.

