
How to remove duplicates from a list of lists in R

I have a list of lists that contain the following 2 variables:

> dist_sub[[1]]$zip
 [1] 901 902 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928
[26] 929 930 931 933 934 935 936 937 938 939 940 955 961 962 963 965 966 968 969 970 975 981

> dist_sub[[1]]$hu
 [1]  4990    NA   168 13224    NA  3805    NA  6096  3884  4065    NA 16538    NA 12348 10850    NA
[17]  9322 17728    NA 13969 24971  5413 47317  7893    NA    NA    NA    NA    NA   140    NA     4
[33]    NA    NA    NA    NA    NA 13394  8939    NA  3848  7894  2228 17775    NA    NA    NA



> dist_sub[[2]]$zip
 [1] 921 934 952 956 957 958 959 960 961 962 965 966 968 969 970 971

> dist_sub[[2]]$hu
 [1] 17728   140  4169 32550 18275    NA 22445     0 13394  8939  3848  7894  2228 17775    NA 12895

Is there a way to remove duplicates such that, if a zipcode appears in more than one list, it is removed from all but one of them according to specific criteria?

Example: zipcode 00921 is present in the two lists above. I'd like to keep it only in the list with the lowest sum of hu (housing units). In this case I would keep zipcode 00921 in the 2nd list only, since the sum of hu is 162,280 in list 2 versus 256,803 in list 1.
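To spell out that arithmetic, here is a quick check of the two sums (the hu values are copied from the output above; NAs are dropped by na.rm = TRUE):

```r
# hu vectors from dist_sub[[1]] (NAs omitted here for brevity) and dist_sub[[2]]
hu1 <- c(4990, 168, 13224, 3805, 6096, 3884, 4065, 16538, 12348, 10850,
         9322, 17728, 13969, 24971, 5413, 47317, 7893, 140, 4, 13394,
         8939, 3848, 7894, 2228, 17775)
hu2 <- c(17728, 140, 4169, 32550, 18275, NA, 22445, 0, 13394, 8939,
         3848, 7894, 2228, 17775, NA, 12895)

sum(hu1, na.rm = TRUE)  # 256803 -> zip 00921 should be dropped from list 1
sum(hu2, na.rm = TRUE)  # 162280 -> and kept in list 2
```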

Any help is very much appreciated.

Here is a simulated dataset for your problem so that others can use it too.

dist_sub <- list(list("zip" = 1:10,
                      "hu"  = rnorm(10)),
                 list("zip" = 8:12,
                      "hu"  = rnorm(5)),
                 list("zip" = c(1, 3, 11, 7),
                      "hu"  = rnorm(4)))

Here's a solution that I was able to come up with. I realized that loops were really the cleaner way to do this:

do.this <- function (x) {
  for (k in 1:(length(x) - 1)) {
    for (l in (k + 1):length(x)) {
      # indices of zips in set k that also appear in the later set l
      to.remove <- which(x[[k]][["zip"]] %in% x[[l]][["zip"]])
      # guard: if there is no overlap, x[-integer(0)] would drop *every* element
      if (length(to.remove) > 0) {
        x[[k]][["zip"]] <- x[[k]][["zip"]][-to.remove]
        x[[k]][["hu"]]  <- x[[k]][["hu"]][-to.remove]
      }
    }
  }
  return(x)
}

The idea is really simple: for each set of zips we remove the elements that are repeated in any set after it. We only loop up to the penultimate set, because the last set never loses anything: duplicates are always removed from the earlier of the two sets.
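The inner step can be seen in isolation on a made-up two-set example (the names a and b are just for illustration), including the edge case the guard above handles:

```r
a <- list(zip = c(901, 902, 921), hu = c(4990, NA, 17728))
b <- list(zip = c(921, 934),      hu = c(17728, 140))

to.remove <- which(a$zip %in% b$zip)  # position of 921 in a$zip
a$zip <- a$zip[-to.remove]
a$hu  <- a$hu[-to.remove]
a$zip  # 901 902 -- 921 now lives only in b

# caveat: if the sets shared nothing, to.remove would be integer(0),
# and a$zip[-integer(0)] drops ALL elements -- hence the length() guard
```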

Your criterion, i.e. keeping each zip in the list with the lowest sum of hu, can then be implemented with the function above. All you need to do first is reorder dist_sub by sum of hu, like so:

sum_hu <- sapply(dist_sub, function (k) sum(k[["hu"]], na.rm=TRUE))
dist_sub <- dist_sub[order(sum_hu, decreasing=TRUE)]

Now dist_sub is sorted by sum_hu in decreasing order, which means that for each set, the sets before it have a larger sum_hu. Therefore, if the sets at positions i and j (i < j) share a value a, then a is removed from the i-th set; it survives only in the last set containing it, which is the one with the lowest sum of hu. Do you think that makes sense?
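Putting it together on the simulated data (do.this is repeated here, with the no-overlap guard, so the snippet runs on its own; set.seed makes the rnorm draws reproducible):

```r
do.this <- function (x) {
  for (k in 1:(length(x) - 1)) {
    for (l in (k + 1):length(x)) {
      to.remove <- which(x[[k]][["zip"]] %in% x[[l]][["zip"]])
      if (length(to.remove) > 0) {  # x[-integer(0)] would drop everything
        x[[k]][["zip"]] <- x[[k]][["zip"]][-to.remove]
        x[[k]][["hu"]]  <- x[[k]][["hu"]][-to.remove]
      }
    }
  }
  return(x)
}

set.seed(1)
dist_sub <- list(list("zip" = 1:10,           "hu" = rnorm(10)),
                 list("zip" = 8:12,           "hu" = rnorm(5)),
                 list("zip" = c(1, 3, 11, 7), "hu" = rnorm(4)))

sum_hu   <- sapply(dist_sub, function (k) sum(k[["hu"]], na.rm = TRUE))
dist_sub <- dist_sub[order(sum_hu, decreasing = TRUE)]
cleaned  <- do.this(dist_sub)

# every zip should now appear in exactly one set
all_zips <- unlist(lapply(cleaned, `[[`, "zip"))
anyDuplicated(all_zips)  # 0, i.e. no zip is left in two sets
```

Note that with this seed one set can end up empty: any set whose zips all reappear in lower-sum sets loses everything, which is exactly the intended behaviour.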

PS: I've called the function do.this because I usually like writing generic solutions, while this was a very specific question, albeit an interesting one.
