r - Check how many times each value on a vector is on a set of areas

I have two data frames. The first holds the coordinates of some points; the second holds a set of rectangular areas, each defined by minimum and maximum limits on both lat and lon. For each point, I want to know which area (or areas) it falls in and the total capacity available to it.

For example, df1 has the points, and df2 has the areas and their capacities:

df1 <- data.frame(cluster = c("id1", "id2", "id3"),
              lat_m = c(-3713015, -4086295, -3710672),
              lon_m = c(-6556760, -6516930, -6569831))
df2 <- data.frame(id = c("a1","a2","a3"),
              max_lat = c(-3713013,-3713000, -3710600),
              min_lat = c(-3713017,-3713100, -3710700),
              max_lon = c(-6556755,-6556740, -6569820),
              min_lon = c(-6556765,-6556800, -6569840),
              capacity = c(5,2,3))

I want to get something like this:

result <- data.frame(cluster = c("id1", "id2", "id3"),
                 areas = c(2, 0, 1),
                 areas_id = c("a1, a2", "", "a3"),
                 capacity = c(7, 0, 3))

My data has over 1 million points and over 10,000 areas (and it will grow), so ideally I should avoid for loops.

You can join the two tables on the >= and <= conditions (a non-equi join), then summarise by cluster group.

library(data.table)
library(magrittr) # not necessary, just loaded for %>%
setDT(df1)
setDT(df2)

df2[df1, on = .(min_lat <= lat_m, max_lat >= lat_m, min_lon <= lon_m, max_lon >= lon_m)
    , .(cluster, id, capacity)] %>% # these first two lines do the join
  .[, .(areas = sum(!is.na(capacity))
       , areas_id = paste(id, collapse = ', ')
       , capacity = sum(capacity, na.rm = TRUE))
    , by = cluster] # this summarises each cluster group of rows


#    cluster areas areas_id capacity
# 1:     id1     2   a1, a2        7
# 2:     id2     0       NA        0
# 3:     id3     1       a3        3
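
Note that for id2 the output above shows NA in areas_id, whereas the desired result in the question has an empty string. A minimal sketch of one way to get that, assuming the same df1 and df2 as in the question: drop the NA ids before pasting, since paste() on an empty character vector with collapse returns "". The intermediate name `joined` is only introduced here for illustration.

```r
library(data.table)

# Same example data as in the question
df1 <- data.table(cluster = c("id1", "id2", "id3"),
                  lat_m = c(-3713015, -4086295, -3710672),
                  lon_m = c(-6556760, -6516930, -6569831))
df2 <- data.table(id = c("a1", "a2", "a3"),
                  max_lat = c(-3713013, -3713000, -3710600),
                  min_lat = c(-3713017, -3713100, -3710700),
                  max_lon = c(-6556755, -6556740, -6569820),
                  min_lon = c(-6556765, -6556800, -6569840),
                  capacity = c(5, 2, 3))

# Non-equi join: every point is kept; points outside all areas get NA id/capacity
joined <- df2[df1,
              on = .(min_lat <= lat_m, max_lat >= lat_m,
                     min_lon <= lon_m, max_lon >= lon_m),
              .(cluster, id, capacity)]

# Summarise per point, ignoring the NA rows produced by unmatched points
result <- joined[, .(areas    = sum(!is.na(id)),
                     areas_id = paste(id[!is.na(id)], collapse = ", "),
                     capacity = sum(capacity, na.rm = TRUE)),
                 by = cluster]
```

Counting on `id` rather than `capacity` also keeps the area count right if some area happened to have a missing capacity.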

SQL version (partly borrowed from @shree's answer):

library(sqldf)

sqldf("
select    df1.cluster
          , case  when sum(df2.capacity) is NULL
                    then 0
                  else count(*)
          end as areas
          , group_concat(df2.id) as areas_id
          , coalesce(sum(df2.capacity), 0) as capacity
from      df1 
          left join df2 
          on  df1.lat_m between df2.min_lat and df2.max_lat 
              and df1.lon_m between df2.min_lon and df2.max_lon
group by  df1.cluster
")

#   cluster areas areas_id capacity
# 1     id1     2    a1,a2        7
# 2     id2     0     <NA>        0
# 3     id3     1       a3        3

Here's a solution using sqldf and dplyr:

library(sqldf)
library(dplyr)

sql <- paste0(
         "SELECT df1.cluster, df2.id, df2.capacity ",
         "FROM df1 LEFT JOIN df2 ON (df1.lat_m BETWEEN df2.min_lat AND df2.max_lat) AND ",
         "(df1.lon_m BETWEEN df2.min_lon AND df2.max_lon)"
       )

result <- sqldf(sql) %>%
  group_by(cluster) %>%
  summarise(
    areas = n_distinct(id) - anyNA(id),
    areas_id = toString(id),
    capacity = sum(capacity, na.rm = TRUE)
  )

# # A tibble: 3 x 4
#   cluster areas areas_id capacity
#   <fct>   <int> <chr>       <dbl>
# 1 id1         2 a1, a2       7.00
# 2 id2         0 NA           0
# 3 id3         1 a3           3.00
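
The same join-then-summarise idea can probably be written without sqldf. This is only a sketch, and it assumes dplyr 1.1.0 or later, where left_join() accepts non-equi conditions through join_by() and its between() helper. The data are the df1 and df2 from the question, and NA ids are dropped before toString() so that a point with no area gets an empty string.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for join_by()

# Same example data as in the question
df1 <- data.frame(cluster = c("id1", "id2", "id3"),
                  lat_m = c(-3713015, -4086295, -3710672),
                  lon_m = c(-6556760, -6516930, -6569831))
df2 <- data.frame(id = c("a1", "a2", "a3"),
                  max_lat = c(-3713013, -3713000, -3710600),
                  min_lat = c(-3713017, -3713100, -3710700),
                  max_lon = c(-6556755, -6556740, -6569820),
                  min_lon = c(-6556765, -6556800, -6569840),
                  capacity = c(5, 2, 3))

result <- df1 %>%
  # keep every point; match it to each area whose bounds contain it
  left_join(df2, by = join_by(between(lat_m, min_lat, max_lat),
                              between(lon_m, min_lon, max_lon))) %>%
  group_by(cluster) %>%
  summarise(
    areas    = sum(!is.na(id)),            # number of matching areas
    areas_id = toString(id[!is.na(id)]),   # "" when nothing matched
    capacity = sum(capacity, na.rm = TRUE)
  )
```

With millions of points it would be worth benchmarking this against the data.table join before settling on either.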
