I have two data frames: the first holds the coordinates of some points, and the second holds a set of areas, each bounded in both latitude and longitude. For each point, I want to know which area (or areas) it falls in and the total capacity available to it.
For example, df1 has the points, and df2 has the areas and capacities:
df1 <- data.frame(cluster = c("id1", "id2", "id3"),
lat_m = c(-3713015, -4086295, -3710672),
lon_m = c(-6556760, -6516930, -6569831))
df2 <- data.frame(id = c("a1","a2","a3"),
max_lat = c(-3713013,-3713000, -3710600),
min_lat = c(-3713017,-3713100, -3710700),
max_lon = c(-6556755,-6556740, -6569820),
min_lon = c(-6556765,-6556800, -6569840),
capacity = c(5,2,3))
I want to get something like this:
result <- data.frame(cluster = c("id1", "id2", "id3"),
areas = c(2, 0, 1),
areas_id = c("a1, a2", "", "a3"),
capacity = c(7, 0, 3))
My data has over 1 million points and over 10,000 areas (and it will grow), so ideally I should avoid for loops.
You can join the two tables together on >= and <= conditions, then summarise by cluster group.
library(data.table)
library(magrittr) # not necessary, just loaded for %>%
setDT(df1)
setDT(df2)
df2[df1, on = .(min_lat <= lat_m, max_lat >= lat_m, min_lon <= lon_m, max_lon >= lon_m)
, .(cluster, id, capacity)] %>% # these first two lines do the join
.[, .(areas = sum(!is.na(capacity))
, areas_id = paste(id, collapse = ', ')
, capacity = sum(capacity, na.rm = TRUE))
, by = cluster] # this summarises each cluster group of rows
# cluster areas areas_id capacity
# 1: id1 2 a1, a2 7
# 2: id2 0 NA 0
# 3: id3 1 a3 3
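As a sanity check on the join logic, the same inclusive bounds test can be sketched in base R on the small example data. This is only an illustration, not a replacement for the join: the names inside and check are made up here, and outer() builds a points-by-areas matrix, so it is assumed to be practical only for small inputs, not for 1 million points against 10,000 areas.

```r
# Small example data copied from the question.
df1 <- data.frame(cluster = c("id1", "id2", "id3"),
                  lat_m = c(-3713015, -4086295, -3710672),
                  lon_m = c(-6556760, -6516930, -6569831))
df2 <- data.frame(id = c("a1", "a2", "a3"),
                  max_lat = c(-3713013, -3713000, -3710600),
                  min_lat = c(-3713017, -3713100, -3710700),
                  max_lon = c(-6556755, -6556740, -6569820),
                  min_lon = c(-6556765, -6556800, -6569840),
                  capacity = c(5, 2, 3))

# inside[i, j] is TRUE when point i lies within area j (bounds inclusive).
# "inside" and "check" are hypothetical names used only for this sketch.
inside <- outer(df1$lat_m, df2$min_lat, ">=") &
  outer(df1$lat_m, df2$max_lat, "<=") &
  outer(df1$lon_m, df2$min_lon, ">=") &
  outer(df1$lon_m, df2$max_lon, "<=")

# Summarise each row of the matrix, i.e. each point.
check <- data.frame(
  cluster  = df1$cluster,
  areas    = rowSums(inside),
  areas_id = apply(inside, 1, function(hit) toString(df2$id[hit])),
  capacity = apply(inside, 1, function(hit) sum(df2$capacity[hit]))
)
print(check)
```

Unlike the join above, a point with no matching area gets an empty string in areas_id here rather than NA, which matches the result asked for in the question.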
SQL version (partially stolen from @shree's answer):
library(sqldf)
sqldf("
select df1.cluster
, case when sum(df2.capacity) is NULL
then 0
else count(*)
end as areas
, group_concat(df2.id) as areas_id
, coalesce(sum(df2.capacity), 0) as capacity
from df1
left join df2
on df1.lat_m between df2.min_lat and df2.max_lat
and df1.lon_m between df2.min_lon and df2.max_lon
group by df1.cluster
")
# cluster areas areas_id capacity
# 1 id1 2 a1,a2 7
# 2 id2 0 <NA> 0
# 3 id3 1 a3 3
Here's a solution using sqldf and dplyr:
library(sqldf)
library(dplyr)
sql <- paste0(
"SELECT df1.cluster, df2.id, df2.capacity ",
"FROM df1 LEFT JOIN df2 ON (df1.lat_m BETWEEN df2.min_lat AND df2.max_lat) AND ",
"(df1.lon_m BETWEEN df2.min_lon AND df2.max_lon)"
)
result <- sqldf(sql) %>%
group_by(cluster) %>%
summarise(
areas = n_distinct(id) - anyNA(id),
areas_id = toString(id),
capacity = sum(capacity, na.rm = TRUE)
)
# A tibble: 3 x 4
#   cluster areas areas_id capacity
#   <fct>   <int> <chr>       <dbl>
# 1 id1         2 a1, a2       7.00
# 2 id2         0 NA           0
# 3 id3         1 a3           3.00