
How can I avoid double counts from overlapping areas in PostGIS?

I want to compute the impact of events in a town using PostGIS. I have a table with the point locations of the events (event_count_2019_geo) and a table containing all buildings of the town (utrecht_2020), also as point locations. For each event, I count the inhabited houses within a range of slightly more than 200 meters. See the code below.

-- In a range of ~200 meters
UPDATE event_count_2019_geo
SET gw200 = temp.aantal_woningen
FROM (SELECT locatie, count(event_count_2019_geo.locatie) AS aantal_woningen
      FROM event_count_2019_geo
           INNER JOIN utrecht_2020 AS bag ON (ST_DWithin(bag.geo_lokatie, event_count_2019_geo.geo_lokatie, 0.002))  
      WHERE  bag.verblijfsobjectgebruiksdoel LIKE '%woonfunctie%'
      GROUP BY locatie
     ) AS temp
WHERE event_count_2019_geo.locatie = temp.locatie;
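
A side note on the radius: the 0.002 passed to ST_DWithin is expressed in the units of the column's coordinate system. If geo_lokatie is stored as lon/lat degrees (an assumption; the question does not state the SRID), the search area is only roughly 200 m and is not the same distance east-west as north-south. A minimal sketch of the same count with the radius given in meters, assuming the column is geometry in EPSG:4326 so that it can be cast to geography:

```sql
-- Assumption: geo_lokatie is a geometry column in EPSG:4326 (lon/lat degrees).
-- With geography arguments, ST_DWithin interprets the distance in meters.
SELECT ev.locatie, count(*) AS aantal_woningen
FROM event_count_2019_geo AS ev
     INNER JOIN utrecht_2020 AS bag
        ON ST_DWithin(bag.geo_lokatie::geography, ev.geo_lokatie::geography, 200)
WHERE bag.verblijfsobjectgebruiksdoel LIKE '%woonfunctie%'
GROUP BY ev.locatie;
```

The cast may keep an existing index on the geometry column from being used, so on a large table this is probably slower unless a matching geography index exists.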

The trouble is that I end up with far too many houses being impacted by the events. I drew the 200 m range around each event (see picture below). Houses in the overlapping areas are counted twice, three times or even four times. The houses are counted correctly for each event, but I cannot simply sum the results. Is there a way to correct for these overlaps so that I arrive at a correct total number of houses over all selected events?

Image: range of 200 meters around each event

Edit: Example

Just a very simple example: a query for event 1 yields the houses A, B, D; event 2 yields C, D, E. The count for each event is 3 and their sum is 6 (which is indeed correct behavior), but what I would like to see is 5, as D is counted twice.
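
The example can be reproduced in plain PostgreSQL without any of the real tables. In this sketch the (event, house) pairs are hard-coded and the names hits, event and house are made up for illustration:

```sql
-- Toy data from the example: one row per (event, house) pair in range
WITH hits(event, house) AS (
    VALUES (1, 'A'), (1, 'B'), (1, 'D'),
           (2, 'C'), (2, 'D'), (2, 'E')
)
SELECT count(*)              AS summed_per_event,  -- 6: D appears once per event
       count(DISTINCT house) AS unique_houses      -- 5: D is counted only once
FROM hits;
```

Counting distinct houses over the joined rows, rather than summing the per-event counts, is what removes the overlap.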

Thanks to the suggestion of @JimJones I found the solution. I defined two views: one in the old way that finds all houses (find_houses_all) and the other to only return unique houses (find_houses_unique).

-- Find all houses within a radius of ~200m of an event
DROP VIEW IF EXISTS find_houses_all;

CREATE VIEW find_houses_all AS 
    SELECT bag.openbareruimte, bag.huisnummer, bag.huisletter, bag.huisnummertoevoeging,
           event_count_2019_geo.locatie
    FROM event_count_2019_geo
         INNER JOIN utrecht_2020 AS bag ON (ST_DWithin(bag.geo_lokatie, event_count_2019_geo.geo_lokatie, 0.002));  

-- Find all *unique* houses within a radius of ~200m of an event 
-- Each house is uniquely identified by openbareruimte, huisnummer, huisletter
-- and huisnummertoevoeging, so these are the columns to apply DISTINCT ON
DROP VIEW IF EXISTS find_houses_unique;

CREATE VIEW find_houses_unique AS 
    SELECT DISTINCT ON(bag.openbareruimte, bag.huisnummer, bag.huisletter, bag.huisnummertoevoeging) 
           bag.openbareruimte, bag.huisnummer, bag.huisletter, bag.huisnummertoevoeging,
           event_count_2019_geo.locatie
    FROM event_count_2019_geo
         INNER JOIN utrecht_2020 AS bag ON (ST_DWithin(bag.geo_lokatie, event_count_2019_geo.geo_lokatie, 0.002));
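
One caveat: DISTINCT ON without an ORDER BY keeps an unpredictable row per house, so which event a shared house is attributed to is not defined. The overall total is unaffected, but the per-event counts can differ between runs. A possible variant, shown as a plain query and assuming "nearest event" is an acceptable tie-break (the alias ev is introduced here for brevity):

```sql
-- Sketch: same columns as find_houses_unique, but each house is attributed
-- to its nearest event instead of an arbitrary one.
SELECT DISTINCT ON (bag.openbareruimte, bag.huisnummer, bag.huisletter, bag.huisnummertoevoeging)
       bag.openbareruimte, bag.huisnummer, bag.huisletter, bag.huisnummertoevoeging,
       ev.locatie
FROM event_count_2019_geo AS ev
     INNER JOIN utrecht_2020 AS bag
        ON ST_DWithin(bag.geo_lokatie, ev.geo_lokatie, 0.002)
-- The ORDER BY must start with the DISTINCT ON columns; the last key
-- decides which row is kept for each house.
ORDER BY bag.openbareruimte, bag.huisnummer, bag.huisletter, bag.huisnummertoevoeging,
         ST_Distance(bag.geo_lokatie, ev.geo_lokatie);
```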

I ran both scripts and indeed got the output I expected.

SELECT locatie, COUNT (locatie)
FROM find_houses_all -- find_houses_unique
GROUP BY locatie
ORDER BY locatie;

The output for find_houses_all is in all cases greater than or equal to the output for find_houses_unique. Sample output, copied into a spreadsheet with the difference computed, looks as follows:

Locatie           All  Unique  All - Unique
achter st.-ptr.   617     222           395
berlijnplein       87      87             0
boothstraat       653     175           478
breedstraat      1057     564           493
buurkerkhof       914     163           751
catharijnesngl.   134      38            96
domplein          842     149           693
...
Total           35399   13196         22203

Negative numbers would have indicated an error.
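
The grand totals can also be obtained directly from the two views instead of summing in a spreadsheet. A minimal sketch, assuming the views exist as defined above:

```sql
-- Grand totals over all events
SELECT (SELECT count(*) FROM find_houses_all)    AS total_all,     -- houses counted once per event in range
       (SELECT count(*) FROM find_houses_unique) AS total_unique;  -- each house counted once overall
```

As long as locatie is never NULL, these should correspond to the Total row of the table.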

Great work, you data scientists. I am learning! As a conventional statistician, I would have used the set-theory identity to obtain unique counts of the impacted cases (houses), i.e. n(A ∪ B) = n(A) + n(B) - n(A ∩ B).
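
For two events, that identity can be written out in SQL as well. This is only an illustration on the toy data from the question (houses A, B, D for event 1 and C, D, E for event 2); the names hits, a and b are made up:

```sql
WITH hits(event, house) AS (
    VALUES (1, 'A'), (1, 'B'), (1, 'D'),
           (2, 'C'), (2, 'D'), (2, 'E')
),
a AS (SELECT house FROM hits WHERE event = 1),
b AS (SELECT house FROM hits WHERE event = 2)
SELECT (SELECT count(*) FROM a)                      -- n(A) = 3
     + (SELECT count(*) FROM b)                      -- n(B) = 3
     - (SELECT count(*)
        FROM (SELECT house FROM a
              INTERSECT
              SELECT house FROM b) AS both_events)   -- n(A ∩ B) = 1
       AS unique_houses;                             -- 3 + 3 - 1 = 5
```

With more than two overlapping events the inclusion-exclusion terms multiply quickly, which is why counting distinct houses is usually the simpler route.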
