
Fastest approach for GeoPandas (reading and spatial join)

I have about a million rows of data with lat and lon attached, and more to come. Even now, reading the data from the SQLite file (I read it with pandas, then create a point for each row) takes a lot of time.

Now I need to run a spatial join over those points to assign a zip code to each one, and I really want to optimise this process.

So I wonder: is there a relatively easy way to parallelize those computations?

I am assuming you have already implemented this with GeoPandas and are still running into performance problems? You can improve on it by hashing your coordinate data, similar to how Google hashes its search data. Some databases already support this type of operation (e.g. MongoDB). Imagine you took the first (leftmost) digit of each coordinate and put each set of corresponding data into a separate SQLite file. Each digit then acts as a hash pointing to the correct file to search. Your lookup time improves by a factor of up to about 19 (range(-9, 10) has 19 values), assuming the data is spread evenly across the files and the hash lookup takes minimal time in comparison.
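A minimal sketch of the bucketing idea, assuming a 10-degree grid cell as the "hash" (floor(lat / 10) stands in for the leading digit, which avoids the ambiguity between 0 and -0). The helper names bucket_key and bucket_points are hypothetical, and the buckets are kept in a dict here; each one could just as well be written to its own SQLite file.

```python
import math
from collections import defaultdict


def bucket_key(lat, lon, cell_deg=10.0):
    """Return a coarse grid-cell key for a coordinate pair (hypothetical helper)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def bucket_points(points, cell_deg=10.0):
    """Group (lat, lon) pairs by grid cell so a lookup only scans one bucket."""
    buckets = defaultdict(list)
    for lat, lon in points:
        buckets[bucket_key(lat, lon, cell_deg)].append((lat, lon))
    return dict(buckets)


# Made-up sample coordinates, for illustration only.
points = [(40.71, -74.00), (41.88, -87.63), (34.05, -118.24), (40.44, -79.99)]
buckets = bucket_points(points)

# Only the bucket that shares the query's key needs to be searched.
candidates = buckets.get(bucket_key(40.5, -75.0), [])
```

A zip code polygon that straddles a cell boundary would have to be checked against every cell it overlaps, so the real speed-up depends on how the data is distributed.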

As it turned out, the most convenient solution in my case was to use the pandas.read_sql function with a specific chunksize parameter. In this case it returns a generator of data chunks, which can be fed to mp.Pool().map() along with the job. In my case the job consists of 1) reading the geoboundaries, 2) spatially joining the chunk, and 3) writing the chunk to the database.
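Roughly, that could look like the sketch below. It is not the exact code I ran: the table names (points, points_with_zip), the column names (id, lat, lon, zip) and the function names are placeholders, and it assumes the boundaries file uses the same CRS as the lat/lon values. The predicate keyword of sjoin needs a reasonably recent GeoPandas (older releases called it op).

```python
import multiprocessing as mp
import sqlite3
from functools import partial

import geopandas as gpd
import pandas as pd


def join_chunk(chunk, zips):
    """Spatially join one plain DataFrame chunk against the zip polygons."""
    points = gpd.GeoDataFrame(
        chunk,
        geometry=gpd.points_from_xy(chunk["lon"], chunk["lat"]),
        crs=zips.crs,  # assumption: lat/lon share the CRS of the polygons
    )
    joined = gpd.sjoin(points, zips, how="left", predicate="within")
    # Drop the geometry so the result can go back into SQLite as plain columns.
    dropped = joined.drop(columns=["geometry", "index_right"], errors="ignore")
    return pd.DataFrame(dropped)


def process_chunk(chunk, db_path, zip_path):
    """The job: 1) read boundaries, 2) join the chunk, 3) write it out."""
    zips = gpd.read_file(zip_path)[["zip", "geometry"]]  # "zip" is a placeholder column
    result = join_chunk(chunk, zips)
    conn = sqlite3.connect(db_path, timeout=60)  # wait if another worker is writing
    try:
        result.to_sql("points_with_zip", conn, if_exists="append", index=False)
        conn.commit()
    finally:
        conn.close()
    return len(result)


def run(db_path, zip_path, chunksize=50_000):
    """Read the points table in chunks and hand each chunk to a worker process."""
    conn = sqlite3.connect(db_path)
    try:
        chunks = pd.read_sql(
            "SELECT id, lat, lon FROM points", conn, chunksize=chunksize
        )
        job = partial(process_chunk, db_path=db_path, zip_path=zip_path)
        with mp.Pool() as pool:
            counts = pool.map(job, chunks)
    finally:
        conn.close()
    return sum(counts)
```

Call run(...) from under an if __name__ == "__main__": guard so worker processes can import the module safely. Note that Pool.map turns the generator into a list before dispatching, so all chunks sit in memory at once, and SQLite serializes writers, so the write step does not run in parallel.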

This method depends entirely on your spatial scale, but one way you might parallelize your join would be to subdivide your polygons into subpolygons and then offload the work to separate threads on separate cores. This geopandas r-tree tutorial demonstrates that technique, subdividing a large polygon into many small ones and intersecting each with a large set of points. But again, this only works if your spatial scale is appropriate: i.e., a few polygons and a lot of points (such as a few zip code polygons and millions of points in and around them).
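As a rough illustration of the subdivision step (my own sketch under assumptions, not the tutorial's code), the polygon can be cut along a regular grid and each piece checked against the points' spatial index. The function names and the cell_size parameter are hypothetical. The loop runs serially here; each piece is an independent unit of work that could be handed to a separate worker.

```python
import math

import geopandas as gpd
import numpy as np
from shapely.geometry import box


def subdivide(polygon, cell_size):
    """Cut a polygon into pieces along a regular grid of square cells."""
    minx, miny, maxx, maxy = polygon.bounds
    nx = max(1, math.ceil((maxx - minx) / cell_size))
    ny = max(1, math.ceil((maxy - miny) / cell_size))
    pieces = []
    for i in range(nx):
        for j in range(ny):
            cell = box(
                minx + i * cell_size,
                miny + j * cell_size,
                minx + (i + 1) * cell_size,
                miny + (j + 1) * cell_size,
            )
            piece = polygon.intersection(cell)
            if not piece.is_empty:
                pieces.append(piece)
    return pieces


def points_within(points, polygon, cell_size):
    """Return the points that intersect the polygon, testing piece by piece."""
    sindex = points.sindex
    hits = set()
    for piece in subdivide(polygon, cell_size):
        # Cheap bounding-box filter through the spatial index first.
        candidates = np.asarray(list(sindex.intersection(piece.bounds)), dtype=int)
        if candidates.size == 0:
            continue
        # Exact geometric test only on the surviving candidates.
        mask = points.iloc[candidates].intersects(piece).to_numpy()
        hits.update(int(pos) for pos in candidates[mask])
    # A point on a shared edge matches several pieces; the set removes duplicates.
    return points.iloc[sorted(hits)]
```

Smaller pieces have tighter bounding boxes, so the index discards more points before the exact test, but every piece adds some overhead, so cell_size needs tuning for the data at hand.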
