
How to get Dataframe with associated Zip codes from Lat and Long in python?

I am trying to reverse geocode coordinates to get their zip codes. I have a table / dataframe with 400,000 lat/long pairs, and I would like to get the zip code for each pair.

Below is the sample dataframe:

data = {'Site 1': '31.336968, -109.560959',
        'Site 2': '31.347745, -108.229963',
        'Site 3': '32.277621, -107.734724',
        'Site 4': '31.655494, -106.420484',
        'Site 5': '30.295053, -104.014528'}

My code:

import geopandas as gpd
from shapely.geometry import Point
gdf_locations = gpd.read_file('/Users/admin/Downloads/tl_2016_us_zcta510/tl_2016_us_zcta510.shp')

I downloaded the tl_2016_us_zcta510.shp file from here.

This is where I am getting stuck. One solution that I tried was to convert each coordinate to a NumPy array and pass the value in, but that seems extremely slow. I would like to do it on the dataframe using a lambda and get the results quickly.

What I tried:

import numpy as np

# Point takes (x, y), i.e. (longitude, latitude)
x = np.array((-73.986946106, 40.284328461))
x_pnt = Point(x)
mask = gdf_locations['geometry'].contains(x_pnt)
print(gdf_locations.loc[mask, 'GEOID10'])

While this gives me what I want, it is extremely slow. How can I make it faster and wrap it in a function that runs over every row? Any help is appreciated. Thank you.

PS: I have seen many blog posts and read up on this subject, but none seems to address it for a large-scale, real-time implementation.

Edit: I am specifically looking to get a dataframe with the following structure:

data = {'Site 1': ('31.336968, -109.560959', 94108),
        'Site 2': ('31.347745, -108.229963', 60616),
        'Site 3': ('32.277621, -107.734724', 78654),
        'Site 4': ('31.655494, -106.420484', 78090),
        'Site 5': ('30.295053, -104.014528', 78901)}

I understand how to convert a single lat/long to a zip code; what I'm not able to do is get the result as a dataframe. Hope this makes it more clear.

I haven't used geopandas very much, but I would try using scipy's cKDTree. It should be very fast for the amount of data you have. The only catch is that it works for point-to-point lookups, so you'd have to use the centroids of the polygons from the zip code dataset.
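One consequence of looking up centroids is that the nearest centroid is not always the polygon that contains the point. The sketch below illustrates this with two made-up rectangles (the names "A" and "B" and all coordinates are hypothetical, not taken from the ZCTA file), so treat it as an illustration of the trade-off rather than a statement about how often it happens with real zip code areas.

```python
import numpy as np
from scipy.spatial import cKDTree
from shapely.geometry import Point, box

# Two hypothetical areas: a large one and a small neighbour
polygons = {"A": box(0, 0, 10, 10), "B": box(10, 0, 12, 2)}
names = list(polygons)

# Centroids as an (n, 2) array of [x, y]
centroids = np.array(
    [[polygons[n].centroid.x, polygons[n].centroid.y] for n in names]
)

# A point near the edge of the large area
query = Point(9.5, 1.0)

# Nearest-centroid lookup
_, idx = cKDTree(centroids).query([query.x, query.y])
nearest = names[int(idx)]

# Exact point-in-polygon test for comparison
containing = [n for n in names if polygons[n].contains(query)]

print(nearest, containing)
```

Here the point lies inside "A" but is closer to the centroid of "B", so the two methods disagree. If exact containment matters, a point-in-polygon test is the safer choice.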

For example, transforming the centroids, which are Shapely points, to a numpy array:

import numpy as np

centroids = gdf_locations.centroid

# transform shapely points to an (n, 2) np array of [x, y] = [lon, lat]
point_array = []
for centroid in centroids:
    point_array.append([centroid.x, centroid.y])
point_array = np.array(point_array)
print(point_array[0])
# [-83.61511443  41.31279856]

To make sure it was going to be a quick lookup, I created 400,000 random coordinates:

# random test points; real data must use the same [lon, lat] column order as point_array
random_lat_long = np.random.randn(400000, 2) * 80
print(random_lat_long[0])
# e.g. [ -8.37429385 -23.19458311]

Now for the closest point:

from scipy import spatial
distance, index = spatial.cKDTree(point_array).query(random_lat_long)

On my computer, using %%timeit in Jupyter, this took about 1.7 seconds.

And finally, grabbing the zip codes from the dataframe:

zip_codes = gdf_locations.loc[index, 'GEOID10']

Edit: To get the latitude and longitude as part of the result:

Pull out the data and convert types:

lats_lons_zips = gdf_locations.loc[index, ['INTPTLAT10', 'INTPTLON10', 'GEOID10']]
# keep the zip code as an str to preserve leading zeros
lats_lons_zips = lats_lons_zips.astype({"INTPTLAT10": float, "INTPTLON10": float, "GEOID10": str})

Change the index to "Site XX":

new_index = ["Site " + str(i) for i in range(len(lats_lons_zips))]
lats_lons_zips.index = new_index

Finally, get the results:

print(lats_lons_zips.iloc[0:4].to_dict(orient="index"))
{ 'Site 0': { 'GEOID10': '00824',
              'INTPTLAT10': 17.7445568,
              'INTPTLON10': -64.6829328},
  'Site 1': { 'GEOID10': '96916',
              'INTPTLAT10': 13.2603723,
              'INTPTLON10': 144.7006789},
  'Site 2': { 'GEOID10': '96916',
              'INTPTLAT10': 13.2603723,
              'INTPTLON10': 144.7006789},
  'Site 3': { 'GEOID10': '04741',
              'INTPTLAT10': 47.453712,
              'INTPTLON10': -69.2229208}}
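To tie this back to the structure asked for in the question, here is a minimal sketch that parses the question's sample `data` into a dataframe and attaches the nearest-centroid code as a new column. The `centroid_table` values and the `ZCTA_A` to `ZCTA_D` labels are placeholders invented for illustration; with the real shapefile they would come from `gdf_locations` (centroid coordinates and `GEOID10`).

```python
import pandas as pd
from scipy.spatial import cKDTree

# Sample sites from the question, stored as "lat, lon" strings
data = {'Site 1': '31.336968, -109.560959',
        'Site 2': '31.347745, -108.229963',
        'Site 3': '32.277621, -107.734724',
        'Site 4': '31.655494, -106.420484',
        'Site 5': '30.295053, -104.014528'}

sites = pd.DataFrame.from_dict(data, orient="index", columns=["locstr"])

# Split the string into numeric latitude and longitude columns
latlon = sites["locstr"].str.split(",", expand=True).astype(float)
sites["lat"], sites["lon"] = latlon[0], latlon[1]

# Placeholder centroid table (hypothetical values, NOT real ZCTA data)
centroid_table = pd.DataFrame({
    "zcta": ["ZCTA_A", "ZCTA_B", "ZCTA_C", "ZCTA_D"],
    "lon": [-109.5, -108.2, -107.7, -104.0],
    "lat": [31.4, 31.3, 32.3, 30.3],
})

# Build the tree on [lon, lat] and query with the same column order
tree = cKDTree(centroid_table[["lon", "lat"]].to_numpy())
dist, idx = tree.query(sites[["lon", "lat"]].to_numpy())

# idx holds row positions in centroid_table, so index positionally
sites["zip"] = centroid_table["zcta"].to_numpy()[idx]

print(sites[["locstr", "zip"]])
```

Note that a nearest-neighbour query always returns some match, however far away it is. The returned `dist` values (in degrees here) can be used to discard matches that are implausibly distant.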

Maybe you need geopandas.sjoin.

In a spatial join, two geometry objects are merged based on their spatial relationship to one another.

First you need to convert the site data to a GeoDataFrame.

import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

gdf_locations = gpd.read_file('tempdata/tl_2016_us_zcta510.shp')
data = {'Site 1': '31.336968, -109.560959',
        'Site 2': '31.347745, -108.229963',
        'Site 3': '32.277621, -107.734724',
        'Site 4': '31.655494, -106.420484',
        'Site 5': '30.295053, -104.014528'}
df_site = pd.DataFrame.from_dict(data, orient='index',columns=['locstr'])
df_site['loc'] = df_site['locstr'].apply(lambda x: list(map(float,x.split(','))))
df_site['loc'] = df_site['loc'].apply(lambda x: Point(x[1],x[0]))
gdf_site = gpd.GeoDataFrame(df_site,geometry=df_site['loc'],crs=gdf_locations.crs).drop(['loc'], axis=1)
print(gdf_site)

                        locstr                       geometry
Site 1  31.336968, -109.560959  POINT (-109.560959 31.336968)
Site 2  31.347745, -108.229963  POINT (-108.229963 31.347745)
Site 3  32.277621, -107.734724  POINT (-107.734724 32.277621)
Site 4  31.655494, -106.420484  POINT (-106.420484 31.655494)
Site 5  30.295053, -104.014528  POINT (-104.014528 30.295053)

Then you can set op to choose the spatial relation tested between the sites and the shapefile polygons. The code below uses op='within', which matches a site point that lies inside a polygon; op='intersects' is the more general test:

intersects: The attributes will be joined if the boundary and interior of the object intersect in any way with the boundary and/or interior of the other object.

# note: newer GeoPandas releases (0.10+) name this argument `predicate` instead of `op`
gdf_site = gpd.sjoin(gdf_site,gdf_locations,how='left',op='within')
print(gdf_site[['locstr','GEOID10']])

                        locstr GEOID10
Site 1  31.336968, -109.560959   85607
Site 2  31.347745, -108.229963   88040
Site 3  32.277621, -107.734724   88030
Site 4  31.655494, -106.420484     NaN
Site 5  30.295053, -104.014528   79843
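For readers without the shapefile at hand, the same join can be tried on toy data. The sketch below is only an illustration under stated assumptions: the two rectangular "zones", their `ZCTA_A`/`ZCTA_B` labels, and the EPSG:4326 CRS are made up, and it assumes GeoPandas 0.10 or newer, where the argument is called `predicate` (older releases use `op`). Three of the question's sites are used as input.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Hypothetical polygons standing in for ZCTA areas (not real boundaries)
zones = gpd.GeoDataFrame(
    {"GEOID10": ["ZCTA_A", "ZCTA_B"]},
    geometry=[box(-110, 31, -109, 32), box(-109, 31, -108, 32)],
    crs="EPSG:4326",  # assumed CRS for the toy data
)

# Site coordinates as numeric columns
sites = pd.DataFrame(
    {"lat": [31.336968, 31.347745, 31.655494],
     "lon": [-109.560959, -108.229963, -106.420484]},
    index=["Site 1", "Site 2", "Site 4"],
)

# points_from_xy expects x = longitude, y = latitude
gdf_sites = gpd.GeoDataFrame(
    sites,
    geometry=gpd.points_from_xy(sites["lon"], sites["lat"]),
    crs=zones.crs,
)

# One vectorised join instead of one .contains() call per point
joined = gpd.sjoin(gdf_sites, zones, how="left", predicate="within")
result = joined[["lat", "lon", "GEOID10"]]
print(result)
```

With how='left', every site is kept and a site that falls in no polygon gets NaN in GEOID10, which matches the behaviour seen for Site 4 in the output above.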
