Selecting rows from pandas dataframe based on cKDTree indices

I was trying to do some quick-and-dirty reverse geocoding.

I have the dataframe poi (around 50,000 rows), where each point of interest has a lat/lng coordinate.

I also have the dataframe postcode_existing (around 180,000 rows), which maps lat/lng coordinates to postcodes.

I pulled out the relevant coordinate columns and used cKDTree to determine, for each point of interest in poi, the nearest lat/lng coordinate in postcode_existing.

import pandas as pd
import numpy as np
from scipy.spatial import cKDTree

# read poi and postcode csv files

# Extract subset
postcode_existing_coordinates = postcode_existing[['Latitude', 'Longitude']]

# Extract subset
poi_coordinates = poi[['Latitude', 'Longitude']]

# Construct tree
tree = cKDTree(postcode_existing_coordinates)

# Query
distances, indices = tree.query(poi_coordinates)

I end up with the relevant indices. I now want to select the rows from the dataframe postcode_existing using those indices.

I tried postcode_existing.ix[indices], but this does not seem to return the correct rows.

For example:

>>> postcode_existing.ix[indices].head()
       Postcode  Latitude  Longitude   Easting  Northing   GridRef  \
78579   HA3 0NS  51.57553  -0.304296  517605.0  187658.0  TQ176876   
178499      NaN       NaN        NaN       NaN       NaN       NaN   
62392       NaN       NaN        NaN       NaN       NaN       NaN   
78662   HA3 0TA  51.58409  -0.288764  518659.0  188635.0  TQ186886   
79470       NaN       NaN        NaN       NaN       NaN       NaN   

                County District    Ward DistrictCode   ...   Terminated  \
78579   Greater London    Brent  Kenton    E09000005   ...          NaN   
178499             NaN      NaN     NaN          NaN   ...          NaN   
62392              NaN      NaN     NaN          NaN   ...          NaN   
78662   Greater London    Brent  Kenton    E09000005   ...          NaN   
79470              NaN      NaN     NaN          NaN   ...          NaN   

       Parish NationalPark Population Households   Built up area  \
78579     NaN          NaN       72.0       25.0  Greater London   
178499    NaN          NaN        NaN        NaN             NaN   
62392     NaN          NaN        NaN        NaN             NaN   
78662     NaN          NaN      152.0       39.0  Greater London   
79470     NaN          NaN        NaN        NaN             NaN   

       Built up sub-division  Lower layer super output area  \
78579                  Brent                     Brent 004D   
178499                   NaN                            NaN   
62392                    NaN                            NaN   
78662                  Brent                     Brent 003E   
79470                    NaN                            NaN   

                    Rural/urban  Region  
78579   Urban major conurbation  London  
178499                      NaN     NaN  
62392                       NaN     NaN  
78662   Urban major conurbation  London  
79470                       NaN     NaN  

[5 rows x 25 columns]

But:

>>> postcode_existing.iloc[78579]
Postcode                                                  NW1 3AU
Latitude                                                  51.5237
Longitude                                               -0.143188
Easting                                                    528915
Northing                                                   182163
GridRef                                                  TQ289821
County                                             Greater London
District                                              Westminster
Ward                                       Marylebone High Street
DistrictCode                                            E09000033
WardCode                                                E05000641
Country                                                   England
CountyCode                                              E11000009
Constituency                     Cities of London and Westminster
Introduced                                             1980-01-01
Terminated                                                    NaN
Parish                                                        NaN
NationalPark                                                  NaN
Population                                                      7
Households                                                      1
Built up area                                      Greater London
Built up sub-division                         City of Westminster
Lower layer super output area                    Westminster 013A
Rural/urban                               Urban major conurbation
Region                                                     London
Name: 133733, dtype: object

Also:

>>> postcode_existing.iloc[178499]
Postcode                                        WC1E 6JL
Latitude                                         51.5236
Longitude                                      -0.135522
Easting                                           529447
Northing                                          182168
GridRef                                         TQ294821
County                                    Greater London
District                                          Camden
Ward                                          Bloomsbury
DistrictCode                                   E09000007
WardCode                                       E05000129
Country                                          England
CountyCode                                     E11000009
Constituency                      Holborn and St Pancras
Introduced                                    1980-01-01
Terminated                                           NaN
Parish                                               NaN
NationalPark                                         NaN
Population                                             1
Households                                             1
Built up area                             Greater London
Built up sub-division                             Camden
Lower layer super output area                Camden 026D
Rural/urban                      Urban major conurbation
Region                                            London
Name: 307029, dtype: object

These appear to be correct.

Why does postcode_existing.ix[indices] not select the correct rows? What should I be using instead?

I solved the problem. The issue was a mismatch between row positions in the dataframe and the index labels, caused by the earlier removal of certain rows.

To fix this, I simply reset the index:

postcode_existing.reset_index(inplace=True, drop=True)

I was then able to use loc to extract the relevant rows:

postcode_existing.loc[indices]
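
The effect of the reset can be sketched on a toy frame. The data below is made up for illustration (the names postcodes and indices are stand-ins, not the original files); the integer labels have gaps, as they would after rows are dropped, and indices plays the role of the positions returned by tree.query.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for postcode_existing after rows were removed:
# the integer labels (10, 20, 30, 40) no longer match positions (0..3).
postcodes = pd.DataFrame(
    {"Postcode": ["A", "B", "C", "D"]},
    index=[10, 20, 30, 40],
)

# Row positions, in the form cKDTree.query would return them.
indices = np.array([2, 0, 2])

# Positional lookup ignores the labels entirely.
by_position = postcodes.iloc[indices]

# After resetting the index, labels coincide with positions,
# so a label-based lookup with loc selects the same rows.
reset = postcodes.reset_index(drop=True)
by_label = reset.loc[indices]

print(by_position["Postcode"].tolist())
print(by_label["Postcode"].tolist())
```

Both lookups select the same rows; the difference is that iloc leaves the original labels untouched, while the reset discards them.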

The problem is that your index contains integers. This complicates things because pandas keeps track of positions as well as labels, and ix has to guess which one you mean. Because the index is of integer type, ix interprets indices as labels, whereas cKDTree returns positions. In this case, be explicit: use iloc, or reset the index first and then use loc.
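
Since cKDTree.query returns row positions, one way to avoid the label question altogether is positional indexing with iloc. The following is a minimal sketch under assumed data: the three postcodes, the two points of interest, and the name poi_with_postcode are invented for illustration. Note that ix itself was removed in pandas 1.0, so only iloc is used here.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# Hypothetical postcode table whose integer labels have gaps.
postcode_existing = pd.DataFrame(
    {
        "Postcode": ["P0", "P1", "P2"],
        "Latitude": [51.0, 52.0, 53.0],
        "Longitude": [0.0, 0.0, 0.0],
    },
    index=[5, 17, 42],
)

# Hypothetical points of interest.
poi = pd.DataFrame({"Latitude": [52.9, 51.1], "Longitude": [0.0, 0.1]})

# Build the tree on plain arrays; distances are Euclidean in degrees,
# which is the same quick approximation the question uses.
tree = cKDTree(postcode_existing[["Latitude", "Longitude"]].to_numpy())
distances, indices = tree.query(poi[["Latitude", "Longitude"]].to_numpy())

# indices are positions into the array given to cKDTree, so use iloc.
nearest = postcode_existing.iloc[indices].reset_index(drop=True)

# nearest now has labels 0..n-1, matching poi, so assignment aligns row by row.
poi_with_postcode = poi.assign(Postcode=nearest["Postcode"])
print(poi_with_postcode)
```

Resetting the index of the selected rows (rather than of the whole postcode table) keeps postcode_existing unchanged while still letting the result line up with poi.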

Documentation

DataFrame.ix: A primarily label-location based indexer, with integer position fallback.

.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type.

.ix is the most general indexer and will support any of the inputs in .loc and .iloc. .ix also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed positional and label based hierarchical indexes.

However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it's usually better to be explicit and use .iloc or .loc.
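
The label/position distinction on an integer axis can be illustrated with a small made-up frame (df and its values are hypothetical). The sketch also uses reindex, which returns NaN rows for labels that do not exist; that is the same symptom as the all-NaN rows in the question's .ix output. The KeyError behaviour shown for loc is that of pandas 1.0 and later; older versions only warned.

```python
import pandas as pd

# Integer labels that deliberately differ from row positions.
df = pd.DataFrame({"value": ["a", "b", "c"]}, index=[2, 0, 7])

# iloc is purely positional: offset 0 is the row labelled 2.
first_by_position = df.iloc[0]["value"]

# loc is purely label based: label 0 is the second row.
first_by_label = df.loc[0]["value"]

# On pandas >= 1.0, loc raises KeyError when a requested label is missing.
try:
    df.loc[[0, 1]]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

# reindex keeps going and fills missing labels with NaN instead.
reindexed = df.reindex([0, 1])

print(first_by_position, first_by_label, missing_label_raised)
print(reindexed)
```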
