Selecting rows from pandas dataframe based on cKDTree indices
I was trying to do some quick-and-dirty reverse geocoding. I have the dataframe poi (around 50,000 rows), where each point of interest has a lat/lng coordinate. I also have the dataframe postcode_existing (around 180,000 rows), which maps lat/lng coordinates to postcodes.
I pulled out the relevant coordinate columns and used cKDTree to determine, for each point of interest in poi, the nearest lat/lng coordinate in postcode_existing.
import pandas as pd
import numpy as np
from scipy.spatial import cKDTree
# read poi and postcode csv files
# Extract subset
postcode_existing_coordinates = postcode_existing[['Latitude', 'Longitude']]
# Extract subset
poi_coordinates = poi[['Latitude', 'Longitude']]
# Construct tree
tree = cKDTree(postcode_existing_coordinates)
# Query
distances, indices = tree.query(poi_coordinates)
I end up with the relevant indices. I am now looking to select the rows from the dataframe postcode_existing using those indices.
I tried postcode_existing.ix[indices], but this does not seem to return the correct rows.
For example:
>>> postcode_existing.ix[indices].head()
Postcode Latitude Longitude Easting Northing GridRef \
78579 HA3 0NS 51.57553 -0.304296 517605.0 187658.0 TQ176876
178499 NaN NaN NaN NaN NaN NaN
62392 NaN NaN NaN NaN NaN NaN
78662 HA3 0TA 51.58409 -0.288764 518659.0 188635.0 TQ186886
79470 NaN NaN NaN NaN NaN NaN
County District Ward DistrictCode ... Terminated \
78579 Greater London Brent Kenton E09000005 ... NaN
178499 NaN NaN NaN NaN ... NaN
62392 NaN NaN NaN NaN ... NaN
78662 Greater London Brent Kenton E09000005 ... NaN
79470 NaN NaN NaN NaN ... NaN
Parish NationalPark Population Households Built up area \
78579 NaN NaN 72.0 25.0 Greater London
178499 NaN NaN NaN NaN NaN
62392 NaN NaN NaN NaN NaN
78662 NaN NaN 152.0 39.0 Greater London
79470 NaN NaN NaN NaN NaN
Built up sub-division Lower layer super output area \
78579 Brent Brent 004D
178499 NaN NaN
62392 NaN NaN
78662 Brent Brent 003E
79470 NaN NaN
Rural/urban Region
78579 Urban major conurbation London
178499 NaN NaN
62392 NaN NaN
78662 Urban major conurbation London
79470 NaN NaN
[5 rows x 25 columns]
But:
>>> postcode_existing.iloc[78579]
Postcode NW1 3AU
Latitude 51.5237
Longitude -0.143188
Easting 528915
Northing 182163
GridRef TQ289821
County Greater London
District Westminster
Ward Marylebone High Street
DistrictCode E09000033
WardCode E05000641
Country England
CountyCode E11000009
Constituency Cities of London and Westminster
Introduced 1980-01-01
Terminated NaN
Parish NaN
NationalPark NaN
Population 7
Households 1
Built up area Greater London
Built up sub-division City of Westminster
Lower layer super output area Westminster 013A
Rural/urban Urban major conurbation
Region London
Name: 133733, dtype: object
Also:
>>> postcode_existing.iloc[178499]
Postcode WC1E 6JL
Latitude 51.5236
Longitude -0.135522
Easting 529447
Northing 182168
GridRef TQ294821
County Greater London
District Camden
Ward Bloomsbury
DistrictCode E09000007
WardCode E05000129
Country England
CountyCode E11000009
Constituency Holborn and St Pancras
Introduced 1980-01-01
Terminated NaN
Parish NaN
NationalPark NaN
Population 1
Households 1
Built up area Greater London
Built up sub-division Camden
Lower layer super output area Camden 026D
Rural/urban Urban major conurbation
Region London
Name: 307029, dtype: object
These appear to be correct.
Why does postcode_existing.ix[indices] not select the correct rows? What should I be using instead?
I solved the problem. The issue was a mismatch between the position in the dataframe and the index, caused by the removal of certain rows.
To fix this, I simply reset the index:
postcode_existing.reset_index(inplace=True, drop=True)
I was then able to use loc to extract the relevant rows:
postcode_existing.loc[indices]
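The fix above can be reproduced on a small scale. The following is a minimal sketch, not the original data: the four-row postcodes frame and two-row poi frame are hypothetical stand-ins, with an integer index that has gaps as if rows had been dropped. It assumes that cKDTree.query returns positions into the array the tree was built from, so selecting by position with iloc should give the same rows as resetting the index and using loc.

```python
import pandas as pd
from scipy.spatial import cKDTree

# Hypothetical stand-in for postcode_existing: rows were dropped earlier,
# so the integer labels (0, 5, 7, 9) no longer match the row positions.
postcodes = pd.DataFrame(
    {
        "Postcode": ["A", "B", "C", "D"],
        "Latitude": [51.0, 52.0, 53.0, 54.0],
        "Longitude": [0.0, 0.0, 0.0, 0.0],
    },
    index=[0, 5, 7, 9],
)

# Hypothetical stand-in for poi.
poi = pd.DataFrame({"Latitude": [53.1, 51.9], "Longitude": [0.0, 0.0]})

# Build the tree from a plain array and query the nearest neighbour.
tree = cKDTree(postcodes[["Latitude", "Longitude"]].to_numpy())
distances, indices = tree.query(poi[["Latitude", "Longitude"]].to_numpy())

# indices are positions into the array given to cKDTree: select by position.
nearest_by_position = postcodes.iloc[indices]

# Alternative from the answer: make labels equal positions, then use loc.
nearest_by_label = postcodes.reset_index(drop=True).loc[indices]

print(nearest_by_position["Postcode"].tolist())
print(nearest_by_label["Postcode"].tolist())
```

Using iloc leaves the original index untouched, which may matter if other code relies on those labels; reset_index with loc is equivalent once the old labels are no longer needed.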
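The problem is that you are using integers in your index. This confuses things, because pandas has to keep track of list-based locations (positions) as well as labels. ix attempts to figure out which one you mean. Because your index is integer-based, it interprets indices as labels rather than as list locations, which is why labels that no longer exist come back as rows of NaN. In this case, be explicit: use iloc to select by position, or use loc once the index has been reset so that labels and positions agree.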
DataFrame.ix: A primarily label-location based indexer, with integer position fallback.
.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type.
.ix is the most general indexer and will support any of the inputs in .loc and .iloc. .ix also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed positional and label based hierarchical indexes.
However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it's usually better to be explicit and use .iloc or .loc.
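The label-versus-position distinction described in the documentation can be illustrated with a small sketch. The three-row frame below is a hypothetical example, not data from the question. Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the sketch uses reindex to approximate its label-based behaviour on an integer index, where missing labels produce rows of NaN.

```python
import pandas as pd

# Hypothetical frame whose integer labels differ from the row positions.
df = pd.DataFrame({"value": ["a", "b", "c"]}, index=[10, 20, 30])

# Positional access: the rows at positions 0 and 2.
by_position = df.iloc[[0, 2]]

# Label access: the rows labelled 10 and 30 (the same two rows here).
by_label = df.loc[[10, 30]]

# Treating the number 0 as a label finds nothing: reindex fills the row
# with NaN, much like the NaN rows in the question's .ix output.
as_labels = df.reindex([0, 10])

print(by_position)
print(by_label)
print(as_labels)
```

In recent pandas versions, passing a list containing a missing label to loc raises a KeyError rather than returning NaN rows, which makes this kind of mismatch easier to spot.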