
Clustering with longitude and latitude with python

I have a dataset with longitude and latitude information, and I need a way to cluster my data if the distance between observations is less than 300 m. Does anyone have any idea? I tried this:

import pandas as pd
mydata=pd.read_csv("C:\\Users\\Gooljarsd\\Downloads\\restaurantes.csv")
bogota=mydata[(mydata['CITY']=="Bogota")]

import matplotlib.pyplot as plt

from scipy.cluster.hierarchy import dendrogram, linkage

X=bogota[['LAT','LNG']].values
print(X)

Z = linkage(X,
            method='ward',
            metric='euclidean'
    ) 

but I got this error:

MemoryError: Unable to allocate 4.56 GiB for an array with shape (611852671,) and data type float64
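For context, the array shape in the error matches the size of a condensed pairwise distance matrix, which holds n(n-1)/2 distances for n observations. The following is a small sketch of that arithmetic; the helper names `condensed_size` and `rows_from_condensed` are made up for illustration and are not part of SciPy.

```python
import math

def condensed_size(n):
    """Number of pairwise distances among n observations: n(n-1)/2."""
    return n * (n - 1) // 2

def rows_from_condensed(m):
    """Invert n(n-1)/2 = m to recover the number of observations."""
    return (1 + math.isqrt(1 + 8 * m)) // 2

m = 611852671                 # array length reported in the MemoryError
n = rows_from_condensed(m)    # number of rows that would produce it
gib = m * 8 / 2**30           # float64 uses 8 bytes per entry

print(n, round(gib, 2))       # 34982 4.56
```

So roughly 35,000 rows are enough to need about 4.56 GiB just for the distances, which is why approaches that avoid the full pairwise matrix scale better.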

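I would build a spatial tree and then return the indices of the points within a certain distance. This avoids computing the full pairwise distance matrix, so it should help with your memory issues. Note that scikit-learn's KDTree does not accept the haversine metric, so BallTree is used here, and haversine works on [lat, lng] pairs in radians, so both the coordinates and the 300 m radius are converted:

import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6371000  # approximate mean Earth radius in metres

# mydata is the DataFrame loaded in the question
tree_data = np.radians(mydata[['LAT', 'LNG']].values)
tree = BallTree(tree_data, metric='haversine')

bogota = mydata.loc[mydata.CITY == 'Bogota', ['LAT', 'LNG']]
# the radius is an angle in radians: 300 m divided by the Earth radius
idx, dist = tree.query_radius(np.radians(bogota.values), r=300 / EARTH_RADIUS_M, return_distance=True)
# dist is also in radians; multiply by EARTH_RADIUS_M to get metres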

There are plenty of other methods you can use. For example, query() returns the k nearest neighbours, so it is essentially KNN but with less memory use.
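Another option, not part of the original answer, is to get cluster labels directly with scikit-learn's DBSCAN. This is only a sketch under some assumptions: the function name `cluster_within_radius` and the sample coordinates are invented for illustration, the Earth radius is an approximate mean value, and `min_samples=1` is chosen so that every point belongs to some cluster.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000  # approximate mean Earth radius in metres (assumption)

def cluster_within_radius(lat_lng_deg, radius_m=300.0):
    """Label points so that points within radius_m of each other share a cluster.

    lat_lng_deg: sequence of [lat, lng] pairs in degrees.
    """
    # The haversine metric expects [lat, lng] in radians.
    coords = np.radians(np.asarray(lat_lng_deg, dtype=float))
    model = DBSCAN(
        eps=radius_m / EARTH_RADIUS_M,  # radius expressed as an angle in radians
        min_samples=1,                  # no noise points: every point gets a label
        metric="haversine",
        algorithm="ball_tree",
    )
    return model.fit_predict(coords)

# Hypothetical points near Bogota: the first three are about 222 m apart
# in a chain, the last one is several kilometres away.
points = [
    [4.7110, -74.0721],
    [4.7130, -74.0721],
    [4.7150, -74.0721],
    [4.7500, -74.0721],
]
labels = cluster_within_radius(points)
print(labels)
```

With `min_samples=1` the clusters are the connected groups of points linked by steps of at most 300 m, so two points in the same cluster can be more than 300 m apart if intermediate points chain them together. If that is not what you want, a different linkage rule would be needed.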

EDIT: Since your data is in latitude/longitude form, the standard Euclidean metric won't work and the haversine metric should be used instead. See the scikit-learn documentation for the other valid metrics.
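To illustrate why Euclidean distance on raw degrees is misleading, here is a minimal pure-Python haversine sketch. The function name `haversine_m` and the Earth radius value are assumptions made for this example; it is not the scikit-learn implementation.

```python
import math

EARTH_RADIUS_M = 6_371_000  # approximate mean Earth radius in metres (assumption)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# One degree of longitude covers very different ground distances
# depending on latitude, which a Euclidean metric on degrees ignores.
at_equator = haversine_m(0.0, 0.0, 0.0, 1.0)
at_lat_60 = haversine_m(60.0, 0.0, 60.0, 1.0)
print(round(at_equator), round(at_lat_60))

# A 300 m search radius expressed as an angle, as haversine-based trees expect.
radius_rad = 300 / EARTH_RADIUS_M
```

The distance for the same one-degree step at 60 degrees latitude is about half of the distance at the equator, so a fixed threshold in degrees does not correspond to a fixed threshold in metres.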
