Clustering with longitude and latitude in Python
I have a dataset with longitude and latitude information, and I need a way to cluster my data when the distance between observations is less than 300 m. Does anyone have any idea? I tried this:
import pandas as pd
mydata=pd.read_csv("C:\\Users\\Gooljarsd\\Downloads\\restaurantes.csv")
bogota=mydata[(mydata['CITY']=="Bogota")]
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
X=bogota[['LAT','LNG']].values
print(X)
Z = linkage(X, method='ward', metric='euclidean')
but I got this error:
MemoryError: Unable to allocate 4.56 GiB for an array with shape (611852671,) and data type float64
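For context, the shape in the error message is consistent with the condensed pairwise distance array that hierarchical clustering needs: n observations produce n(n-1)/2 distances. The sketch below only does that arithmetic. The row count of 34982 is an assumption inferred from the reported shape, not a value given in the question.

```python
def condensed_distance_count(n_points: int) -> int:
    """Number of pairwise distances among n_points observations."""
    return n_points * (n_points - 1) // 2


n_points = 34982  # assumed row count, inferred from the shape in the error
n_distances = condensed_distance_count(n_points)
size_gib = n_distances * 8 / 2**30  # float64 uses 8 bytes per value

print(n_distances)         # 611852671
print(round(size_gib, 2))  # 4.56
```

Because the array grows quadratically with the number of rows, any approach that materialises all pairwise distances will run into the same limit as the dataset grows.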
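I would build a spatial tree and then return the indices of points within a certain distance, which should help with your memory issues. Note that scikit-learn's KDTree does not accept the haversine metric, so the code below uses BallTree, which does. The haversine metric expects [latitude, longitude] in radians and returns distances in radians, so the 300 m radius has to be divided by the Earth's radius:
from sklearn.neighbors import BallTree
import numpy as np
import pandas as pd
EARTH_RADIUS_M = 6371000  # mean Earth radius in metres
# mydata is the DataFrame loaded in the question; haversine needs radians
tree_data = np.radians(mydata[['LAT', 'LNG']].values)
tree = BallTree(tree_data, metric='haversine')
bogota = np.radians(mydata.loc[mydata.CITY == 'Bogota', ['LAT', 'LNG']].values)
# 300 m expressed as an angle in radians
idx, dist = tree.query_radius(X=bogota, r=300 / EARTH_RADIUS_M, return_distance=True)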
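The radius query only returns neighbour lists, not cluster labels. One possible way to turn them into clusters, sketched here as an assumption rather than as part of the original answer, is to treat every pair of points within 300 m as an edge and take the connected components of that graph. The coordinates are made up for illustration, and BallTree is used because it accepts the haversine metric with inputs in radians.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)

# Made-up [LAT, LNG] points in degrees: two about 55 m apart, one several km away.
points_deg = np.array([
    [4.7110, -74.0721],
    [4.7115, -74.0721],
    [4.7500, -74.0721],
])

coords = np.radians(points_deg)  # haversine expects radians
tree = BallTree(coords, metric="haversine")
neighbours = tree.query_radius(coords, r=300 / EARTH_RADIUS_M)

# Build a sparse graph with an edge between every pair of points within 300 m.
n = len(coords)
rows = np.repeat(np.arange(n), [len(nb) for nb in neighbours])
cols = np.concatenate(list(neighbours))
graph = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))

# Each connected component of that graph becomes one cluster.
n_clusters, labels = connected_components(graph, directed=False)
print(n_clusters, labels)
```

This behaves like single-linkage clustering cut at 300 m: points are chained together through intermediate neighbours, so two members of the same cluster can be more than 300 m apart.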
There are plenty of other methods you can use. For example, query() returns the n nearest neighbours, so it is essentially a KNN search but with less memory use.
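A minimal sketch of the query() call, assuming a BallTree with the haversine metric and made-up coordinates. Note that query() returns distances first and indices second, the reverse of query_radius(), and that the distances come back in radians.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)

# Made-up [LAT, LNG] points in degrees.
points_deg = np.array([
    [4.7110, -74.0721],
    [4.7115, -74.0721],
    [4.7500, -74.0721],
])

coords = np.radians(points_deg)
tree = BallTree(coords, metric="haversine")

# k=2 because each point's closest match is itself at distance 0.
dist_rad, idx = tree.query(coords, k=2)

nearest_idx = idx[:, 1]                      # index of the nearest other point
nearest_m = dist_rad[:, 1] * EARTH_RADIUS_M  # convert radians to metres
print(nearest_idx, nearest_m)
```

Filtering `nearest_m` against 300 then shows which points have at least one neighbour inside the threshold.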
EDIT: Since your data is in latitude and longitude form, the standard Euclidean metric won't give distances in metres, and the haversine metric should be used instead. The other valid metrics are listed in the scikit-learn documentation.
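To make the haversine point concrete, here is a small standard-library sketch of the great-circle distance. The function name `haversine_m` and the Earth radius value are choices made for this illustration. It shows that coordinates in degrees have to be converted before a threshold such as 300 m means anything.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)


def haversine_m(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Two made-up points 0.0005 degrees of latitude apart.
d = haversine_m(4.7110, -74.0721, 4.7115, -74.0721)
print(d < 300)  # True: the points are roughly 56 m apart
```

A Euclidean distance on the raw columns would give 0.0005 for the same pair, a number in degrees that cannot be compared with 300 m directly.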