Clustering latitude and longitude with Python

Question

I have a dataset with longitude and latitude information, and I need a way to cluster my data so that observations less than 300 m apart end up in the same cluster. Does anyone have any ideas? I tried this:

import pandas as pd
mydata=pd.read_csv("C:\\Users\\Gooljarsd\\Downloads\\restaurantes.csv")
bogota=mydata[(mydata['CITY']=="Bogota")]

import matplotlib.pyplot as plt

from scipy.cluster.hierarchy import dendrogram, linkage

X=bogota[['LAT','LNG']].values
print(X)

Z = linkage(X,
            method='ward',
            metric='euclidean'
    )

But I got this error:

MemoryError: Unable to allocate 4.56 GiB for an array with shape (611852671,) and data type float64
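
For scale, the array shape in the error matches the number of unordered point pairs that linkage() needs when it is given raw observations. The sketch below reproduces the figure; the point count of 34982 is an assumption inferred from the error message, not read from the dataset.

```python
# Sketch: reproduce the array size reported in the MemoryError.
# n_points is inferred from the error message, not taken from the dataset itself.
n_points = 34982

# linkage() on raw observations needs one distance per unordered pair of points
n_pairs = n_points * (n_points - 1) // 2

# float64 uses 8 bytes per value; 2**30 bytes make one GiB
size_gib = n_pairs * 8 / 2**30

print(n_pairs)
print(round(size_gib, 2))
```

Because the pair count grows with the square of the number of points, any method that materialises all pairwise distances will hit the same wall on a dataset of this size.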

Answer 1

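I would build a spatial tree and then return the indices of the points within a certain distance. This avoids computing every pairwise distance, so it should help with your memory issues. scikit-learn's KDTree does not accept the haversine metric, so the code below uses BallTree, which does:

from sklearn.neighbors import BallTree

import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres

# my_data is the DataFrame loaded in the question (named mydata there)
# haversine expects (lat, lng) in radians
tree_data = np.radians(my_data[['LAT', 'LNG']].values)
tree = BallTree(tree_data, metric='haversine')

bogota = my_data.loc[(my_data.CITY == 'Bogota'), ['LAT', 'LNG']]
# the radius is also in radians: 300 m divided by the Earth's radius
idx, dist = tree.query_radius(X=np.radians(bogota.values), r=300 / EARTH_RADIUS_M, return_distance=True)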

There are plenty of other methods you can use. For example, query() returns the k nearest neighbours of each point, so it is essentially a k-nearest-neighbours search, but with less memory use than a full pairwise distance matrix.
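
As a rough illustration of query(), here is a minimal sketch on three made-up points. The coordinates, the variable names and the Earth radius constant are assumptions for the example. BallTree is used because, as far as I know, it is the scikit-learn tree that accepts the haversine metric.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)

# Made-up (lat, lng) points in degrees, for illustration only
points_deg = np.array([
    [4.6000, -74.0800],
    [4.6009, -74.0800],  # roughly 100 m north of the first point
    [4.7000, -74.0800],  # roughly 11 km north of the first point
])

# The haversine metric expects (lat, lng) in radians
points_rad = np.radians(points_deg)
tree = BallTree(points_rad, metric="haversine")

# k=2 because each point's nearest neighbour is itself, at distance 0
dist_rad, idx = tree.query(points_rad, k=2)

nearest_idx = idx[:, 1]                      # index of the closest other point
nearest_m = dist_rad[:, 1] * EARTH_RADIUS_M  # radians on a unit sphere -> metres

print(nearest_idx)
print(nearest_m)
```

The tree stores only the points themselves, so memory grows roughly linearly with the number of rows rather than with the number of pairs.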

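EDIT: Since your data is in latitude/longitude form, the standard Euclidean metric won't give distances in metres, and the haversine metric should be used instead. Haversine expects coordinates in radians, in (latitude, longitude) order, and returns distances in radians on a unit sphere, so multiply by the Earth's radius to get metres. See the scikit-learn documentation for the other valid metrics.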
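
To go from neighbour lookups to actual cluster labels, one option (a swapped-in technique, not something the answer above prescribes) is scikit-learn's DBSCAN with the haversine metric and min_samples=1, which groups points connected by chains of hops of at most 300 m. The coordinates and names below are made up for the sketch.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)

# Made-up (lat, lng) points in degrees, for illustration only
points_deg = np.array([
    [4.6000, -74.0800],
    [4.6020, -74.0800],  # about 222 m north of the first point
    [4.6040, -74.0800],  # about 222 m north of the second, about 445 m from the first
    [4.7000, -74.0800],  # several kilometres away from all the others
])

# eps must be in radians because haversine works on a unit sphere
eps_rad = 300 / EARTH_RADIUS_M

# min_samples=1 makes every point a core point, so no point is labelled as noise
model = DBSCAN(eps=eps_rad, min_samples=1, metric="haversine", algorithm="ball_tree")
labels = model.fit_predict(np.radians(points_deg))

print(labels)
```

Note the chaining effect: the first and third points are more than 300 m apart but share a cluster because the second point links them. If every pair inside a cluster must be within 300 m, a different criterion (for example complete linkage on a subset of the data) would be needed.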

Solution 1
0 2021-01-13 05:42:17
