
How to group and aggregate data using pandas/Python only if a specific condition/calculation is met?

There is a pandas.DataFrame df that looks like this:

City     Country   Latitude    Longitude      Population   ...

Berlin   Germany   52.516602   13.304105      118704
Berlin   Germany   52.430884   13.192662      292000
...
Berlin   USA       39.7742446  -75.0013423    7588
Berlin   USA       43.9727912  -88.9858084    5524

I would like to group the data by the columns City and Country and sum up their population:

grouped_data = df.groupby(['City', 'Country'])['Population'].agg('sum').reset_index()

But to handle ambiguity (the two entries for USA must not be merged), my idea was to calculate and check the distance between the lat/long coordinates for every potential groupby() result.

Assuming I have a distance function that returns the distance between two geographic points in kilometres, I'd like to group all entries by City and Country and sum up their population only if the result of distance() is, e.g., less than 50 kilometres.

The output for the example above could look like:

City    Country  Latitude                Longitude              Population

Berlin  Germany  [52.516602, 52.430884]  [13.304105, 13.192662] 410704
...
Berlin  USA      39.7742446              -75.0013423            7588
Berlin  USA      43.9727912              -88.9858084            5524

Any idea how to solve this in pandas? I'd be grateful for your suggestions.

What you are asking for is really a network problem: two nodes are connected if their distance is less than 50 km. You can compute a distance matrix and build the graph with networkx. Something along these lines:

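import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import haversine_distances as haversine

# pairwise haversine distances in km (input in radians, scaled by Earth's radius)
coords = np.deg2rad(df[['Latitude', 'Longitude']].to_numpy())
dist_mat = haversine(coords) * 6371

# two rows are connected if they are less than 50 km apart
adjacency = dist_mat < 50

# from_numpy_matrix was removed in networkx 3.0; from_numpy_array replaces it
G = nx.from_numpy_array(adjacency)
components = list(nx.connected_components(G))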

You can then use groupby on those components, for example by giving each row the label of the component it belongs to.
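To make the groupby step concrete, here is a minimal sketch using the sample rows from the question. It is not the exact code above: so that it runs with only NumPy and pandas, it swaps in a hand-written haversine and a small union-find for scikit-learn and networkx. The helper names pairwise_haversine_km and component_labels are hypothetical, and grouping by City and Country in addition to the component label is an assumption, made so that differently named places that happen to be close are not merged.

```python
import numpy as np
import pandas as pd

# Sample rows taken from the question
df = pd.DataFrame({
    "City": ["Berlin", "Berlin", "Berlin", "Berlin"],
    "Country": ["Germany", "Germany", "USA", "USA"],
    "Latitude": [52.516602, 52.430884, 39.7742446, 43.9727912],
    "Longitude": [13.304105, 13.192662, -75.0013423, -88.9858084],
    "Population": [118704, 292000, 7588, 5524],
})

def pairwise_haversine_km(lat_deg, lon_deg, radius_km=6371.0):
    """Hypothetical helper: n x n matrix of great-circle distances in km."""
    lat = np.deg2rad(np.asarray(lat_deg, dtype=float))
    lon = np.deg2rad(np.asarray(lon_deg, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def component_labels(adjacency):
    """Hypothetical helper: label each node with the smallest index in its
    connected component, using a simple union-find."""
    n = len(adjacency)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in np.argwhere(adjacency).tolist():
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(n)]

# Rows closer than 50 km are connected; each component becomes one group
dist_km = pairwise_haversine_km(df["Latitude"], df["Longitude"])
df["component"] = component_labels(dist_km < 50)

grouped = (
    df.groupby(["City", "Country", "component"], sort=False)
      .agg(Latitude=("Latitude", list),
           Longitude=("Longitude", list),
           Population=("Population", "sum"))
      .reset_index()
      .drop(columns="component")
)
print(grouped)
```

Note that connected components are transitive: if A is within 50 km of B and B is within 50 km of C, all three end up in one group even when A and C are further apart than 50 km.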

Alternatively, it might be easier to bin the Lat/Long values and use groupby on those bins.
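The binning idea could be sketched as follows, again on the sample rows from the question. The bin size of 1 degree is an assumption chosen for illustration (roughly 111 km in latitude, less in longitude away from the equator), and the column names lat_bin and lon_bin are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample rows taken from the question
df = pd.DataFrame({
    "City": ["Berlin", "Berlin", "Berlin", "Berlin"],
    "Country": ["Germany", "Germany", "USA", "USA"],
    "Latitude": [52.516602, 52.430884, 39.7742446, 43.9727912],
    "Longitude": [13.304105, 13.192662, -75.0013423, -88.9858084],
    "Population": [118704, 292000, 7588, 5524],
})

bin_deg = 1.0  # assumed bin size in degrees; tune to the distance you care about

# Assign each row to a grid cell by flooring its coordinates
df["lat_bin"] = np.floor(df["Latitude"] / bin_deg).astype(int)
df["lon_bin"] = np.floor(df["Longitude"] / bin_deg).astype(int)

# Rows sharing City, Country and grid cell are summed together
binned = df.groupby(
    ["City", "Country", "lat_bin", "lon_bin"], sort=False, as_index=False
)["Population"].sum()
print(binned)
```

This is simpler than the graph approach but only approximate: two nearby points that fall on opposite sides of a bin edge are not merged. With bin_deg = 0.5, for instance, the two German rows land in different latitude bins.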


 