Python 3.6: get the average Y for all coordinates that share the same X

Question

I have a list of coordinates that looks like this:

my_list = [[1, 1], [1, 3], [1, 5], [2, 1], [2, 3]]

As we can see, the first three coordinates share the same X value but have different Y values, and the same is true of the other two coordinates. I want to build a new list that looks like this:

new_list = [[1, 3], [2, 2]]

where y1 = 3 = (1+3+5)/3 and y2 = 2 = (1+3)/2. I have written the code below, but it runs slowly.

I work with hundreds of thousands of coordinates, so the question is: how can I make this code run faster? Is there an optimization or an open-source library that can speed it up?

Thank you in advance.

mass = my_list  # the coordinate list from above

x_mass = []

for m in mass:
  x_mass.append(m[0])

set_x_mass = set(x_mass)
list_x_mass = list(set_x_mass)

performance_points = []

def function(i):
  # Collect every point whose X equals i (a full scan of mass for each unique X)
  unique_x_mass = []
  for m in mass:
    if m[0] == i:
      unique_x_mass.append(m)

  summ_y = 0
  for m in unique_x_mass:
    summ_y += m[1]
  point = [float(i), float(summ_y/len(unique_x_mass))]
  performance_points.append(point)
  return performance_points

for x in list_x_mass:
  function(x)

Answer 1

Create a DataFrame, group by the first column, and aggregate with mean:

import pandas as pd

L = [[1, 1], [1, 3], [1, 5], [2, 1], [2, 3]]

L1 = pd.DataFrame(L).groupby(0, as_index=False)[1].mean().values.tolist()
print(L1)
# [[1.0, 3.0], [2.0, 2.0]]

Answer 2

The pandas solution offered by @jezrael is elegant but slow (like almost everything in pandas). I would suggest using the itertools and statistics modules:

from statistics import mean
from itertools import groupby

# groupby only merges adjacent items, so L must already be sorted by X
grouper = groupby(L, key=lambda x: x[0])
# The next line is again more elegant, but slower (needs "import operator"):
# grouper = groupby(L, key=operator.itemgetter(0))
[[x, mean(yi[1] for yi in y)] for x, y in grouper]

The result is, of course, the same. For the sample list, the execution time is two orders of magnitude faster.
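One caveat with the itertools approach is that groupby only groups adjacent items, so an unsorted list would have to be sorted by X first. If that is a concern, a plain dictionary that accumulates a running sum and count per X does the job in a single pass. This is only a sketch, not something either answer proposed: the function name average_y_by_x is made up for illustration, and the means come back as floats.

```python
from collections import defaultdict


def average_y_by_x(points):
    """Return [[x, mean_y], ...] for each distinct x, ordered by x."""
    # Hypothetical helper: maps each x to [sum of y values, number of points].
    totals = defaultdict(lambda: [0, 0])
    for x, y in points:
        entry = totals[x]
        entry[0] += y
        entry[1] += 1
    # One division per distinct x; sorting only touches the distinct keys.
    return [[x, s / n] for x, (s, n) in sorted(totals.items())]


# Unsorted input is fine: grouping is by dictionary key, not by adjacency.
points = [[2, 1], [1, 1], [1, 5], [2, 3], [1, 3]]
print(average_y_by_x(points))
```

Each point is visited once, so the work grows linearly with the number of coordinates instead of rescanning the whole list for every unique X as the code in the question does.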


Answer 1: score 4, accepted, 2018-07-10 07:00:18

Answer 2: score 3, 2018-07-10 07:07:35
