Python 3.6. Get average Y for all same X coordinates

I have a list of coordinates that looks like this:

my_list = [[1, 1], [1, 3], [1, 5], [2, 1], [2, 3]]

As we can see, the first three coordinates share the same X value but have different Y values, and the same holds for the other two coordinates. I want to make a new list that looks like this:

new_list = [[1, 3], [2, 2]]

where y1 = 3 = (1+3+5)/3 and y2 = 2 = (1+3)/2. I have written the code below, but it runs slowly.

I work with hundreds of thousands of coordinates, so the question is: how can I make this code run faster? Is there any optimization or open-source library that can speed up my code?

Thank you in advance.

mass = my_list  # assumption: `mass` is the coordinate list shown above
x_mass = []

for m in mass:
  x_mass.append(m[0])

set_x_mass = set(x_mass) 
list_x_mass = list(set_x_mass) 

performance_points = [] 

def function(i):
  unique_x_mass = []
  for m in mass:
    if m[0] == i:
      unique_x_mass.append(m)

  summ_y = 0
  for m in unique_x_mass:
    summ_y += m[1]
  point = [float(m[0]), float(summ_y/len(unique_x_mass))] 
  performance_points.append(point)
  return performance_points

for x in list_x_mass:
  function(x)
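A note on why the code above is slow: `function` rescans the whole of `mass` once for every unique X value, so the work grows with (number of points) × (number of unique X values). A minimal sketch of a single-pass alternative is shown below. The function name `average_y_by_x` is made up for illustration, and the sketch assumes each point is an `[x, y]` pair with a hashable `x`.

```python
from collections import defaultdict

def average_y_by_x(points):
    """Return [[x, mean_y], ...] sorted by x, using one pass over points."""
    sums = defaultdict(float)   # x -> running sum of y
    counts = defaultdict(int)   # x -> number of points seen with that x
    for x, y in points:
        sums[x] += y
        counts[x] += 1
    # Build the result in ascending x order; floats match the original output format
    return [[float(x), sums[x] / counts[x]] for x in sorted(sums)]

my_list = [[1, 1], [1, 3], [1, 5], [2, 1], [2, 3]]
print(average_y_by_x(my_list))  # [[1.0, 3.0], [2.0, 2.0]]
```

Each point is touched once, and the input does not need to be sorted beforehand.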

Create a DataFrame and aggregate with mean:

import pandas as pd

L = [[1, 1], [1, 3], [1, 5], [2, 1], [2, 3]]

# Group rows by column 0 (x) and average column 1 (y)
L1 = pd.DataFrame(L).groupby(0, as_index=False)[1].mean().values.tolist()
print(L1)
[[1.0, 3.0], [2.0, 2.0]]

The pandas solution offered by @jezrael is elegant but slow (like almost everything in pandas). I would suggest using the modules itertools and statistics:

from statistics import mean
from itertools import groupby

# groupby only merges consecutive items, so L must already be sorted by x
grouper = groupby(L, key=lambda x: x[0])
# The next line is again more elegant, but slower (it also needs `import operator`):
# grouper = groupby(L, key=operator.itemgetter(0))
[[x, mean(yi[1] for yi in y)] for x, y in grouper]

The result is, of course, the same. For the sample list, the execution time is two orders of magnitude shorter.
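One caveat with the itertools approach: `groupby` only groups runs of consecutive items with equal keys, so an unsorted list would produce several partial groups for the same X. A hedged sketch that sorts first is shown below; the function name `average_y_sorted_groupby` and the `shuffled` sample are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter
from statistics import mean

def average_y_sorted_groupby(points):
    """Sort by x, then average y within each run of equal x values."""
    ordered = sorted(points, key=itemgetter(0))
    return [[x, mean(p[1] for p in group)]
            for x, group in groupby(ordered, key=itemgetter(0))]

# Same points as the sample list, but in shuffled order
shuffled = [[2, 1], [1, 1], [2, 3], [1, 5], [1, 3]]
print(average_y_sorted_groupby(shuffled))  # [[1, 3], [2, 2]]
```

The sort adds O(n log n) work, so for data that is already ordered by X the plain `groupby` call is enough.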
