Average of groupby elements in Python

Question

I have a list that looks like this:

58308.803701    132.227.127.170 50602   149.13.32.15      443   6   64
58308.815456    149.13.32.15    443     132.227.127.170   50602 6   60
58308.815524    132.227.127.170 50602   149.13.32.15      443   6   52
58308.817244    132.227.127.170 50602   149.13.32.15      443   6   57
58308.828987    149.13.32.15    443     132.227.127.170   50602 6   52
58308.829133    149.13.32.15    443     132.227.127.170   50602 6   57
58308.829169    132.227.127.170 50602   149.13.32.15      443   6   52
58308.912361    132.227.127.170 50603   86.4.136.93       443   6   64
58308.912497    132.227.127.170 50599   94.31.112.216     443   6   95
58308.912568    132.227.127.170 50599   94.31.112.216     443   6   96
58308.912977    132.227.127.170 50599   94.31.112.216     443   6   847
58308.913411    132.227.127.170 50599   94.31.112.216     443   6   154
58308.913484    132.227.127.170 50599   94.31.112.216     443   6   233
....
....
....

I want to group consecutive similar lines (those with the same five columns in the middle) and show in the output, for each group, the minimum of the first column and the statistics of the last column (min, max, mean, median, and any other available metrics), like the following:

58308.803701                            132.227.127.170 50602   149.13.32.15      443   6   64
58308.815456                            149.13.32.15    443     132.227.127.170   50602 6   60
min of(58308.815524,58308.817244)       132.227.127.170 50602   149.13.32.15      443   6   min/max/avg/...of(52,57)
min of(58308.828987,58308.829133)       149.13.32.15    443     132.227.127.170   50602 6   min/max/avg/...of(52,57)
58308.829169                            132.227.127.170 50602   149.13.32.15      443   6   52
58308.912361                            132.227.127.170 50603   86.4.136.93       443   6   64
min of(58308.912497,..,58308.913484)    132.227.127.170 50599   94.31.112.216     443   6   min/max/avg/...of(95,96,847,154,233)
....
....
....

Here is the code I have written so far and am trying to get working:

from itertools import groupby 
import re 
import numpy as np

tstFile=open("output","w+") 
with open('dataInput','r') as d:
      f1 = ([x for x in line.split()] for line in d)
      for a,b in groupby(f1,key=lambda x:x[1:6]):
          tstFile.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\n" %(min(x[0] for x in b)),min(x[6] for x in b)),max(x[6] for x in b)),np.average(x[6] for x in b)),np.mean(x[6] for x in b)),np.median(x[6] for x in b)),np.std(x[6] for x in b)))
tstFile.close()

But nothing really seems to work. It only works for min and max, and to get each result I have to use only one argument, like this:

tstFile=open("output","w+")
with open('dataInput','r') as d:
    f1 = ([x for x in line.split()] for line in d)
    for a,b in groupby(f1,key=lambda x:x[1:6]):
        tstFile.write("%s\n" %(min(x[6] for x in b)))
tstFile.close()

Any help would be appreciated!

Answer 1

When dealing with CSV files, it's generally advised to use the csv module. I've included sample code below that demonstrates how you could solve this problem.

If your input file is tab-delimited, change to delimiter='\t' and remove skipinitialspace=True in csv.reader. The tabs weren't present in the sample input, but they may have disappeared during copy/paste.

import csv
from itertools import groupby
import numpy as np

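# Python 3: open both files with newline='' as the csv module expects
# (the original Python 2 version opened the output with mode 'wb').
with open('data.csv', newline='') as in_file, open('out.csv', 'w', newline='') as out_file:
    reader = csv.reader(in_file, delimiter=' ', skipinitialspace=True)
    writer = csv.writer(out_file, delimiter='\t')
    # groupby only merges consecutive rows that share columns 1-5
    for key, group in groupby(reader, key=lambda r: r[1:6]):
        # materialize the group, then take the first and last columns as floats
        col0, col6 = np.array(list(group))[:, [0, 6]].transpose().astype(float)
        writer.writerow([min(col0)] + key + [int(min(col6)), int(max(col6)),
                                             np.mean(col6)])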

Output (I added some tabs to increase readability):

58308.803701    132.227.127.170 50602   149.13.32.15    443     6   64  64  64.0
58308.815456    149.13.32.15    443     132.227.127.170 50602   6   60  60  60.0
58308.815524    132.227.127.170 50602   149.13.32.15    443     6   52  57  54.5
58308.828987    149.13.32.15    443     132.227.127.170 50602   6   52  57  54.5
58308.829169    132.227.127.170 50602   149.13.32.15    443     6   52  52  52.0
58308.912361    132.227.127.170 50603   86.4.136.93     443     6   64  64  64.0
58308.912497    132.227.127.170 50599   94.31.112.216   443     6   95  847 285.0
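
The question also asks for the median and standard deviation, and its own code fails for two reasons worth spelling out: each group produced by groupby is an iterator that can be read only once, so every generator after the first sees an empty group, and the columns are still strings, so min and max compare them as text. Below is a minimal sketch of one way to handle both, using only the standard library. The function name summarize and the inline sample are illustrative assumptions, not part of the original answer, and pstdev is chosen because it is the population standard deviation, which is what np.std computes by default.

```python
from itertools import groupby
from statistics import mean, median, pstdev


def summarize(lines):
    """Yield one summary row per run of consecutive lines sharing columns 1-5."""
    rows = (line.split() for line in lines)
    rows = (r for r in rows if len(r) == 7)  # skip blank or malformed lines
    for key, group in groupby(rows, key=lambda r: r[1:6]):
        group = list(group)  # materialize: the group iterator is one-pass
        times = [float(r[0]) for r in group]
        sizes = [int(r[6]) for r in group]  # convert so comparisons are numeric
        yield [min(times)] + key + [min(sizes), max(sizes), mean(sizes),
                                    median(sizes), pstdev(sizes)]


# Three lines taken from the sample input in the question.
sample = """\
58308.815524    132.227.127.170 50602   149.13.32.15      443   6   52
58308.817244    132.227.127.170 50602   149.13.32.15      443   6   57
58308.828987    149.13.32.15    443     132.227.127.170   50602 6   52
""".splitlines()

for row in summarize(sample):
    print("\t".join(str(v) for v in row))
```

To run this on a file, pass the open file object instead of sample, for example summarize(open('dataInput')). Converting each group to a list costs memory proportional to the largest group, which should be acceptable for runs of consecutive packets like the ones shown.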

Answer 1
0 accepted 2014-05-14 14:53:53
