
To make averages from the data of a text file

I have a text file like the following, where blocks of two-column data appear between marker strings:

1   23
2   29
3   21
4   18
5   19
6   18
7   19
8   24
Cluster analysis done for this configuration!

1   23
2   22
3   19
4   18
5   23
6   17
7   19
8   31
9   21
10   27
11   19
Cluster analysis done for this configuration!

1   22
2   26
3   27
4   23
5   25
6   32
7   23
8   19
9   19
10   18
11   30
12   21
13   23
14   16
Cluster analysis done for this configuration!

1   23
2   19
3   23
4   27
5   20
6   17
7   15
8   22
9   16
10   23
11   20
12   23
Cluster analysis done for this configuration!

The desired output would be:

1 22.75
2 24.0
3 22.5
4 21.5
5 21.75
6 21.0
7 19.0
8 24.0
9 18.666666666666668
10 22.666666666666668
11 23.0
12 22.0
13 23.0
14 16.0

I would like to get an average for each of the numbers in the first column. In this example, the average value corresponding to '1' would be (23+23+22+23)/4 = 22.75, and similarly for '2', '3', and so on. Please note that the number of rows between the 'Cluster analysis...' strings is not the same, but that's ok. For example, the average value for '14' would just be 16 in this case, as no other number corresponds to '14' except in the third block.

I was thinking along the lines that one somehow needs to print all the numbers between the 'Cluster analysis...' strings, store them in an array or so, and then just take an average, but I couldn't implement it in code. Could anyone give me a lead?

I don't have any preference for the coding language; it just needs to solve the problem. I was thinking of bash/shell, but python is also welcome.

awk '/^[0-9]+ +[0-9]+$/ { # pick only lines with two numbers
         arr[$1] += $2    # accumulate the numbers in indexed bins
         n[$1]++          # keep track of how many numbers are in each bin
     }
     END {                        # finally,
         for (e in arr)           # for each bin
             print e, arr[e]/n[e] # print the index and its average
     }' your_input_file
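Note that awk visits the array indices in an unspecified order in the for (e in arr) loop, so if you need the rows in numeric order, pipe the output through sort -n.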

Here is a solution, assuming the data are contained in a string called 's':

from collections import defaultdict

s = '1 23' #....etc

def list_struct():
    return [0, 0]

data = defaultdict(list_struct) # format: {id: [occurrences, total]}

for line in s.split('\n'):
    if line[0:1].isdigit(): # I'm assuming that all the lines that start with a number are the 'right' lines
        n, value = line.split()
        data[int(n)][0] += 1
        data[int(n)][1] += int(value)

for elem in data:
    print(elem, data[elem][1] / data[elem][0])

Output with your data:

1 22.75
2 24.0
3 22.5
4 21.5
5 21.75
6 21.0
7 19.0
8 24.0
9 18.666666666666668
10 22.666666666666668
11 23.0
12 22.0
13 23.0
14 16.0

Edit:

To read from a file just change the for loop to:

with open('f.txt', 'r') as f:
    for line in f:
        if line[0:1].isdigit(): # I'm assuming that all the lines that start with a number are the 'right' lines
            n, value = line.split()
            data[int(n)][0] += 1
            data[int(n)][1] += int(value)
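On Python 3.7+ the defaultdict preserves insertion order, which for this input already comes out as 1 through 14; to make the ordering explicit regardless of the input, you can sort the keys when printing:

for n in sorted(data):            # iterate the indices in numeric order
    occurrences, total = data[n]
    print(n, total / occurrences)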

A gimmick with bash, because the question originally had a bash tag.

#!/bin/bash

div ()  # Arguments: dividend and divisor
{
  if [ $2 -eq 0 ]; then echo division by 0; exit; fi
  local p=15                            # precision
  local c=${c:-0}                       # precision counter
  local d=.                             # decimal separator
  local r=$(($1/$2)); echo -n $r        # result of division
  local m=$(($r*$2))
  [ $c -eq 0 ] && [ $m -ne $1 ] && echo -n $d
  [ $1 -eq $m ] || [ $c -eq $p ] && echo && return
  local e=$(($1-$m))
  c=$(($c+1))
  div $(($e*10)) $2
}

while read -r num val; do
  if [[ $num =~ ^[0-9] ]]; then # only process lines that start with a digit
    a[$num]=$((a[$num]+$val))   # accumulate the sum for this index
    ((v[$num]++))               # count how many values this index has
  fi
done < file

for((i=1; i<=${#a[@]}; i++)); do # the indices are contiguous from 1 here
  div ${a[$i]} ${v[$i]}
done

I used the div function from there.

Output:

22.75
24
22.5
21.5
21.75
21
19
24
18.666666666666666
22.666666666666666
23
22
23
16
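The recursive div function is only needed because bash arithmetic is integer-only; if an external tool is acceptable, something like echo "scale=2; ${a[$i]} / ${v[$i]}" | bc inside the loop would be a simpler alternative.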

GNU datamash is a very handy tool for doing stats on groups of columnar data in scripts and one-liners. The catches here are that you have to remove the non-data lines first and sort the input numerically to get output in the desired order.

$ sed '/^$/d; /Cluster/d' input.txt | sort -k1,1n | datamash -Wg1 mean 2
1       22.75
2       24
3       22.5
4       21.5
5       21.75
6       21
7       19
8       24
9       18.666666666667
10      22.666666666667
11      23
12      22
13      23
14      16
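Here -W makes datamash treat runs of whitespace as field separators, and -g1 mean 2 groups the rows on the first column and takes the mean of the second; the sed step just deletes the blank lines and the 'Cluster analysis' markers before grouping.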

Here's a simple Python implementation. Note that it computes one average per block of data, rather than one per first-column index.

# I'll store the output in a list of dicts
def get_averages(filename):
  results = []
  temp_dict = {'count':0, 'total':0, 'avg':0} # unnecessary; illustrate format
  for line in open(filename):
    words = line.split() # split on any whitespace; handles multiple spaces
    if not words: # skip blank lines
      continue
    try: # if first word is a number
      if (int(words[0]) == 1): # interpret as start of new data list
        temp_dict = {'count':0, 'total':0, 'avg':0} # reset dict
      temp_dict['count'] += 1
      temp_dict['total'] += float(words[1])
    except ValueError:
      if (words[0] == 'Cluster'): # interpret as end of list
        temp_dict['avg'] = temp_dict['total']/temp_dict['count']
        results.append(temp_dict)
      temp_dict = None # just to be safe
  return results

And then to print the per-block averages:

results = get_averages(filename)
for i in range(len(results)):
  print('{} {}'.format(i, results[i]['avg']))
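For completeness, here is a sketch of a per-index variant along the same lines, which reproduces the output asked for in the question; the function name get_index_averages is mine, not from the original answer:

from collections import defaultdict

def get_index_averages(filename):
    totals = defaultdict(float)   # index -> running sum of the second column
    counts = defaultdict(int)     # index -> how many values were seen
    for line in open(filename):
        words = line.split()
        # only lines of the form "<number> <number>" are data lines
        if len(words) == 2 and words[0].isdigit() and words[1].isdigit():
            totals[int(words[0])] += float(words[1])
            counts[int(words[0])] += 1
    return {n: totals[n] / counts[n] for n in sorted(totals)}

for n, avg in get_index_averages('f.txt').items():
    print(n, avg)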
