I need help writing an AWK Script for grouping by row and finding min/max/avg of values

I am working in Bash and trying to write an Awk script that takes data from a CSV file, groups the rows, and then gets the min, max, and avg of the values.

Here is the complete CSV file:

Student,Category,Assignment,Score,Possible
Chelsey,Homework,H01,90,100
Chelsey,Homework,H02,89,100
Chelsey,Homework,H03,77,100
Chelsey,Homework,H04,80,100
Chelsey,Homework,H05,82,100
Chelsey,Homework,H06,84,100
Chelsey,Homework,H07,86,100
Chelsey,Lab,L01,91,100
Chelsey,Lab,L02,100,100
Chelsey,Lab,L03,100,100
Chelsey,Lab,L04,100,100
Chelsey,Lab,L05,96,100
Chelsey,Lab,L06,80,100
Chelsey,Lab,L07,81,100
Chelsey,Quiz,Q01,100,100
Chelsey,Quiz,Q02,100,100
Chelsey,Quiz,Q03,98,100
Chelsey,Quiz,Q04,93,100
Chelsey,Quiz,Q05,99,100
Chelsey,Quiz,Q06,88,100
Chelsey,Quiz,Q07,100,100
Chelsey,Final,FINAL,82,100
Chelsey,Survey,WS,5,5
Sam,Homework,H01,19,100
Sam,Homework,H02,82,100
Sam,Homework,H03,95,100
Sam,Homework,H04,46,100
Sam,Homework,H05,82,100
Sam,Homework,H06,97,100
Sam,Homework,H07,52,100
Sam,Lab,L01,41,100
Sam,Lab,L02,85,100
Sam,Lab,L03,99,100
Sam,Lab,L04,99,100
Sam,Lab,L05,0,100
Sam,Lab,L06,0,100
Sam,Lab,L07,0,100
Sam,Quiz,Q01,91,100
Sam,Quiz,Q02,85,100
Sam,Quiz,Q03,33,100
Sam,Quiz,Q04,64,100
Sam,Quiz,Q05,54,100
Sam,Quiz,Q06,95,100
Sam,Quiz,Q07,68,100
Sam,Final,FINAL,58,100
Sam,Survey,WS,5,5
Andrew,Homework,H01,25,100
Andrew,Homework,H02,47,100
Andrew,Homework,H03,85,100
Andrew,Homework,H04,65,100
Andrew,Homework,H05,54,100
Andrew,Homework,H06,58,100
Andrew,Homework,H07,52,100
Andrew,Lab,L01,87,100
Andrew,Lab,L02,45,100
Andrew,Lab,L03,92,100
Andrew,Lab,L04,48,100
Andrew,Lab,L05,42,100
Andrew,Lab,L06,99,100
Andrew,Lab,L07,86,100
Andrew,Quiz,Q01,25,100
Andrew,Quiz,Q02,84,100
Andrew,Quiz,Q03,59,100
Andrew,Quiz,Q04,93,100
Andrew,Quiz,Q05,85,100
Andrew,Quiz,Q06,94,100
Andrew,Quiz,Q07,58,100
Andrew,Final,FINAL,99,100
Andrew,Survey,WS,5,5
Ava,Homework,H01,55,100
Ava,Homework,H02,95,100
Ava,Homework,H03,84,100
Ava,Homework,H04,74,100
Ava,Homework,H05,95,100
Ava,Homework,H06,84,100
Ava,Homework,H07,55,100
Ava,Lab,L01,66,100
Ava,Lab,L02,77,100
Ava,Lab,L03,88,100
Ava,Lab,L04,99,100
Ava,Lab,L05,55,100
Ava,Lab,L06,66,100
Ava,Lab,L07,77,100
Ava,Quiz,Q01,88,100
Ava,Quiz,Q02,99,100
Ava,Quiz,Q03,44,100
Ava,Quiz,Q04,55,100
Ava,Quiz,Q05,66,100
Ava,Quiz,Q06,77,100
Ava,Quiz,Q07,88,100
Ava,Final,FINAL,99,100
Ava,Survey,WS,5,5
Shane,Homework,H01,50,100
Shane,Homework,H02,60,100
Shane,Homework,H03,70,100
Shane,Homework,H04,60,100
Shane,Homework,H05,70,100
Shane,Homework,H06,80,100
Shane,Homework,H07,90,100
Shane,Lab,L01,90,100
Shane,Lab,L02,0,100
Shane,Lab,L03,100,100
Shane,Lab,L04,50,100
Shane,Lab,L05,40,100
Shane,Lab,L06,60,100
Shane,Lab,L07,80,100
Shane,Quiz,Q01,70,100
Shane,Quiz,Q02,90,100
Shane,Quiz,Q03,100,100
Shane,Quiz,Q04,100,100
Shane,Quiz,Q05,80,100
Shane,Quiz,Q06,80,100
Shane,Quiz,Q07,80,100
Shane,Final,FINAL,90,100
Shane,Survey,WS,5,5

Basically, I have 5 student names, and each student has completed a quiz, lab, and homework for each lesson, plus a survey and a final exam.

What I am trying to do is group this by assignment name and generate a report that shows the lowest score achieved for that assignment, the highest score, and the average score.

The output should be:

Name     Low     High  Avg
H02      66       99   74.22
L07      47       88   66.30

and include every individual assignment name from column 3 ($3), formatted using tabs (\t).

The code I have pasted already outputs the headings and the 2 decimal places in the Avg column, but the actual values are not correct.

I have only two issues really:

  1. I cannot for the life of me get the min or max for the individual groupings. I know how to get the min/max and even the basic syntax for it, but how do I apply it to the individual groups?

  2. Scripting this. I have very limited experience using Bash, or anything Linux for that matter, and am unfamiliar with awk (though I am learning quite a bit now).

So, to at least get myself started, I wrote a one-liner to achieve the grouping and the output formatting I am looking for, but it is only summing the scores for each group, and the average is all messed up because I still have not figured out how to get the count of the scores to use as a divisor.

Anyway, this is what I have:

awk -F "," 'BEGIN{printf "Name\tLow\tHigh\tAvg\n"}
            NR>=2{a[$3]+=$4; b[$3]+=$4;c[$3]+=$4/FNR }
            END {for (i in b) printf "%-7s\t%d\t%d\t%.02f\n", i,a[i],b[i],c[i]}'  \
    score-data.csv

The output is perfect in that it is grouped by the assignment names, has 2 decimals in the Avg column, and is tab-separated, but the low and high are not correct and the average, as you can see, is messed up. I tried dividing the sum by FNR, and have also tried both NF and NR; no luck. Again, I know how to get a count, but have no clue how to get it in here.

So, if anyone can help me get the min/max/avg taken care of, and also with the syntax for turning this into a script, it would be appreciated.

I cannot comment for some reason, but I have searched Google, read the awk man page, and have two different tabs open in my browser with docs on awk. None of them address my situation.

As far as the array naming goes, it is all the same array being used: an associative array that uses column 3 as the index/key and values from column 4 as the key's values. All of the suggested searches and links involve columns; I need rows.

Your problem is that your Awk script is not examining the results per key.

Try this instead.

awk -F , 'NR>1 { if(!($3 in course)) { low[$3] = high[$3] = $4 }
        if ($4 < low[$3]) low[$3] = $4;
        if ($4 > high[$3]) high[$3] = $4;
        sum[$3] += $4;
        ++course[$3] }
    END { OFS="\t"; print "Name", "Low", "High", "Avg";
        for (k in course)
          print k, low[k], high[k], sum[k]/course[k] }' file.csv

Result for your sample data:

Name    Low     High    Avg
FINAL   58      99      85.6
L01     41      91      75
L02     0       100     61.4
L03     88      100     95.8
L04     48      100     79.2
L05     0       96      46.6
Q01     25      100     74.8
L06     0       99      61
Q02     84      100     91.6
L07     0       86      64.8
H01     19      90      47.8
WS      5       5       5
Q03     33      100     66.8
H02     47      95      74.6
Q04     55      100     81
H03     70      95      82.2
Q05     54      99      76.8
H04     46      80      65
Q06     77      95      86.8
H05     54      95      76.6
Q07     58      100     78.8
H06     58      97      80.6
H07     52      90      67

Calculating an average by dividing by the line number only works when you want the average for the whole file (and even then, if you are skipping some lines at the start, those should be subtracted from the divisor too).
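To make the divisor point concrete, here is a minimal sketch that prints, for each assignment, the sum divided by a per-key count next to the sum divided by `NR`. The file name `sample.csv`, the variable `avg_report`, and the array names `sum` and `cnt` are assumptions chosen for illustration; the rows are a small subset of the data in the question.

```shell
# Small subset of the question's data (sample.csv is a hypothetical name).
cat > sample.csv <<'EOF'
Student,Category,Assignment,Score,Possible
Chelsey,Homework,H01,90,100
Chelsey,Lab,L01,91,100
Sam,Homework,H01,19,100
Sam,Lab,L01,41,100
Andrew,Homework,H01,25,100
EOF

# Column 2 of the output divides by the per-assignment count (the average),
# column 3 divides by NR (the total number of lines read, header included).
avg_report=$(awk -F, '
    NR > 1 { sum[$3] += $4; cnt[$3]++ }   # one running sum and count per key
    END {
        for (k in sum)
            printf "%s\t%.2f\t%.2f\n", k, sum[k] / cnt[k], sum[k] / NR
    }' sample.csv | sort)
printf '%s\n' "$avg_report"
```

Only the per-key division gives the group average; the `NR` column shrinks as the file grows, which is the kind of distortion described in the question.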

If you want to keep the output in order, you can do something similar to:

awk -F, '
BEGIN { printf "Name\tLow\tHigh\tAvg\n" }
NR > 1 {
    if ($3 in low) {            # if assignment already initialized
        if ($4 < low[$3])       # check new low score
            low[$3] = $4
        if ($4 > hi[$3])        # check new high score
            hi[$3] = $4
        sum[$3] += $4           # add to assignment sum
        grades[$3]++            # add to assignment score count
    }
    else {                      # new assignment name
        name[n++] = $3          # keep indexed array of names (for order)
        low[$3] = $4            # initialize low for assignment
        hi[$3]  = $4            # initialize high for assignment
        sum[$3] = $4            # initialize sum for assignment
        grades[$3] = 1          # initialize score count for assignment
    }
}
END {
    for (i=0; i<n; i++)         # output information in order
        printf "%s\t%d\t%d\t%.2f\n", name[i], low[name[i]], hi[name[i]], sum[name[i]]/grades[name[i]]
}' score-data.csv

The indexed array name above is used to preserve the assignment names in the order seen and then to iterate over the assignments for output in order:

Example Use/Output

Name    Low     High    Avg
H01     19      90      47.80
H02     47      95      74.60
H03     70      95      82.20
H04     46      80      65.00
H05     54      95      76.60
H06     58      97      80.60
H07     52      90      67.00
L01     41      91      75.00
L02     0       100     61.40
L03     88      100     95.80
L04     48      100     79.20
L05     0       96      46.60
L06     0       99      61.00
L07     0       86      64.80
Q01     25      100     74.80
Q02     84      100     91.60
Q03     33      100     66.80
Q04     55      100     81.00
Q05     54      99      76.80
Q06     77      95      86.80
Q07     58      100     78.80
FINAL   58      99      85.60
WS      5       5       5.00
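The question also asks how to turn the one-liner into a script. One possible approach, sketched here under assumptions, is to put the awk program in its own file and run it with `awk -f`. The file names `report.awk` and `sample.csv` and the array name `cnt` are hypothetical, and the shebang path `/usr/bin/awk` may differ on your system.

```shell
# Write the awk program to a file (report.awk is a hypothetical name).
cat > report.awk <<'EOF'
#!/usr/bin/awk -f
# Per-assignment low/high/average, printed in first-seen order.
BEGIN { FS = ","; printf "Name\tLow\tHigh\tAvg\n" }
NR > 1 {
    if (!($3 in cnt)) {              # first score seen for this assignment
        name[n++] = $3               # remember the order of appearance
        low[$3] = hi[$3] = $4
    }
    if ($4 + 0 < low[$3] + 0) low[$3] = $4   # "+ 0" forces a numeric compare
    if ($4 + 0 > hi[$3] + 0)  hi[$3]  = $4
    sum[$3] += $4
    cnt[$3]++
}
END {
    for (i = 0; i < n; i++) {
        k = name[i]
        printf "%s\t%d\t%d\t%.2f\n", k, low[k], hi[k], sum[k] / cnt[k]
    }
}
EOF
chmod +x report.awk

# Four rows taken from the question's data, for a quick check.
cat > sample.csv <<'EOF'
Student,Category,Assignment,Score,Possible
Chelsey,Homework,H01,90,100
Chelsey,Lab,L01,91,100
Sam,Homework,H01,19,100
Sam,Lab,L01,41,100
EOF

# Run through the interpreter explicitly; ./report.awk sample.csv should also
# work if the shebang path matches where awk is installed.
awk -f report.awk sample.csv
```

Keeping the program in a file avoids shell quoting issues and leaves room for comments, and the same file can be pointed at `score-data.csv` unchanged.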

It's not awk, but GNU Datamash is a handy tool designed for just this sort of calculation:

$ datamash -t, --header-in -g3 -s min 4 max 4 mean 4 < grades.csv \
  | awk 'BEGIN { FS=","; OFS="\t"; print "Name\tLow\tHigh\tAvg" } { $1=$1 } 1'
Name    Low     High    Avg
FINAL   58      99      85.6
H01     19      90      47.8
H02     47      95      74.6
H03     70      95      82.2
H04     46      80      65
H05     54      95      76.6
H06     58      97      80.6
H07     52      90      67
L01     41      91      75
L02     0       100     61.4
L03     88      100     95.8
L04     48      100     79.2
L05     0       96      46.6
L06     0       99      61
L07     0       86      64.8
Q01     25      100     74.8
Q02     84      100     91.6
Q03     33      100     66.8
Q04     55      100     81
Q05     54      99      76.8
Q06     77      95      86.8
Q07     58      100     78.8
WS      5       5       5

Okay, so there's an awk bit to print the desired header and convert from CSV to TSV.

This invocation says that comma is the field delimiter ( -t, ), that the input file has a header line, that it should be grouped and sorted on the third column ( -g3 -s ; datamash requires that the groups be sorted), and for each group, the minimum, maximum, and mean values of the fourth column should be calculated.
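If datamash is not available, the same "sort first, then summarize each run of equal keys" idea can be approximated with `sort` and awk. This is only a sketch under assumptions, not what datamash does internally; the file name `sample.csv`, the variable `sorted_report`, and the helper function `emit` are hypothetical names.

```shell
# Four rows from the question's data, deliberately not in key order.
cat > sample.csv <<'EOF'
Student,Category,Assignment,Score,Possible
Chelsey,Quiz,Q01,100,100
Chelsey,Homework,H01,90,100
Sam,Quiz,Q01,91,100
Sam,Homework,H01,19,100
EOF

# Drop the header, sort on field 3, then print one summary line each time the
# key changes. Only the current group is held in memory.
sorted_report=$(
  tail -n +2 sample.csv | sort -t, -k3,3 |
  awk -F, -v OFS='\t' '
    BEGIN { print "Name", "Low", "High", "Avg" }
    function emit() { if (cnt) print key, low, high, sum / cnt }
    $3 != key { emit(); key = $3; low = high = $4 + 0; sum = cnt = 0 }
    { v = $4 + 0
      if (v < low)  low  = v
      if (v > high) high = v
      sum += v; cnt++ }
    END { emit() }'
)
printf '%s\n' "$sorted_report"
```

Because the input is sorted, no per-key arrays are needed, and the report comes out in key order with the header on top.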
