
Bash: Reading CSV text file and finding average of rows

This is the sample input (the data has user IDs and the number of hours spent by each user):

Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0

I need to read the data, find all user IDs ending in an even number (2, 4, 6, 8, ...), and find the average number of hours spent (over five days).

I wrote the following script:

hoursarray=(0,0,0,0,0)
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
    if [[ $col2 == *"2" ]]; then
        #echo "$col2"
        ((hoursarray[0] = col3 + col4 + col5 + col6 + col7))
    elif  [[ $col2 == *"4" ]]; then 
        #echo "$col2"
        ((hoursarray[1] = hoursarray[1] + col3 + col4 + col5 + col6 + col7))
    elif [[ $col2 == *"6" ]]; then
        #echo "$col2"
        ((hoursarray[2] = hoursarray[2] + col3 + col4 + col5 + col6 + col7))
    elif [[ $col2 == *"8" ]]; then
        #echo "$col2"
        ((hoursarray[3] = hoursarray[3] + col3 + col4 + col5 + col6 + col7))
    elif [[ $col2 == *"10" ]]; then
        #echo "$col2"
        ((hoursarray[4] = hoursarray[4] + col3 + col4 + col5 + col6 + col7))
    fi
done < <(tail -n+2 user-list.txt)
echo ${hoursarray[0]}
echo "$((hoursarray[0]/5))"

This is not a very good way of doing this. Also, the numbers aren't adding up correctly.

I am getting the following output (for the first one, User2):

27
5

I am expecting the following output:

27
5.4

What would be a better way to do it? Any help would be appreciated.

TIA

Your issue is echo "$((hoursarray[0]/5))". Bash has no floating-point arithmetic, so the division returns only the integer portion.

Easy to demonstrate:

$ hours=27
$ echo "$((hours/5))"
5

If you want to stick to Bash, you could use bc for the floating point result:

$ echo "$hours / 5.0" | bc -l
5.40000000000000000000
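
If calling an external tool is not an option, the truncation can also be worked around with scaled integer arithmetic. The following is only a sketch, assuming non-negative integer totals; the helper name avg_one_decimal is made up for illustration and is not part of the original script:

```shell
# Hypothetical helper: average of an integer total over a number of days,
# printed with one decimal place using only shell integer arithmetic.
avg_one_decimal() {
    local total=$1 days=$2
    # Scale by 10 before dividing so the tenths digit survives integer division;
    # adding days/2 rounds to the nearest tenth instead of truncating.
    local scaled=$(( (total * 10 + days / 2) / days ))
    printf '%d.%d\n' "$(( scaled / 10 ))" "$(( scaled % 10 ))"
}

avg_one_decimal 27 5   # prints 5.4
```

This keeps everything inside the shell, at the cost of a fixed number of decimals and no support for fractional inputs.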

Or use awk, perl, python, ruby, etc.

Here is an awk you can parse out. It is easily modified to your use (which is a little unclear to me):

awk -F, 'FNR==1{print $2; next} 
     {arr[$2]+=($3+$4+$5+$6+$7) }   
     END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file 

Prints:

User ID
User1       27  5.4
User2       27  5.4
User3       22  4.4
User4       20  4
User5       40  8

If you only want even users, filter for user IDs that end in any of 0, 2, 4, 6, 8:

awk -F, 'FNR==1{print $2; next} 
         $2~/[24680]$/ {arr[$2]+=($3+$4+$5+$6+$7) } 
         END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file

Prints:

User ID
User2       27  5.4
User4       20  4
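
The filter above hard-codes the divisor 5 and leaves the output order to awk, which does not guarantee any order for "for (e in arr)". As a rough sketch only, the variant below derives the number of day columns from NF and pipes through sort; the function name even_user_averages and the temporary sample file are assumptions made for this illustration:

```shell
# Hypothetical wrapper: totals and averages for user IDs ending in an even digit.
even_user_averages() {
    # $1: path to the CSV file
    awk -F, '
        FNR == 1 { next }                     # skip the header row
        $2 ~ /[02468]$/ {                     # user ID ends in an even digit
            for (i = 3; i <= NF; i++) total[$2] += $i
            days = NF - 2                     # every field after the two ID columns is a day
        }
        END {
            for (user in total) print user, total[user], total[user] / days
        }
    ' "$1" | LC_ALL=C sort                    # awk array order is unspecified, so sort it
}

# Demo on a temporary copy of the sample input from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
EOF
even_user_averages "$sample"
rm -f "$sample"
```

With the sample input this should list User2 (27, 5.4) followed by User4 (20, 4). Deriving the divisor from NF assumes every data row has the same number of day columns.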

Your description is fairly imprecise, but here's an attempt primarily based on the sample output:

awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};print a;printf "%.2g\n",a/5; a=0}' file 
20
4
27
5.4

$2~/[24680]$/ makes sure we only look at "even" user IDs.

for(i=3;i<=7;i++){} iterates over the day columns and adds them.

Edit 1: Accommodating new requirement:

awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' saad 
User4   4
User2   5.4
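
One caveat about the %.2g format used above: it keeps two significant digits, not two decimal places, so an average of 10 or more loses its fractional part. The sketch below compares it with %.1f; the helper fmt_avg and the value 10.4 are made up purely for illustration:

```shell
# Hypothetical helper: print a number using a given awk printf format.
fmt_avg() {
    # $1: printf format, $2: number to format
    awk -v fmt="$1" -v x="$2" 'BEGIN { printf fmt "\n", x }'
}

fmt_avg '%.2g' 5.4    # prints 5.4  (two significant digits are enough here)
fmt_avg '%.2g' 10.4   # prints 10   (the fractional part is dropped)
fmt_avg '%.1f' 10.4   # prints 10.4 (one digit after the decimal point)
```

For the sample data the two formats agree, since no average reaches 10; a fixed-point format such as %.1f may be the safer choice if larger values can occur.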

Sample data showing userIDs with even and odd endings, a userID showing up more than once (e.g., User2), and some non-integer values:

$ cat user-list.txt
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer5,User120,9,8,10,0,0
Computer5,User2,4,7,12,3.5,1.5

One awk solution to find total hours plus averages, across 5 days, with duplicate userIDs rolled into a single set of numbers, but limited to userIDs that end in an even number:

$ awk -F',' 'FNR==1 { next } $2 ~ /[02468]$/ { tot[$2]+=($3+$4+$5+$6+$7) } END { for ( i in tot ) { print i, tot[i], tot[i]/5 } }' user-list.txt

Where:

  • -F',' - use comma as the input field delimiter
  • FNR==1 { next } - skip the first line
  • $2 ~ /[02468]$/ - match if field 2 ends in an even digit
  • tot[$2]+=($3+$4+$5+$6+$7) - add the current line's hours to an array where the userID is the array index; this adds up hours from multiple input lines (for the same userID) into a single array cell
  • for (...) { print ...} - loop through the array indices, printing the index, total hours and average hours (total divided by 5)

The above generates:

User120 27 5.4
User2 55 11
User4 20 4

Depending on the OP's desired output, print can be replaced with printf and the desired format string ...
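
As one possible illustration of that replacement, here is the same awk program with print swapped for printf. The tab-separated format string, the sort step and the wrapper name even_user_report are assumptions chosen for this sketch, not something the OP asked for:

```shell
# Hypothetical wrapper around the one-liner above, using printf instead of print.
even_user_report() {
    # $1: path to the CSV file
    awk -F',' '
        FNR == 1 { next }                                   # skip the header row
        $2 ~ /[02468]$/ { tot[$2] += ($3+$4+$5+$6+$7) }     # sum hours per even userID
        END {
            # userID, total with one decimal, average with two decimals
            for (i in tot) printf "%s\t%.1f\t%.2f\n", i, tot[i], tot[i] / 5
        }
    ' "$1" | LC_ALL=C sort                                  # make the row order deterministic
}
```

Called as even_user_report user-list.txt on the sample data above, this should print User120, User2 and User4 with totals 27.0, 55.0 and 20.0 and averages 5.40, 11.00 and 4.00.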

 Here is your script modified a little bit:
  
 while IFS=, read -r col1 col2 col3 || [[ -n $col1 ]]
 do
       (( $(sed 's/[^[:digit:]]*//' <<<$col2) % 2 )) || ( echo -n "For $col1 $col2 average is: " && echo "($(tr , + <<<$col3))/5" | bc -l )
 done < <(tail -n+2 list.txt)

Prints:

 For Computer3 User4 average is: 4.00000000000000000000
 For Computer5 User2 average is: 5.40000000000000000000
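
A variation on the same loop, offered only as a sketch: a case pattern can replace the sed call for the even-digit test, and awk can do the division so the result is not padded with trailing zeros. The function name even_user_loop is hypothetical, and the file path is passed as an argument rather than hard-coded:

```shell
# Hypothetical variant of the loop above: case for the filter, awk for the average.
even_user_loop() {
    # $1: path to the CSV file; tail skips the header row
    tail -n +2 "$1" | while IFS=, read -r computer user hours || [ -n "$computer" ]; do
        case $user in
            *[02468])   # user ID ends in an even digit
                # hours holds the remaining fields, e.g. "9,8,10,0,0"
                avg=$(printf '%s\n' "$hours" |
                      awk -F, '{ s = 0; for (i = 1; i <= NF; i++) s += $i; print s / NF }')
                printf 'For %s %s average is: %s\n' "$computer" "$user" "$avg"
                ;;
        esac
    done
}
```

On the sample input this should report 4 for User4 and 5.4 for User2. It still starts one awk process per matching row, so a single awk program over the whole file remains the cheaper option for large inputs.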

