Bash: Reading CSV text file and finding average of rows
This is the sample input (the data has user-IDs and the number of hours spent by the user):
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
I need to read the data, find all User-IDs ending in even numbers (2,4,6,8..) and find the average number of hours spent (over five days).
I wrote the following script:
hoursarray=(0,0,0,0,0)
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
if [[ $col2 == *"2" ]]; then
#echo "$col2"
((hoursarray[0] = col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"4" ]]; then
#echo "$col2"
((hoursarray[1] = hoursarray[1] + col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"6" ]]; then
#echo "$col2"
((hoursarray[2] = hoursarray[2] + col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"8" ]]; then
#echo "$col2"
((hoursarray[3] = hoursarray[3] + col3 + col4 + col5 + col6 + col7))
elif [[ $col2 == *"10" ]]; then
#echo "$col2"
((hoursarray[4] = hoursarray[4] + col3 + col4 + col5 + col6 + col7))
fi
done < <(tail -n+2 user-list.txt)
echo ${hoursarray[0]}
echo "$((hoursarray[0]/5))"
This is not a very good way of doing this. Also, the numbers aren't adding up correctly.
I am getting the following output (for the first one - user2):
27
5
I am expecting the following output:
27
5.4
What would be a better way to do it? Any help would be appreciated.
TIA
Your issue is echo "$((hoursarray[0]/5))": Bash arithmetic is integer-only, so it returns only the integer portion of the division.
Easy to demonstrate:
$ hours=27
$ echo "$((hours/5))"
5
If you want to stick to Bash, you could use bc for the floating point result:
$ echo "$hours / 5.0" | bc -l
5.40000000000000000000
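If bc is not installed, one decimal place can also be had with Bash integer arithmetic alone, by scaling the total before dividing. This is only a minimal sketch, assuming non-negative integer totals; avg_tenths is a hypothetical helper name, not something from the question:

```shell
#!/usr/bin/env bash
# avg_tenths (hypothetical helper): print total/count with one decimal digit,
# using only Bash integer arithmetic. Assumes non-negative integer inputs.
avg_tenths() {
    local total=$1 count=$2
    # Multiply by 10 first so one decimal digit survives the integer division.
    local scaled=$(( total * 10 / count ))
    printf '%d.%d\n' $(( scaled / 10 )) $(( scaled % 10 ))
}

avg_tenths 27 5    # prints 5.4
```

Note that this truncates rather than rounds, and it cannot handle non-integer hours; for those cases bc or awk is the better fit.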
Or use awk, perl, python, ruby, etc.
Here is an awk script you can pick apart. It is easily modified for your use case (which is a little unclear to me):
awk -F, 'FNR==1{print $2; next}
{arr[$2]+=($3+$4+$5+$6+$7) }
END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file
Prints:
User ID
User1 27 5.4
User2 27 5.4
User3 22 4.4
User4 20 4
User5 40 8
If you only want even users, filter for User IDs that end in any of 0,2,4,6,8:
awk -F, 'FNR==1{print $2; next}
$2~/[24680]$/ {arr[$2]+=($3+$4+$5+$6+$7) }
END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file
Prints:
User ID
User2 27 5.4
User4 20 4
Your description is fairly imprecise, but here's an attempt primarily based on the sample output:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};print a;printf "%.2g\n",a/5; a=0}' file
20
4
27
5.4
$2~/[24680]$/ makes sure we only look at "even" user-IDs.
for(i=3;i<=7;i++){} iterates over the day columns and adds them.
Edit 1: Accommodating new requirement:
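The loop above hard-codes columns 3 to 7. As a hedged variation, the upper bound can be taken from awk's built-in NF (number of fields), so the average adapts if day columns are added or removed. The function name even_averages and the temporary file are illustrative assumptions; the data is the question's sample input:

```shell
#!/usr/bin/env bash
# Sample input from the question, written to a temporary file for the demo.
data=$(mktemp)
trap 'rm -f "$data"' EXIT
cat > "$data" <<'EOF'
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
EOF

# even_averages (hypothetical name): per-line average for "even" user IDs.
even_averages() {
    awk -F',' '
        FNR == 1 { next }                        # skip the header row
        $2 ~ /[24680]$/ {
            sum = 0
            for (i = 3; i <= NF; i++) sum += $i  # every column after the user ID
            printf "%s\t%.2g\n", $2, sum / (NF - 2)
        }
    ' "$1"
}

even_averages "$data"
```

On the sample data this prints User4 with 4 and User2 with 5.4, in file order. It assumes every data row has the same number of day columns.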
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' saad
User4 4
User2 5.4
Sample data showing userIDs with even and odd endings, a userID showing up more than once (e.g., User2), and some non-integer values:
$ cat user-list.txt
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer5,User120,9,8,10,0,0
Computer5,User2,4,7,12,3.5,1.5
One awk solution to find total hours plus averages across 5 days, with duplicate userIDs rolled into a single set of numbers, but limited to userIDs that end in an even number:
$ awk -F',' 'FNR==1 { next } $2 ~ /[02468]$/ { tot[$2]+=($3+$4+$5+$6+$7) } END { for ( i in tot ) { print i, tot[i], tot[i]/5 } }' user-list.txt
Where:
-F',' - use comma as input field delimiter
FNR==1 { next } - skip first line
$2 ~ /[02468]$/ - if field 2 ends in an even number
tot[$2]+=($3+$4+$5+$6+$7) - add current line's hours to array where userID is the array index; this will add up hours from multiple input lines (for same userID) into a single array cell
for (...) { print ...} - loop through array indices printing the index, total hours and average hours (total divided by 5)
The above generates:
User120 27 5.4
User2 55 11
User4 20 4
Depending on OP's desired output, the print can be replaced with printf and the desired format string ...
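As one possible illustration of the printf variant, the sketch below pads the userID to a fixed width and prints the total with one decimal and the average with two. The format string, the function name report, and the trailing sort (added only because awk does not guarantee the order of for-in loops) are assumptions for the demo, not requirements from OP:

```shell
#!/usr/bin/env bash
# Sample data from this answer, written to a temporary file for the demo.
data=$(mktemp)
trap 'rm -f "$data"' EXIT
cat > "$data" <<'EOF'
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer5,User120,9,8,10,0,0
Computer5,User2,4,7,12,3.5,1.5
EOF

# report (hypothetical name): totals and averages for "even" user IDs.
report() {
    awk -F',' '
        FNR == 1 { next }
        $2 ~ /[02468]$/ { tot[$2] += ($3+$4+$5+$6+$7) }
        END {
            # Left-justified ID, total with 1 decimal, average with 2 decimals.
            for (i in tot) printf "%-8s %6.1f %6.2f\n", i, tot[i], tot[i]/5
        }
    ' "$1" | sort
}

report "$data"
```

The widths in the format string are arbitrary; adjust them to the longest userID expected.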
Here is your script modified a little bit:
while IFS=, read -r col1 col2 col3 || [[ -n $col1 ]]
do
(( $(sed 's/[^[:digit:]]*//' <<<$col2) % 2 )) || ( echo -n "For $col1 $col2 average is: " && echo "($(tr , + <<<$col3))/5" | bc -l )
done < <(tail -n+2 list.txt)
prints:
For Computer3 User4 average is: 4.00000000000000000000
For Computer5 User2 average is: 5.40000000000000000000