Convert flat file to a different format using shell or python
I have a file in the following format:
User: user1
Count:3
Sum:80
departmentId: dept1
Amount by departmentId: 20
departmentId: dept1
Amount by departmentId: 35
departmentId: dept2
Amount by departmentId: 25
User: user2
Count:3
Sum:7.199999999999999
departmentId: dept1
Amount by departmentId: 2.4
departmentId: dept2
Amount by departmentId: 2.4
departmentId: dept3
Amount by departmentId: 2.4
User: user3
Count:1
Sum:0.2
departmentId: dept2
Amount by departmentId: 0.2
User: user4
Count:2
Sum:2
departmentId: dept3
Amount by departmentId: 1
departmentId: dept3
Amount by departmentId: 1
The file basically lists the dues each user owes to a department. If the same user owes dues to a department multiple times, those entries need to be merged into one row. The output file needs to be in the format below.
EDIT: user1 has 2 dues for dept1 and 1 due for dept2. So in the output file the 2 dues for dept1 need to be merged into 1, and the count on the first line will be 2, as the count needs to be per user per department.
Format:
count total_sum
userId+deptId sum for that dept
Example:
2 80
user1dept1 55
user1dept2 25
3 7.2
user2dept1 2.4
user2dept2 2.4
user2dept3 2.4
1 0.2
user3dept2 0.2
1 2
user4dept3 2
Please advise on which scripting language to use, bash or Python, and how to loop through the input file. Thanks
Don't use a shell loop to manipulate text (see https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons why). Awk is the tool that the people who invented the shell also invented for the shell to call to manipulate text, so just use that.
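$ cat tst.awk
BEGIN { FS=": *" }                  # split on the colon and any spaces after it
{
    gsub(/^ +| +$/,"")              # trim leading/trailing spaces
    f[$1] = $2                      # remember the latest value seen for each label
}
/Amount/ {
    dept = f["departmentId"]        # department named on the preceding line
    subTot[dept] += $2              # per-department subtotal for the current user
    tot += $2                       # running total for the current user
}
$1 == "User" {
    if (NR>1) {
        prt()                       # a new user starts, so print the previous one
    }
    user = $2
}
END { prt() }                       # print the last user
function prt() {
    print length(subTot), tot
    for (dept in subTot) {
        print user dept, subTot[dept]
    }
    delete subTot                   # reset state for the next user
    tot = 0
}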
$ awk -f tst.awk file
2 80
user1dept1 55
user1dept2 25
3 7.2
user2dept1 2.4
user2dept2 2.4
user2dept3 2.4
1 0.2
user3dept2 0.2
1 2
user4dept3 2
The above assumes you have an awk where length(array) gives you the number of elements in an array. If you don't, then just count every time you see a new dept for the current user (e.g. by using if (!(dept in subTot)) numDepts++ just before you populate subTot[dept]), print that value instead, and reset it to 0 in prt() along with tot.
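Since the question also asks about Python, here is a minimal sketch of the same grouping logic, assuming the input always looks like the sample (a "User" line starts each block and every "Amount" line follows its "departmentId" line). The names summarize and flush are made up for this illustration, and the :g format is only there to mimic awk's default number output.

```python
def summarize(lines):
    """Return the output lines: 'count total', then 'user+dept subtotal' rows."""
    out = []
    user = None        # current user, None until the first "User" line
    dept = None        # department named on the most recent "departmentId" line
    sub_totals = {}    # per-department subtotals for the current user

    def flush():
        # Emit the block for the current user, if there is one.
        if user is None:
            return
        total = sum(sub_totals.values())
        out.append(f"{len(sub_totals)} {total:g}")
        for d, amount in sub_totals.items():
            out.append(f"{user}{d} {amount:g}")

    for line in lines:
        key, sep, value = line.partition(":")
        if not sep:
            continue                      # skip lines without a label
        key, value = key.strip(), value.strip()
        if key == "User":
            flush()                       # finish the previous user first
            user, sub_totals = value, {}
        elif key == "departmentId":
            dept = value
        elif key.startswith("Amount"):
            sub_totals[dept] = sub_totals.get(dept, 0.0) + float(value)
    flush()                               # emit the last user
    return out


demo = ["User: user3", "Count:1", "Sum:0.2",
        "departmentId: dept2", "Amount by departmentId: 0.2"]
print("\n".join(summarize(demo)))
# 1 0.2
# user3dept2 0.2
```

To run it on a real file you could pass the open file object straight in, e.g. summarize(open("file")). Unlike awk's for (dept in subTot), whose iteration order is unspecified, a Python dict keeps departments in the order they were first seen.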