How to change the date format of a column in a csv.bz2 file and update the file in Unix

I have a bz2-compressed CSV file with millions of records. The first row contains the file header. There are several date columns, but I only need to change two of them: reformat the date from yyyy/mm/dd hh24:mm to dd/mm/yy hh24:mm and update the csv.bz2 file. I have been researching how to do this with awk. With this command:

bzcat OBACountryIPTV_CALICUX_Report_Session_Local_Timezone_2021-07-02_000000.csv.bz2 |
head -1 | sed 's/;/\n/g' | nl

I managed to find the positions of the columns that I need to reformat:

  1  playtime
  2  traffic
  .
  .
 30  end_time
 31  start_time
 .
 .
122  playback_id

A sample row looks like this:

0.0;0.0;;0.0;0;;0.0;0;0.0;0.0;7.0;-1.0;0.0;0.0;0;0;0;0;0;0;0;0;0.0;0.0;0.0;;;0.0;0.0;2021/05/11 17:52;2021/05/11 17:52;0.0;Argentina;;;;LATAM;Unknown;WIFI;Undefined;;Undefined;Streaming;TELEFE;N/A - Multicast;;LIVE;;STB;;;;;;;;Not reported;;;29_2255508521_1620777131852;{"channel_id": 3346, "channel_name": "TELEFE", "channel_number": null, "delaytime": null, "dist_id": null, "player_session_name": null, "playtime": null, "program_id": 500508915, "resolution": "HD", "device_version": "2020.06.26.00.08.27", "ip": "10.241.164.84", "ob": "29", "device_type": null, "user_id": "17059165", "device_id": "2255508521", "operation": "LIV/channel-change", "transaction_id": "476707713_2255508521_29", "status": null, "timestamp": "2021-05-11T20:52:11.852-03:00", "bit_rate": "7Mbps", "qoe_sessionId": null, "type_event": null};;;;;N/A;17059165;;10.241.164.84;IPv4;29;HD;;;;;;;;;WIFI;DTV;IPTV;2255508521;0;4D535443;BA-ICS42-GPON;CEICS02;SCSJS01;RCSJS01;;;;;;0;0;0;0;0;0;0;0;3.0.11;1280x720@60p;N/A;2020.06.26.00.08.27;5263162195;LiveTV;N/A;;;1;ALL;VIP4242W;239.130.1.0:22220;0;0;;;EBVS;;

In this case those are fields 30 and 31, but I can't find how to update that pair of fields within the file. Any suggestions?

You can't edit a bz2-compressed file in place. The only way I can think of to achieve what you're asking is to decompress it, rewrite the fields, and recompress the result into a new file:

bzcat OBACountryIPTV_CALICUX_Report_Session_Local_Timezone_2021-07-02_000000.csv.bz2 | awk 'BEGIN{FS=OFS=";"}{$30=gensub(/^..(..)\/(..)\/(..)(.*)$/, "\\3/\\2/\\1\\4","1",$30);$31=gensub(/^..(..)\/(..)\/(..)(.*)$/, "\\3/\\2/\\1\\4","1",$31);print }' | bzip2 > OBACountryIPTV_CALICUX_Report_Session_Local_Timezone_2021-07-02_000000.csv.bz2.tmp
mv OBACountryIPTV_CALICUX_Report_Session_Local_Timezone_2021-07-02_000000.csv.bz2.tmp OBACountryIPTV_CALICUX_Report_Session_Local_Timezone_2021-07-02_000000.csv.bz2  # only once the size looks right and you have verified the content changed as expected

I'm certain there are easier or more efficient ways of doing the rearranging with awk. Also note that gensub is a GNU awk extension and is not POSIX-compliant.
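If GNU awk is not available, the same rearrangement can probably be done with plain substr calls. The following is only a sketch under a few assumptions: reformat_dates and fix are hypothetical names, the dates always have the fixed-width yyyy/mm/dd layout shown in the question, and no field contains a quoted ";".

```shell
# Hypothetical helper: reads ';'-separated rows on stdin and rewrites
# fields 30 and 31 from yyyy/mm/dd hh:mm to dd/mm/yy hh:mm (POSIX awk only).
reformat_dates() {
  awk '
    BEGIN { FS = OFS = ";" }
    # Return dd/mm/yy plus the untouched remainder (the time), or the input
    # unchanged when it is not in yyyy/mm/dd form (header row, empty field).
    function fix(d) {
      if (d !~ /^[0-9][0-9][0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]/) return d
      return substr(d, 9, 2) "/" substr(d, 6, 2) "/" substr(d, 3, 2) substr(d, 11)
    }
    # Only touch rows that actually have both columns.
    NF >= 31 { $30 = fix($30); $31 = fix($31) }
    { print }
  '
}
```

It would slot into the pipeline in place of the awk command, i.e. between bzcat and bzip2.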

Explanation of the awk part:

BEGIN{FS=OFS=";"} - tells awk that ; is both the input and the output field separator.

$30=gensub(/^..(..)\/(..)\/(..)(.*)$/, "\\3/\\2/\\1\\4","1",$30) - this rewrites the 30th field (the first of the two fields in the unwanted date format). ^.. matches the century part of the year, the first (..) captures the two-digit year, and the following two groups capture the month and the day, respectively. (.*) then captures the rest of the field, which is the time. "\\3/\\2/\\1\\4" reconstructs the date and time from the captured groups (they are numbered in the order they appear, so \\1 holds the last two digits of the year, \\2 the month, \\3 the day and \\4 the time), and the result is assigned back to the 30th field. The "1" argument tells gensub to replace only the first match. A field that does not match the pattern, such as the header name, is returned unchanged.

We then just repeat the same process for the 31st field.
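Before running the mv step, it may be worth checking that the rewritten stream really has the new format in both columns. A minimal sketch, assuming the header is the first row and that empty date fields are acceptable; count_bad_dates and ok are hypothetical names:

```shell
# Hypothetical helper: counts data rows (header skipped) whose field 30 or 31
# is non-empty and not in dd/mm/yy hh:mm form. Prints 0 when every row passes.
# Intended use: bzcat <rewritten file>.csv.bz2.tmp | count_bad_dates
count_bad_dates() {
  awk '
    BEGIN { FS = ";"; bad = 0 }
    # A field is acceptable when it is empty or matches dd/mm/yy hh:mm exactly.
    function ok(d) {
      return d == "" || d ~ /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9] [0-9][0-9]:[0-9][0-9]$/
    }
    NR > 1 && !(ok($30) && ok($31)) { bad++ }
    END { print bad }
  '
}
```

A non-zero count would suggest that some rows were not converted, in which case the original file should be kept until the cause is understood.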
