
sed/awk processing question for heavy 3GB csv database

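I've been given the responsibility of managing some old LTO tape databases, and figured this would be a good opportunity to build a library of functions while learning some bash scripting and text processing. The csv databases are about 30 million lines and roughly 3GB each. I've become fairly efficient at locating lines with grep and regex, but now I want to reformat the whole csv file with sed/awk so it is quicker to work with. This is turning out to be harder than I expected, and I'm hoping some experts can point me in the right direction. The csv database is formatted like this: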

<START OF FILE>
AE19T1JA47 -

File Name,Directory Name,Size of File,Time Last Modified

Trash,,0,2013-12-20 13:38:04
RAW FOOTAGE,,0,2013-12-20 13:39:00
DAEDALUS - ARCHIVE - 122013,,0,2013-12-20 13:40:00
STAR_HAFFLEN_PORTER_ROBINSON,DAED3 - ARCHIVE - 122013,0,2013-12-20 13:40:00
STAR_JAPAN_SETTING_SUN_092413,DAED3 - ARCHIVE - 122013,0,2013-12-20 13:40:00
STAR_YTMA_090713,DAED3 - ARCHIVE - 122013,0,2013-12-20 13:40:00
Audio,DAED3 - ARCHIVE - 122013/STAR_BILLYB_PORTER_ROBINSON,0,2013-09-03 11:21:00
Footage,DAED3 - ARCHIVE - 122013/STAR_BILLYB_PORTER_ROBINSON,0,2013-12-20 13:40:00
FWN_ASPERA_TEST_FTG,RAW FOOTAGE,0,2013-12-20 13:40:00
LANA_BRISK_REWSTO_WEEKEND_CASH_121813_RAW,RAW FOOTAGE,0,2013-12-20 13:40:00
LANA_STAR_WORLD_TURNT_LOST_WORLDS_121713_RAW,RAW FOOTAGE,0,2013-12-20 13:40:00
CZECH_PILOTS_ARCHIVAL,RAW FOOTAGE,0,2013-12-20 13:40:00
STAR_CAND_ELVY_121713_RAW,RAW FOOTAGE,0,2013-12-20 13:40:00
STAR_NEWS_PROMOS_PETE_122013_RAW,RAW FOOTAGE,0,2013-12-20 13:40:00
STAR_PODCAST_STEVE_Q_NG_121913_RAW,RAW FOOTAGE,0,2013-12-20 13:40:00
A242_C035_0101MR.RDC,RAW FOOTAGE/FWN_ASPERA_TEST_FTG,0,2013-12-20 13:40:00
md5,RAW FOOTAGE/FWN_FTP_TEST_FTG/A242_C035_0101MR.RDC,0,2013-08-30 08:19:00
MVI_9292.THM,RAW FOOTAGE/STAR_CRANK_ELVY_PROMO_121613_RAW/STAR_CRANK_ELVY_BONES_PROMO_121613_A_01/DCIM/100EOS7D,18687,2013-12-13 17:16:00
._MVI_9293.MOV,RAW FOOTAGE/STAR_CRANK_ELVY_PROMO_121613_RAW/STAR_CRANK_ELVY_BONES_PROMO_121613_A_01/DCIM/100EOS7D,4096,2013-12-20 14:43:00
MVI_9286.THM,RAW FOOTAGE/STAR_CRANK_ELVY_PROMO_121613_RAW/STAR_CRANK_ELVY_BONES_PROMO_121613_A_01/DCIM/100EOS7D,11570,2013-12-13 17:06:00
._MVI_9294.THM,RAW FOOTAGE/STAR_CRANK_ELVY_PROMO_121613_RAW/STAR_CRANK_ELVY_BONES_PROMO_121613_A_01/DCIM/100EOS7D,4096,2013-12-20 14:43:00
MVI_9286.MOV,RAW FOOTAGE/STAR_CRANK_ELVY_PROMO_121613_RAW/STAR_CRANK_ELVY_BONES_PROMO_121613_A_01/DCIM/100EOS7D,387269573,2013-12-13 17:06:00
._.DS_Store,,4096,2013-12-21 16:01:00
.DS_Store,,6148,2013-12-21 16:01:00


AE19T1ML3W -

File Name,Directory Name,Size of File,Time Last Modified

Trash,,0,2013-12-21 16:21:39
DRIVE BACKUPS,,0,2013-12-21 16:27:00
STAR_00112_500GB_BOMBU_REELS,DRIVE BACKUPS,0,2013-12-21 16:27:00
STAR_LANACannesGabby_00106,DRIVE BACKUPS,0,2013-12-21 16:26:00
STAR_01113_1TB_southy_Freeski,DRIVE BACKUPS,0,2013-12-21 16:27:00
STAR 1 TB 31,DRIVE BACKUPS,0,2013-12-21 16:27:00
Media,DRIVE BACKUPS/STAR_00112_500GB_BOMBU_REELS,0,2013-12-21 16:27:00
V_BOMBU_ALLVERSIONS_20131121,DRIVE BACKUPS/STAR_00112_500GB_BOMBU_REELS/Media,0,2013-12-21 16:27:00
tabsz_LOREAL_DELIVERY_082213,DRIVE BACKUPS/STAR_LANACannesGabby_00106,0,2013-12-21 16:27:00
LANA_SIZZLE_REEL_082213,DRIVE BACKUPS/STAR_LANACannesGabby_00106,0,2013-12-21 16:27:00
43_STAR SWSW,DRIVE BACKUPS/STAR_LANACannesGabby_00106,0,2013-12-21 16:27:00
2013-03-16.bbr,DRIVE BACKUPS/STAR_LANACannesGabby_00106,0,2013-12-21 16:27:00
ADDITIONAL_tabsz_FILES,DRIVE BACKUPS/STAR_LANACannesGabby_00106,0,2013-12-21 16:27:00
Autosave Vault,DRIVE BACKUPS/STAR_LANACannesGabby_00106,0,2013-12-21 16:27:00
WADU_SATURDAY,DRIVE BACKUPS/STAR_LANACannesGabby_00106,0,2013-12-21 16:27:00


<END OF FILE>

Broken down structurally, each csv database looks like this:

<START OF FILE>
<LTO TAPE NAME><SPACE><DASH>
<NEWLINE>
<TOC LEGEND>
<NEWLINE>
<CONTENTS OF TAPE ABOVE>
<NEWLINE>
<NEWLINE>
<NAME OF NEXT LTO TAPE><SPACE><DASH>
<NEWLINE>
<TOC LEGEND>
<NEWLINE>
<CONTENTS OF TAPE ABOVE>
<NEWLINE>
<NEWLINE>
<END OF FILE>

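I want to condense the whole database by taking each LTO tape name and appending it, separated by a comma, to the end of every content line for that tape, so I can more easily see which tape each file is on. Essentially, I want to take the structure above and reformat it as: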

<START OF FILE>
<TOC LEGEND>
<CONTENTS OF TAPE>,<RESPECTIVE TAPE NAME>
<CONTENTS OF TAPE...>,<RESPECTIVE TAPE NAME>
...
<END OF FILE>
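
awk -F, '
  # Tape header line such as "AE19T1JA47 -": remember the tape name.
  /^[A-Z0-9]* -$/ { split($1, name, " "); next }
  # Content row (4 fields, not the legend): append the tape name.
  NF == 4 && $4 != "Time Last Modified" { print $0 "," name[1] }
' tape.txt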

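If I understand correctly what you are trying to do, this may work. It uses a regular expression to find the tape name. When a line matches, it is split on the space to get just the name. It then looks for any line that has 4 fields and whose 4th field is not "Time Last Modified", and prints that line with the first element of the name array appended.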
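
The command above drops the legend line entirely, while the requested layout keeps it once at the top. A minimal sketch of a variant that does so is shown below. The file names sample_tapes.csv and tapes_flat.csv, the added "Tape Name" column header, and the stripping of a trailing carriage return (in case the export has Windows line endings) are assumptions for illustration, not part of the original question. It also assumes no file or directory name contains a comma, since such a row would not have exactly 4 fields.

```shell
# Hypothetical sample input in the layout described above (two short tapes).
cat > sample_tapes.csv <<'EOF'
AE19T1JA47 -

File Name,Directory Name,Size of File,Time Last Modified

Trash,,0,2013-12-20 13:38:04
.DS_Store,,6148,2013-12-21 16:01:00


AE19T1ML3W -

File Name,Directory Name,Size of File,Time Last Modified

Trash,,0,2013-12-21 16:21:39
STAR 1 TB 31,DRIVE BACKUPS,0,2013-12-21 16:27:00


EOF

awk -F, '
  { sub(/\r$/, "") }                            # drop a trailing CR, if any
  /^[A-Z0-9]+ -$/ {                             # tape header line
    tape = $0; sub(/ -$/, "", tape); next       # keep the name without " -"
  }
  $4 == "Time Last Modified" {                  # legend line
    if (!legend_done++) print $0 ",Tape Name"   # emit it once, with a new column
    next
  }
  NF == 4 { print $0 "," tape }                 # content row: append tape name
' sample_tapes.csv > tapes_flat.csv

cat tapes_flat.csv
```

Because awk reads the input one line at a time and only keeps the current tape name in memory, the same approach should scale to a 3GB file. Once the file is flattened, a lookup such as awk -F, '$1 == "Trash" { print $NF }' tapes_flat.csv lists the tapes that hold a given file name.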
