Insert a date in a column using awk

Question

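I am trying to format the dates in a column of a CSV file.

The input looks like this: 28 April 1966

I want this output: 1966-04-28

which can be obtained with this command: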

date -d "28 April 1966" +%F

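So now I am thinking of combining awk with this command to format the whole column, but I can't find a way to do it.

Edit:

Sample input (the separator "|" is actually a tab):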

1 | 28 April 1966
2 | null
3 | null
4 | 30 June 1987

Expected output:

1 | 1966-04-28
2 | null
3 | null
4 | 1987-06-30

Answer 1

A simple way is

awk -F '\\| ' -v OFS='| ' '{ cmd = "date -d \"" $3 "\" +%F 2> /dev/null"; cmd | getline $3; close(cmd) } 1' filename

That is:

{
  cmd = "date -d \"" $3 "\" +%F 2> /dev/null"  # build shell command
  cmd | getline $3                             # run, capture output
  close(cmd)                                   # close pipe
}
1                                              # print

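This works because date prints nothing to its stdout if the date is invalid, so getline fails and $3 is left unchanged.

Caveats:

For very large files, this will spawn a lot of shells, plus a process inside each of those shells (one per line). This can become a noticeable performance drag.
Be wary of code injection. If the CSV file comes from an untrusted source, this approach is difficult to defend against an attacker, and you are probably better off parsing the dates manually with gawk's mktime and strftime.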
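The injection risk comes from splicing $3 directly into a shell command string. As a minimal sketch of one possible mitigation (not part of the original answer, and assuming GNU date and tab-separated input with the date in the third field), the field can be wrapped in single quotes before the command is built. The names reformat_dates_quoted and shquote are hypothetical.

```shell
# Reads tab-separated rows on stdin and rewrites field 3 as YYYY-MM-DD when date accepts it.
reformat_dates_quoted() {
  awk -F '\t' -v OFS='\t' '
    # Wrap s in single quotes for the shell; each embedded single quote
    # (octal 047) is rewritten as: quote, backslash, quote, quote.
    function shquote(s) {
      gsub(/\047/, "\047\\\047\047", s)
      return "\047" s "\047"
    }
    {
      cmd = "date -d " shquote($3) " +%F 2>/dev/null"
      # getline returns a value above 0 only if date printed a line
      if ((cmd | getline iso) > 0)
        $3 = iso
      close(cmd)
    }
    1'
}
```

Usage would be along the lines of reformat_dates_quoted < filename. This still spawns one shell and one date process per row, so it addresses only the injection caveat, not the performance one.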

Edit re: comment: to use tabs as the delimiter, the command can be changed to

awk -F '\t' -v OFS='\t' '{ cmd = "date -d \"" $3 "\" +%F 2> /dev/null"; cmd | getline $3; close(cmd) } 1' filename

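Edit re: comment 2: If performance is a concern, as it seems to be, then spawning processes for every line is not a good approach. In that case you will have to do the parsing manually. For example: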

BEGIN {
  OFS = FS

  m["January"  ] =  1
  m["February" ] =  2
  m["March"    ] =  3
  m["April"    ] =  4
  m["May"      ] =  5
  m["June"     ] =  6
  m["July"     ] =  7
  m["August"   ] =  8
  m["September"] =  9
  m["October"  ] = 10
  m["November" ] = 11
  m["December" ] = 12
}

$3 !~ /null/ {
  split($3, a, " ")
  $3 = sprintf("%04d-%02d-%02d", a[3], m[a[2]], a[1])
}
1

Put this in a file, say foo.awk, and run awk -F '\\t' -f foo.awk filename.csv.
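The script above rewrites every field that does not contain "null", so a value with an unknown month name would come out as a date with month 00. A slightly more defensive variant, sketched here under the same assumptions (tab-separated input, date in the third field, full English month names), only touches fields that split cleanly into day, month and year. The function name reformat_dates_awk is hypothetical.

```shell
# Reads tab-separated rows on stdin; needs only POSIX awk, no external date command.
reformat_dates_awk() {
  awk -F '\t' -v OFS='\t' '
    BEGIN {
      # Build the month-name lookup table: m["January"] = 1, and so on
      n = split("January February March April May June July August September October November December", names, " ")
      for (i = 1; i <= n; i++)
        m[names[i]] = i
    }
    {
      # Rewrite only when the field is exactly: digits, known month name, digits
      if (split($3, a, " ") == 3 && (a[2] in m) && a[1] ~ /^[0-9]+$/ && a[3] ~ /^[0-9]+$/)
        $3 = sprintf("%04d-%02d-%02d", a[3], m[a[2]], a[1])
    }
    1'
}
```

Rows such as "null" or anything else that does not match are printed unchanged. The day number is not range-checked, so "99 April 1966" would still be reformatted.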

Answer 2

This should work with your given input

awk -F'\\|' -vOFS="|" '!/null/{cmd="date -d \""$3"\" +%F";cmd | getline $3;close(cmd)}1' file

Output

| 1 |1966-04-28
| 2 | null
| 3 | null
| 4 |1987-06-30

Answer 3

I would suggest using a language that supports parsing dates, like perl:

$ cat file
1       28 April 1966
2       null
3       null
4       30 June 1987
$ perl -F'\t' -MTime::Piece -lane 'print "$F[0]\t", 
  $F[1] eq "null" ? $F[1] : Time::Piece->strptime($F[1], "%d %B %Y")->strftime("%F")' file
1       1966-04-28
2       null
3       null
4       1987-06-30

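The Time::Piece core module allows you to parse and format dates using the standard format specifiers of strftime. This solution splits the input on tab characters and reformats the second field if it is not "null".

This approach will be much faster than using system calls or invoking subprocesses, as everything is done in native perl.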

Answer 4

Here is a way to do this in pure BASH, avoiding calls to system or getline from awk:

while IFS=$'\t' read -ra arr; do 
   [[ ${arr[1]} != "null" ]] && arr[1]=$(date -d "${arr[1]}" +%F)
   printf "%s\t%s\n" "${arr[0]}" "${arr[1]}"
done < file

1       1966-04-28
2       null
3       null
4       1987-06-30
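In the loop above, a value that date cannot parse would replace the field with an empty string. A minimal sketch of a variant that keeps the original value in that case, assuming bash, GNU date, and two tab-separated fields as in the sample; the function name reformat_dates_bash is hypothetical:

```shell
# Reads "id<TAB>value" rows on stdin and prints them with the value as YYYY-MM-DD where possible.
reformat_dates_bash() {
  local id value iso
  while IFS=$'\t' read -r id value; do
    # Overwrite the value only if date exits successfully; otherwise keep it as-is
    if [[ $value != "null" ]] && iso=$(date -d "$value" +%F 2>/dev/null); then
      value=$iso
    fi
    printf '%s\t%s\n' "$id" "$value"
  done
}
```

It would be called as reformat_dates_bash < file. It still runs date once per non-null row.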

Answer 5

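With only a single date call and no code injection problem, see the following:

This script extracts the dates (using awk) into a temporary file, processes them with one "date" call, and merges the results back (using awk).

Code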

awk -F '\t' 'match($3,/null/) { $3 = "0000-01-01" } { print $3 }' input > temp.$$
date --file=temp.$$ +%F > dates.$$
awk -F '\t' -v OFS='\t' 'BEGIN {
                           while ( getline < "'"dates.$$"'" > 0 )
                           {
                              f1_counter++
                              if ($0 == "0000-01-01") {$0 = "null"}
                              date[f1_counter] = $0
                           }
                         }
                         {$3 = date[NR]}
                         1' input

One-liner using bash process substitution (no temporary files):

inputfile=/path/to/input
awk -F '\t' -v OFS='\t' 'BEGIN {while ( getline < "'<(date -f <(awk -F '\t' 'match($3,/null/) { $3 = "0000-01-01" } { print $3 }' "$inputfile") +%F)'" > 0 ){f1_counter++; if ($0 == "0000-01-01") {$0 = "null"}; date[f1_counter] = $0}}{$3 = date[NR]}1' "$inputfile"

Details

Usage is as follows:

# configuration
input=/path/to/input
temp1=temp.$$
temp2=dates.$$
output=output.$$
# create the sample file (optional)
#printf "\t%s\n" $'1\t28 April 1966' $'2\tnull' $'3\tnull'  $'4\t30 June 1987' > "$input"
# Extract all dates
awk -F '\t' 'match($3,/null/) { $3 = "0000-01-01" } { print $3 }' "$input" > "$temp1"
# transform the dates
date --file="$temp1" +%F > "$temp2"
# merge csv with transformed date
awk -F '\t' -v OFS='\t' 'BEGIN {while ( getline < "'"$temp2"'" > 0 ){f1_counter++; if ($0 == "0000-01-01") {$0 = "null"}; date[f1_counter] = $0}}{$3 = date[NR]}1' "$input" > "$output"
# print the output
cat "$output"
# cleanup
rm "$temp1" "$temp2" "$output"
#rm "$input"

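Notes

Uses "0000-01-01" as a temporary placeholder for invalid (null) dates.
The code should be many times faster than the other approaches that call "date" once per row, but it reads the input file twice.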
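The same two-pass idea can be wrapped in a function that uses mktemp for its scratch file instead of names built from $$. This is only a sketch under stated assumptions: GNU date (for -f -), tab-separated input with the date in the third field, and a hypothetical function name batch_reformat. It uses "1970-01-01" as the placeholder and restores "null" by checking the original field rather than the placeholder.

```shell
# Usage: batch_reformat INPUT_FILE   (prints the converted rows on stdout)
batch_reformat() {
  local input=$1 tmp
  tmp=$(mktemp) || return 1
  # Pass 1: emit one date per row (placeholder for null) and convert all in one date call.
  # If date rejects any line, stop, since the two files would no longer line up.
  awk -F '\t' '{ print ($3 == "null" ? "1970-01-01" : $3) }' "$input" |
    date -f - +%F > "$tmp" || { rm -f "$tmp"; return 1; }
  # Pass 2: read one converted date per input row; keep null rows untouched.
  awk -F '\t' -v OFS='\t' -v dates="$tmp" '
    { if ((getline d < dates) > 0 && $3 != "null") $3 = d }
    1' "$input"
  rm -f "$tmp"
}
```

Refusing to continue when date reports an invalid line is a design choice of this sketch: an unparseable value produces no output line, so every later row would otherwise receive the wrong date.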

5 answers

Answer 1: 3, accepted, 2015-04-24 10:22:21
Answer 2: 1
Answer 3: 1, 2015-04-24 11:18:55
Answer 4: 0, 2015-04-24 10:45:09
Answer 5: 0, 2015-04-24 12:03:39
