Insert a date in a column using awk

I'm trying to reformat a date in one column of a CSV file.

The input is something like: 28 April 1966

And I'd like this output: 1966-04-28

which can be obtained with this command:

date -d "28 April 1966" +%F

So now I thought of combining awk with this command to reformat the entire column, but I can't figure out how.

Edit:

Example of input (the "|" separators are in fact tabs):

1 | 28 April 1966
2 | null
3 | null
4 | 30 June 1987 

Expected output:

1 | 1966-04-28
2 | null
3 | null
4 | 1987-06-30

A simple way is

awk -F '\\| ' -v OFS='| ' '{ cmd = "date -d \"" $3 "\" +%F 2> /dev/null"; cmd | getline $3; close(cmd) } 1' filename

That is:

{
  cmd = "date -d \"" $3 "\" +%F 2> /dev/null"  # build shell command
  cmd | getline $3                             # run, capture output
  close(cmd)                                   # close pipe
}
1                                              # print

This works because date doesn't print anything to its stdout if the date is invalid, so the getline fails and $3 is not changed.
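
The getline behaviour this relies on can be checked in isolation. The sketch below is only an illustration: the helper name getline_demo is made up here, and the commands true and echo stand in for a date call that prints nothing and one that prints a result.

```shell
# getline_demo is a hypothetical helper: it runs the shell command given as
# its first argument from awk, then reports what getline returned and what
# the field holds afterwards.
getline_demo() {
  printf 'old\n' | awk -v cmd="$1" '{
    status = (cmd | getline $1)   # 1 = a line was read, 0 = no output (EOF)
    close(cmd)
    print status, $1
  }'
}

getline_demo 'true'       # command prints nothing: the field stays "old"
getline_demo 'echo new'   # command prints a line: the field becomes "new"
```

The first call should print "0 old" and the second "1 new", which is the same mechanism that leaves "null" untouched in the one-liner.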

Caveats to consider:

  1. For very large files, this will spawn a lot of shells and processes in those shells (one of each per line). This can become a noticeable performance drag.
  2. Be wary of code injection. If the CSV file comes from an untrustworthy source, this approach is difficult to defend against an attacker, and you're probably better off going the long way around, parsing the date manually with gawk's mktime and strftime.
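
One way to narrow the injection risk without parsing dates by hand is to call date only for fields made up of digits, letters and single spaces in the expected "day month year" shape. This is a sketch rather than a complete defence: the function name safe_dates and the pattern are assumptions, and it expects the date in the third tab-separated field, as in the commands above.

```shell
# safe_dates is a hypothetical wrapper around the one-liner above. It reads
# tab-separated lines on stdin and only hands field 3 to date when the field
# looks like "28 April 1966", so quotes, $(...) and backticks never reach
# the shell.
safe_dates() {
  awk -F '\t' -v OFS='\t' '
    $3 ~ /^[0-9][0-9]? [A-Za-z]+ [0-9]+$/ {
      cmd = "date -d \"" $3 "\" +%F 2> /dev/null"
      cmd | getline $3            # $3 is left alone if date prints nothing
      close(cmd)
    }
    1
  '
}

# Neither line matches the pattern, so both are printed unchanged and
# nothing in them is executed.
printf '\t1\tnull\n\t2\t$(echo injected)\n' | safe_dates
```

Fields that do not match are passed through as they are, so malformed dates stay visible in the output instead of being silently dropped.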

EDIT re: comment: To use tabs as delimiters, the command can be changed to

awk -F '\t' -v OFS='\t' '{ cmd = "date -d \"" $3 "\" +%F 2> /dev/null"; cmd | getline $3; close(cmd) } 1' filename

EDIT re: comment 2: If performance is a worry, as it appears to be, spawning processes for every line is not a good approach. In that case, you'll have to do the parsing manually. For example:

BEGIN {
  OFS = FS

  m["January"  ] =  1
  m["February" ] =  2
  m["March"    ] =  3
  m["April"    ] =  4
  m["May"      ] =  5
  m["June"     ] =  6
  m["July"     ] =  7
  m["August"   ] =  8
  m["September"] =  9
  m["October"  ] = 10
  m["November" ] = 11
  m["December" ] = 12
}

$3 !~ /null/ {
  split($3, a, " ")
  $3 = sprintf("%04d-%02d-%02d", a[3], m[a[2]], a[1])
}
1

Put that in a file, say foo.awk, and run awk -F '\\t' -f foo.awk filename.csv.
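
As a quick check, the same logic can be run inline on a few sample lines. This sketch wraps it in a shell function (the name iso_dates is made up for illustration), builds the month table with split instead of twelve assignments, and again assumes the date sits in the third tab-separated field. Leaving fields with an unrecognised month name untouched is an addition that the script above does not have.

```shell
# iso_dates is a hypothetical wrapper: it reads tab-separated lines on stdin
# and rewrites field 3 from "28 April 1966" to "1966-04-28" without
# starting any external process.
iso_dates() {
  awk -F '\t' -v OFS='\t' '
    BEGIN {
      n = split("January February March April May June July August September October November December", name, " ")
      for (i = 1; i <= n; i++) m[name[i]] = i   # month name -> month number
    }
    # Only rewrite fields with exactly three words and a known month name.
    split($3, a, " ") == 3 && (a[2] in m) {
      $3 = sprintf("%04d-%02d-%02d", a[3], m[a[2]], a[1])
    }
    1
  '
}

printf '\t1\t28 April 1966\n\t2\tnull\n\t4\t30 June 1987\n' | iso_dates
```

The "null" line has only one word in the third field, so it passes through unchanged, while the two dated lines should come out as 1966-04-28 and 1987-06-30.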

This should work with your given input:

awk -F'\\|' -vOFS="|" '!/null/{cmd="date -d \""$3"\" +%F";cmd | getline $3;close(cmd)}1' file

Output

| 1 |1966-04-28
| 2 | null
| 3 | null
| 4 |1987-06-30

I would suggest using a language that supports parsing dates, like Perl:

$ cat file
1       28 April 1966
2       null
3       null
4       30 June 1987
$ perl -F'\t' -MTime::Piece -lane 'print "$F[0]\t", 
  $F[1] eq "null" ? $F[1] : Time::Piece->strptime($F[1], "%d %B %Y")->strftime("%F")' file
1       1966-04-28
2       null
3       null
4       1987-06-30

The Time::Piece core module allows you to parse and format dates, using the standard format specifiers of strftime. This solution splits the input on tab characters and reformats the second field unless it is "null".

This approach will be much faster than calling system() or spawning subprocesses, since everything is done natively in Perl.

Here is how you can do this in plain Bash, without calling system or getline from awk:

while IFS=$'\t' read -ra arr; do 
   [[ ${arr[1]} != "null" ]] && arr[1]=$(date -d "${arr[1]}" +%F)
   printf "%s\t%s\n" "${arr[0]}" "${arr[1]}"
done < file

1       1966-04-28
2       null
3       null
4       1987-06-30

The following approach needs only a single date call and is not open to code injection:

This script extracts the dates (using awk) into a temporary file, processes them with one "date" call, and merges the results back (using awk).

Code

awk -F '\t' 'match($3,/null/) { $3 = "0000-01-01" } { print $3 }' input > temp.$$
date --file=temp.$$ +%F > dates.$$
awk -F '\t' -v OFS='\t' 'BEGIN {
                           while ( getline < "'"dates.$$"'" > 0 )
                           {
                              f1_counter++
                              if ($0 == "0000-01-01") {$0 = "null"}
                              date[f1_counter] = $0
                           }
                         }
                         {$3 = date[NR]}
                         1' input

One-liner using bash process substitution (no temporary files):

inputfile=/path/to/input
awk -F '\t' -v OFS='\t' 'BEGIN {while ( getline < "'<(date -f <(awk -F '\t' 'match($3,/null/) { $3 = "0000-01-01" } { print $3 }' "$inputfile") +%F)'" > 0 ){f1_counter++; if ($0 == "0000-01-01") {$0 = "null"}; date[f1_counter] = $0}}{$3 = date[NR]}1' "$inputfile"

Details

Here is how it can be used:

# configuration
input=/path/to/input
temp1=temp.$$
temp2=dates.$$
output=output.$$
# create the sample file (optional)
#printf "\t%s\n" $'1\t28 April 1966' $'2\tnull' $'3\tnull'  $'4\t30 June 1987' > "$input"
# Extract all dates
awk -F '\t' 'match($3,/null/) { $3 = "0000-01-01" } { print $3 }' "$input" > "$temp1"
# transform the dates
date --file="$temp1" +%F > "$temp2"
# merge csv with transformed date
awk -F '\t' -v OFS='\t' 'BEGIN {while ( getline < "'"$temp2"'" > 0 ){f1_counter++; if ($0 == "0000-01-01") {$0 = "null"}; date[f1_counter] = $0}}{$3 = date[NR]}1' "$input" > "$output"
# print the output
cat "$output"
# cleanup
rm "$temp1" "$temp2" "$output"
#rm "$input"

Caveats

  • Uses "0000-01-01" as a temporary placeholder for invalid (null) dates.
  • The code should be faster than methods that call "date" once per line, but it reads the input file twice.
