Insert a date in a column using awk
I'm trying to format a date in a column of a CSV file.
The input is something like: 28 April 1966
And I'd like this output: 1966-04-28
which can be obtained with this command:
date -d "28 April 1966" +%F
So now I thought of combining awk with this command to format the entire column, but I can't figure out how.
Edit:
Example of input (the "|" separators are in fact tabs):
1 | 28 April 1966
2 | null
3 | null
4 | 30 June 1987
Expected output:
1 | 1966-04-28
2 | null
3 | null
4 | 1987-06-30
A simple way is:
awk -F '\\| ' -v OFS='| ' '{ cmd = "date -d \"" $3 "\" +%F 2> /dev/null"; cmd | getline $3; close(cmd) } 1' filename
That is:
{
cmd = "date -d \"" $3 "\" +%F 2> /dev/null" # build shell command
cmd | getline $3 # run, capture output
close(cmd) # close pipe
}
1 # print
This works because date doesn't print anything to its stdout if the date is invalid, so the getline fails and $3 is not changed.
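The getline behaviour can be checked without date at all. The following is a minimal sketch, not part of the answer: getline_demo is a hypothetical name, and the commands true and echo stand in for a date call that fails and one that succeeds.

```shell
#!/bin/sh
# Sketch only: getline_demo is a hypothetical name.
# "true" stands in for a date call that prints nothing,
# "echo replaced" for one that prints a result.
getline_demo() {
  awk 'BEGIN {
    v = "unchanged"

    cmd = "true"            # produces no output
    cmd | getline v         # getline returns 0 and v keeps its value
    close(cmd)
    print v

    cmd = "echo replaced"   # produces one line of output
    cmd | getline v         # getline returns 1 and v is overwritten
    close(cmd)
    print v
  }'
}
```

Calling getline_demo prints the untouched value first and the overwritten value second, which is the same mechanism that leaves "null" in place in the one-liner.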
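Caveats to consider: if the CSV file comes from an untrusted source, this approach is hard to defend against an attacker, because each field is pasted into a shell command. In that case you are better off parsing the date manually, for example with gawk's mktime and strftime.
EDIT re: comment: To use tabs as delimiters, the command can be changed to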
awk -F '\t' -v OFS='\t' '{ cmd = "date -d \"" $3 "\" +%F 2> /dev/null"; cmd | getline $3; close(cmd) } 1' filename
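Because the command string is built from $3, a field containing shell metacharacters would be interpreted by the shell. If you keep calling date, one way to reduce that risk is to single-quote the field first. This is a minimal sketch under assumptions, not the answer's method: shquote and convert_dates_quoted are hypothetical names, GNU date is assumed for -d, and the date is assumed to be in the third tab-separated field.

```shell
#!/bin/sh
# Sketch only: shquote() and convert_dates_quoted are hypothetical names.
# Assumes GNU date (for -d) and the date in the third tab-separated field.
convert_dates_quoted() {
  awk -F '\t' -v OFS='\t' '
    # Wrap s in single quotes for the shell; \047 is the single-quote character.
    # An embedded quote becomes: close quote, escaped quote, reopen quote.
    function shquote(s,    n, parts, i, out) {
      n = split(s, parts, "\047")
      out = "\047" parts[1]
      for (i = 2; i <= n; i++)
        out = out "\047\\\047\047" parts[i]
      return out "\047"
    }
    {
      cmd = "date -d " shquote($3) " +%F 2> /dev/null"
      cmd | getline $3   # $3 is left unchanged when date prints nothing
      close(cmd)
    }
    1' "$@"
}
```

It is used like the one-liner, for example convert_dates_quoted filename. A field such as $(echo 2000-01-01) then reaches date as literal text, is rejected as an invalid date, and is printed unchanged. This still spawns a shell and a date process per line.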
EDIT re: comment 2: If performance is a concern, as it appears to be, spawning processes for every line is not a good approach. In that case, you'll have to do the parsing manually. For example:
BEGIN {
OFS = FS
m["January" ] = 1
m["February" ] = 2
m["March" ] = 3
m["April" ] = 4
m["May" ] = 5
m["June" ] = 6
m["July" ] = 7
m["August" ] = 8
m["September"] = 9
m["October" ] = 10
m["November" ] = 11
m["December" ] = 12
}
$3 !~ /null/ {
split($3, a, " ")
$3 = sprintf("%04d-%02d-%02d", a[3], m[a[2]], a[1])
}
1
Put that in a file, say foo.awk, and run awk -F '\\t' -f foo.awk filename.csv.
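The script above rewrites every value that does not contain "null", so an unexpected value would turn into something like 0000-00-00. A possible refinement, sketched here under the assumption that dates always look like "D Month YYYY" with English month names, converts only when the month is found in the table. The name convert_dates_awk is hypothetical.

```shell
#!/bin/sh
# Sketch only: convert_dates_awk is a hypothetical name.
# Assumes the date sits in the third tab-separated field as "D Month YYYY".
convert_dates_awk() {
  awk -F '\t' '
    BEGIN {
      OFS = FS
      # Build the month-name lookup table from a single string.
      n = split("January February March April May June July August September October November December", names, " ")
      for (i = 1; i <= n; i++)
        m[names[i]] = i
    }
    {
      k = split($3, a, " ")
      # Rewrite only when there are three parts and the month is known.
      if (k == 3 && (a[2] in m))
        $3 = sprintf("%04d-%02d-%02d", a[3], m[a[2]], a[1])
    }
    1' "$@"
}
```

Run it as convert_dates_awk filename.csv. No external process is started, so it keeps the performance benefit of the manual approach while leaving "null" and any other non-matching value as it is.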
This should work with your given input:
awk -F'\\|' -vOFS="|" '!/null/{cmd="date -d \""$3"\" +%F";cmd | getline $3;close(cmd)}1' file
| 1 |1966-04-28
| 2 | null
| 3 | null
| 4 |1987-06-30
I would suggest using a language that supports parsing dates, like perl:
$ cat file
1 28 April 1966
2 null
3 null
4 30 June 1987
$ perl -F'\t' -MTime::Piece -lane 'print "$F[0]\t",
$F[1] eq "null" ? $F[1] : Time::Piece->strptime($F[1], "%d %B %Y")->strftime("%F")' file
1 1966-04-28
2 null
3 null
4 1987-06-30
The Time::Piece core module allows you to parse and format dates, using the standard format specifiers of strftime. This solution splits the input on a tab character and modifies the format if the second field is not "null".
This approach will be much faster than using system calls or invoking subprocesses, as everything is done in native perl.
Here is how you can do this in pure BASH and avoid calling system or getline from awk:
while IFS=$'\t' read -ra arr; do
[[ ${arr[1]} != "null" ]] && arr[1]=$(date -d "${arr[1]}" +%F)
printf "%s\t%s\n" "${arr[0]}" "${arr[1]}"
done < file
1 1966-04-28
2 null
3 null
4 1987-06-30
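In the loop above, a value that date cannot parse is replaced by an empty string, because the assignment happens regardless of date's exit status. A minimal sketch of a variant that keeps the original value in that case follows. It is written in portable sh syntax so it also runs under bash, it assumes GNU date and two tab-separated columns, and the name convert_dates_loop is hypothetical.

```shell
#!/bin/sh
# Sketch only: convert_dates_loop is a hypothetical name.
# Assumes GNU date (for -d) and two tab-separated columns on stdin.
convert_dates_loop() {
  tab=$(printf '\t')
  while IFS=$tab read -r id value; do
    # Replace the value only if it is not "null" and date accepts it.
    if [ "$value" != "null" ] && converted=$(date -d "$value" +%F 2>/dev/null); then
      value=$converted
    fi
    printf '%s\t%s\n' "$id" "$value"
  done
}
```

Use it as convert_dates_loop < file. It still starts one date process per non-null line, so the performance concern raised earlier applies here as well.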
The following approach needs only one date call and leaves no room for code injection:
This script extracts the dates (using awk) into a temporary file, processes them with a single "date" call, and merges the results back (using awk).
awk -F '\t' 'match($3,/null/) { $3 = "0000-01-01" } { print $3 }' input > temp.$$
date --file=temp.$$ +%F > dates.$$
awk -F '\t' -v OFS='\t' 'BEGIN {
while ( getline < "'"dates.$$"'" > 0 )
{
f1_counter++
if ($0 == "0000-01-01") {$0 = "null"}
date[f1_counter] = $0
}
}
{$3 = date[NR]}
1' input
One-liner using bash process substitution (no temporary files):
inputfile=/path/to/input
awk -F '\t' -v OFS='\t' 'BEGIN {while ( getline < "'<(date -f <(awk -F '\t' 'match($3,/null/) { $3 = "0000-01-01" } { print $3 }' "$inputfile") +%F)'" > 0 ){f1_counter++; if ($0 == "0000-01-01") {$0 = "null"}; date[f1_counter] = $0}}{$3 = date[NR]}1' "$inputfile"
Here is how it can be used:
# configuration
input=/path/to/input
temp1=temp.$$
temp2=dates.$$
output=output.$$
# create the sample file (optional)
#printf "\t%s\n" $'1\t28 April 1966' $'2\tnull' $'3\tnull' $'4\t30 June 1987' > "$input"
# Extract all dates
awk -F '\t' 'match($3,/null/) { $3 = "0000-01-01" } { print $3 }' "$input" > "$temp1"
# transform the dates
date --file="$temp1" +%F > "$temp2"
# merge csv with transformed date
awk -F '\t' -v OFS='\t' 'BEGIN {while ( getline < "'"$temp2"'" > 0 ){f1_counter++; if ($0 == "0000-01-01") {$0 = "null"}; date[f1_counter] = $0}}{$3 = date[NR]}1' "$input" > "$output"
# print the output
cat "$output"
# cleanup
rm "$temp1" "$temp2" "$output"
#rm "$input"