Using awk to include file name with format in column
I'm working on wrangling some data to ingest into Hive. The problem is that I have overwrites in my historical data, so I need to include the file name in the text files so that I can discard the duplicated rows that have been updated in subsequent files.
The way I've chosen to go about this is to use awk to add the file name to each row of each file; then, after I ingest into Hive, I can use HQL to filter out my deprecated rows.
Here is my sample data (tab-delimited):
animal legs eyes
hippo 4 2
spider 8 8
crab 8 2
mite 6 0
bird 2 2
I've named it long_name_20180901.txt
I've figured out how to add my new column from this post:
awk '{print FILENAME (NF?"\t":"") $0}' long_name_20180901.txt
which results in:
long_name_20180901.txt animal legs eyes
long_name_20180901.txt hippo 4 2
long_name_20180901.txt spider 8 8
long_name_20180901.txt crab 8 2
long_name_20180901.txt mite 6 0
long_name_20180901.txt bird 2 2
But, being a beginner, I don't know how to augment this command to keep only the part of the file name matched by the capturing group in "long_name_(.{8,}).txt", under a header named file. Target output is:
file animal legs eyes
20180901 spider 8 8
20180901 crab 8 2
20180901 mite 6 0
20180901 bird 2 2
Thanks for your time!! I'm a total newbie to awk.
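You can use BEGIN to set the prefix to "file" for the header line, then reset it to the date part of the file name for the remaining lines.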
awk 'BEGIN{f="file\t"} NF{print f $0; if (f=="file\t") {l=split(FILENAME, a, /[_.]/); f=a[l-1]"\t"};}' long_name_20180901.txt
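To see why a[l-1] holds the date, here is a minimal sketch that runs the same split() call on the file name as a plain string. The variable names name, parts and n are only illustrative and are not part of the command above.

```shell
# Split the file name on "_" and "." and print the piece count and the
# second-to-last piece. No input file is needed because everything runs in BEGIN.
result=$(awk 'BEGIN {
    name = "long_name_20180901.txt"
    n = split(name, parts, /[_.]/)   # parts: long, name, 20180901, txt
    print n, parts[n-1]              # second-to-last piece is the date
}')
printf '%s\n' "$result"
```

This prints "4 20180901". Counting from the end of the array rather than the start means extra underscores or dots earlier in the name (or in a leading directory path) should not change which piece is picked, as long as the name still ends in _DATE.txt.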
This would handle one or multiple input files:
awk -v OFS='\t' '
NR==1 { print "file", $0 }
FNR==1 { n=split(FILENAME,t,/[_.]/); fname=t[n-1]; next }
{ print fname, $0 }
' *.txt
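As a rough check of the multi-file behaviour, the sketch below runs that same program over two small files in a scratch directory. The second file, long_name_20180902.txt, and its contents are made up for illustration; only the first file name comes from the question.

```shell
# Build two small tab-delimited files in a scratch directory.
tmp=$(mktemp -d)
printf 'animal\tlegs\teyes\nhippo\t4\t2\nmite\t6\t0\n' > "$tmp/long_name_20180901.txt"
# Hypothetical later file that overwrites the "mite" row.
printf 'animal\tlegs\teyes\nmite\t8\t0\n' > "$tmp/long_name_20180902.txt"

result=$(cd "$tmp" && awk -v OFS='\t' '
NR==1  { print "file", $0 }                                  # header once, from the first file
FNR==1 { n=split(FILENAME,t,/[_.]/); fname=t[n-1]; next }    # new file: take the date part, skip its header
{ print fname, $0 }                                          # prefix every data row with the date
' *.txt)

printf '%s\n' "$result"
rm -r "$tmp"
```

The header line appears once, and each data row carries the date of the file it came from, so both versions of the "mite" row survive and can be told apart later in Hive.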