简体   繁体   中英

Using awk to include file name with format in column

I'm working on wrangling some data to ingest into Hive. The problem is, I have overwrites in my historical data so I need to include the file name in the text files so that I can dispose of the duplicated rows which have been updated in subsequent files.

The way I've chosen to go about this is to use awk to add the file name to each file, then after I ingest into Hive I can use HQL to filter out my deprecated rows.

Here is my sample data (tab-delimited):

animal  legs    eyes
hippo   4       2
spider  8       8
crab    8       2
mite    6       0
bird    2       2

I've named it long_name_20180901.txt

I've figured out how to add my new column from this post :

awk '{print FILENAME (NF?"\t":"") $0}' long_name_20180901.txt

which results in:

long_name_20180901.txt  animal  legs    eyes
long_name_20180901.txt  hippo   4       2
long_name_20180901.txt  spider  8       8
long_name_20180901.txt  crab    8       2
long_name_20180901.txt  mite    6       0
long_name_20180901.txt  bird    2       2

But, being a beginner, I don't know how to augment this command to:

  1. make the column name (first line) something like "file_name"
  2. implement regex in awk to just extract the part of the file name that I need, and dispose of the rest. I really just want "long_name_(.{8,}).txt" (the stuff in the capturing group.

Target output is:

file  animal  legs    eyes
20180901  spider  8       8
20180901  crab    8       2
20180901  mite    6       0
20180901  bird    2       2

Thanks for your time!! I'm a total newbie to awk .

您可以使用BEGIN设置“文件”,然后将其重置为其余部分使用文件名。

awk 'BEGIN{f="file\t"} NF{print f $0; if (f=="file\t") {l=split(FILENAME, a, /[_.]/); f=a[l-1]"\t"};}' long_name_20180901.txt

This would handle one or multiple input files:

awk -v OFS='\t' '
    NR==1 { print "file", $0 }
    FNR==1 { n=split(FILENAME,t,/[_.]/); fname=t[n-1]; next }
    { print fname, $0 }
' *.txt

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM