Use awk to include a formatted file name in a column

Question

I'm working on wrangling some data to ingest into Hive. The problem is that my historical data contains overwrites, so I need to include the file name in the text files so that I can discard duplicated rows that were updated in subsequent files.

The way I've chosen to go about this is to use awk to add the file name to each file; then, after I ingest into Hive, I can use HQL to filter out the outdated rows.

Here is my sample data (tab-delimited):

animal  legs    eyes
hippo   4       2
spider  8       8
crab    8       2
mite    6       0
bird    2       2

I've named it long_name_20180901.txt.

I've figured out how to add my new column from this post:

awk '{print FILENAME (NF?"\t":"") $0}' long_name_20180901.txt

which results in:

long_name_20180901.txt  animal  legs    eyes
long_name_20180901.txt  hippo   4       2
long_name_20180901.txt  spider  8       8
long_name_20180901.txt  crab    8       2
long_name_20180901.txt  mite    6       0
long_name_20180901.txt  bird    2       2

But, being a beginner, I don't know how to augment this command to:

make the column name (first line) something like "file_name"
implement a regex in awk to extract just the part of the file name that I need and discard the rest. I really just want "long_name_(.{8,}).txt" (the part in the capturing group).

Target output is:

file  animal  legs    eyes
20180901  hippo   4       2
20180901  spider  8       8
20180901  crab    8       2
20180901  mite    6       0
20180901  bird    2       2

Thanks for your time! I'm a total newbie to awk.

Answer 1

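You can use a BEGIN block to set the prefix to "file" for the header line, then reset it to the date part of the file name for the remaining lines.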

awk 'BEGIN{f="file\t"} NF{print f $0; if (f=="file\t") {l=split(FILENAME, a, /[_.]/); f=a[l-1]"\t"};}' long_name_20180901.txt
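To see why a[l-1] ends up holding the date, it may help to look at what split() returns for this file name. The following is only a small sketch: the function name split_demo and the variable names are made up for illustration, and it runs entirely in BEGIN, so no input file is needed.

```shell
# Hypothetical helper: print each piece that split() produces for a name.
split_demo() {
    awk -v name="$1" 'BEGIN {
        n = split(name, parts, /[_.]/)   # split on "_" or "."
        for (i = 1; i <= n; i++)
            printf "%d=%s\n", i, parts[i]
        print "date=" parts[n-1]         # second-to-last piece
    }'
}

split_demo long_name_20180901.txt
```

For long_name_20180901.txt the pieces are long, name, 20180901 and txt, so the second-to-last one is the date. Note that this relies on the file name having exactly one dot (the extension) after the date.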

Answer 2

This would handle one or multiple input files:

awk -v OFS='\t' '
    NR==1 { print "file", $0 }
    FNR==1 { n=split(FILENAME,t,/[_.]/); fname=t[n-1]; next }
    { print fname, $0 }
' *.txt
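Since the question asked about a regex, here is a possible variant that uses match() instead of split(). It is a sketch under some assumptions rather than a definitive solution: it assumes the part you want is the first run of 8 digits in the base name of the file, it names the header column "file_name", and the function name add_date_column and the second sample file are made up for illustration. The digit class is spelled out eight times because not every awk supports the {8} interval syntax.

```shell
# Hypothetical helper: prefix each data row with the 8-digit date
# found in the name of the file the row came from.
add_date_column() {
    awk -v OFS='\t' '
        FNR == 1 {
            name = FILENAME
            sub(/.*\//, "", name)        # drop any leading directory
            # Find the first run of 8 digits in the base name.
            if (match(name, /[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/))
                fname = substr(name, RSTART, RLENGTH)
            else
                fname = name             # fall back to the whole name
            if (NR == 1) print "file_name", $0   # header from first file only
            next                         # skip the header line of every file
        }
        NF { print fname, $0 }           # ignore blank lines
    ' "$@"
}

# Made-up demo data: two small tab-delimited files in a temp directory.
tmp=$(mktemp -d)
printf 'animal\tlegs\teyes\nhippo\t4\t2\n' > "$tmp/long_name_20180901.txt"
printf 'animal\tlegs\teyes\ncrab\t8\t2\n'  > "$tmp/long_name_20180902.txt"
add_date_column "$tmp"/long_name_*.txt
rm -r "$tmp"
```

Each output row starts with the date taken from its own file, which is what the later de-duplication step in Hive would need.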

Answer 1: score 1, posted 2019-02-26 20:23:11
Answer 2: score 1, accepted, posted 2019-02-26 20:31:50
