

Vlookup-like function using awk in ksh

Disclaimers:

1) English is my second language, so please forgive any grammatical horrors you may find. I am pretty confident you will be able to understand what I need despite them.

2) I have found several examples on this site that address questions/problems similar to mine, though I was unfortunately not able to figure out the modifications that would need to be introduced to fit my needs.

3) You will find some text in capital letters here and there. It is of course not me "shouting" at you, but only a way to make portions of text stand out. Please do not take it as an act of impoliteness.

4) For those of you who get to the bottom of this novella alive, THANKS IN ADVANCE for your patience, even if you are not able to help (or do not feel like helping) me. My excuse would be the fact that, after surfing the site for a while, I noticed that the most common "complaint" from people willing to help seems to be the lack of information (and/or its poor quality) provided by those seeking help. I therefore preferred to be accused of over-wording if need be... It would, at least, not be a common offense...


The "Problem": 问题”:

I have 2 files (a and b, for simplicity). File a has 7 columns separated by commas. File b has 2 columns separated by commas.

What I need: whenever the data in the 7th column of file a matches (EXACT MATCHES ONLY) the data in the 1st column of file b, a new line containing the whole line of file a plus column 2 of file b is to be appended to a new file "c".

--- MORE INFO IN THE NOTES AT THE BOTTOM ---

file a:

Server Name,File System,Path,File,Date,Type,ID
horror,/tmp,foldera/folder/b/folderc,binaryfile.bin,2014-01-21 22:21:59.000000,typet,aaaaaaaa
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333
hostile,/sad,folder22,higefile.hug,2016-06-17 18:43:12.000000,typeasd,77777777
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999

file b:

ID,Size
11111111,215915
22222222,1716
33333333,212856
44444444,1729
55555555,215927
66666666,1728
88888888,1729
99999999,213876
bbbbbbbb,26669080

Expected file c:

Server Name,File System,Path,File,Date,Type,ID,Size
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111,215915
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222,1716
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666,1728
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333,212856
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444,1729
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555,215927
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999,213876

Additional notes:

0) Notice how the line with ID "aaaaaaaa" in file a does not make it into file c, since ID "aaaaaaaa" is not present in file b. Likewise, the line with ID "bbbbbbbb" in file b does not make it into file c, since ID "bbbbbbbb" is not present in file a and is therefore never looked for in the first place.

1) The data is clearly completely made up due to confidentiality issues, though the examples provided fairly closely resemble what the real files look like.

2) I added headers just to provide a better idea of the nature of the data. The real files don't have them, so there is no need to skip them in the source files or to create one in the destination file.

3) Both files come sorted by default, meaning that IDs will be properly sorted in file b, while they will most likely be scrambled in file a. File c should preferably follow the order of file a (though I can manipulate it later to fit my needs anyway, so no worries there, as long as the code does what I need and does not mess up the data by combining the wrong lines).

4) VERY VERY VERY IMPORTANT:

4.a) I already have a "working" ksh script (attached below) that uses cat, grep, while and if to do the job. It worked like a charm (well, acceptably) with 160K-line sample files: it was able to output roughly 60K lines an hour which, projected, would yield an acceptable "20 days" to produce 30 million lines [KEEP ON READING]. But somehow (and I have plenty of processor and memory capacity) cat and/or grep seem to be struggling to process a real-life 5-million-line file (both file a and file b can have up to 30 million lines each, so that is also the maximum probable number of lines in the resulting file, even assuming 100% of the lines in file a find their match in file b), and file c is now being fed only a couple hundred lines every 24 hours.

4.b) I was told that awk, being stronger, should succeed where the weaker commands I worked with seem to fail. I was also told that working with arrays might be the solution to my performance problem, since all data is loaded into memory at once and worked on from there, instead of having to cat | grep file b as many times as there are lines in file a, as I am currently doing.

4.c) I am working on AIX, so I only have sh and ksh, no bash; therefore I cannot use the array tools provided by the latter. That is why I thought of awk: that, and the fact that I think awk is probably "stronger", though I might (probably?) be wrong.
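(Aside: if the machine happens to have ksh93, which AIX typically installs as /usr/bin/ksh93 next to the ksh88 /usr/bin/ksh, then associative arrays are available via typeset -A, and the in-memory lookup idea from note 4.b can be sketched in pure shell. This is a hypothetical sketch only, reusing the question's file variables and making up short field names; awk, as shown further down, remains the better tool at this scale:)

typeset -A size                      # associative array: ID -> Size

# single pass over file b: load every ID/Size pair into memory
while IFS=, read -r id sz; do
    size[$id]=$sz
done < "${file_b}"

# single pass over file a: append the size whenever the ID is known
while IFS=, read -r srv fs dir fn dt tp id; do
    [[ -n "${size[$id]}" ]] && print -r -- "${srv},${fs},${dir},${fn},${dt},${tp},${id},${size[$id]}"
done < "${file_a}" > "${file_c}.csv"

(Even with no external commands per line, plain shell parsing of 30 million lines will still be far slower than awk's.)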

Now, I present to you the magnificent piece of ksh code (obvious sarcasm here, though I like the idea of you picturing, for a brief moment, the image of the monkey holding up a cub and showing all the other jungle-crawlers their future lion king) that I have managed to develop (feel free to laugh as hard as you need while reading this code; I will not be able to hear you anyway, so no feelings harmed :P ):

cat "${file_a}" | while read -r line_file_a; do

    server_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $1}'`
    filespace_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $2}'`
    folder_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $3}'`
    file_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $4}'`
    file_date_file_a=`echo "${line_file_a}" | awk -F"," '{print $5}'`
    file_type_file_a=`echo "${line_file_a}" | awk -F"," '{print $6}'`
    file_id_file_a=`echo "${line_file_a}" | awk -F"," '{print $7}'`

    cat "${file_b}" | grep ${object_id_file_a} | while read -r line_file_b; do

        file_id_file_b=`echo "${line_file_b}" | awk -F"," '{print $1}'`
        file_size_file_b=`echo "${line_file_b}" | awk -F"," '{print $2}'`

        if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then

            echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}" >> ${file_c}.csv

        fi

    done

done

One last additional note, just in case you wonder:

The "if" section was not only built as a mean to articulate the output line, but it servers a double purpose, while safe-proofing any false positives that may derive from grep, IE 100 matching 1000 (Bear in mind that, as I mentioned earlier, I am working on AIX, so my grep does not have the -m switch the GNU one has, and I need matches to be exact/absolute). “ if”部分不仅用作表达输出线的手段,而且还具有双重目的,同时安全地验证可能源自grep(IE 100匹配1000)的任何误报(请注意,前面提到过,我正在AIX上工作,因此我的grep没有使用GNU所具有的-m开关,因此我需要匹配(精确/绝对)。

You have reached the end. CONGRATULATIONS! You have been awarded the medal for patience.
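The accepted approach, from the answer by @jas referenced at the end, replaces the whole pipeline above with a three-line awk script: the first file (b) is loaded into an array keyed by ID, and the second file (a) is then streamed through it: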

$ cat stuff.awk
BEGIN { FS=OFS="," }            # fields are comma-separated, on input and output
NR == FNR { a[$1] = $2; next }  # first file (b): remember Size keyed by ID
$7 in a { print $0, a[$7] }     # second file (a): on an exact ID hit, append Size

Note the order in which the files are given to the awk command: b first, followed by a, so the lookup array is fully built before file a is scanned:

$ awk -f stuff.awk b.txt a.txt
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111,215915
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222,1716
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666,1728
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333,212856
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444,1729
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555,215927
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999,213876

EDIT: updated calculation. You can estimate how often the original script calls another program:
at least 7 awk's + 1 cat + 1 grep for each line in file a, i.e. 9 × 160,000 process creations.
On top of that, for each hit in file b there are 2 awk's plus one file open and one file close for the >> redirect; with 60K output lines, that is another 4 × 60,000.
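Summed up for the 160K-line sample, that is roughly:

    9 * 160,000 = 1,440,000  process creations while scanning file a
    4 * 60,000  =   240,000  processes and open/close cycles for the 60K hits
                  ---------
                 ~1,680,000  fork/exec and file operations in total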

A small change in the code brings this down to "only" 160,000 grep calls:

cat "${file_a}" | while IFS=, read -r server_name_file_a \
   filespace_name_file_a folder_name_file_a file_name_file_a \
   file_date_file_a file_type_file_a file_id_file_a; do
   grep "${object_id_file_a}" "${file_b}" | while IFS="," read -r line_file_b; do
        if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
            echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}" 
        fi
    done
done >> ${file_c}.csv

Well, try this with your 160K files and see how much faster it is.
Before I explain why this is still the wrong way, I will make another small improvement: replacing the cat that feeds the while loop with an input redirection at the end (after done).

while IFS=, read -r server_name_file_a \
   filespace_name_file_a folder_name_file_a file_name_file_a \
   file_date_file_a file_type_file_a file_id_file_a; do
   grep "${file_id_file_a}" "${file_b}" | while IFS=, read -r file_id_file_b file_size_file_b; do
        if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
            echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}"
        fi
    done
done < "${file_a}" >> "${file_c}.csv"

The main drawback of these solutions is that grep still reads the complete file_b again and again, once for each line in file a.

This solution is a nice improvement in performance, but there is still a lot of overhead from grep. Another huge improvement can be found with awk.
The best solution is using awk as explained in "What is 'NR==FNR' in awk?" and shown in the answer of @jas above: only one process is started, and both files are read only once.
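For completeness, here is that same awk program as a single command, wired into the variable names used in the question (a sketch; ${file_b}, ${file_a} and ${file_c} are assumed to be set as in the original script):

awk -F',' -v OFS=',' '
    NR == FNR  { size[$1] = $2; next }   # first pass: load file b into memory
    $7 in size { print $0, size[$7] }    # second pass: exact join on ID
' "${file_b}" "${file_a}" > "${file_c}.csv"

One process, two sequential reads, and the output keeps the order of file a, as requested in note 3.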
