
Conditional Awk hashmap match lookup

I have 2 tabular files. 我有2个表格文件。 One file contains a mapping of 50 key values only called lookup_file.txt. 一个文件包含50个键值的映射,仅称为lookup_file.txt。 The other file has the actual tabular data with 30 columns and millions of rows. 另一个文件具有30列和数百万行的实际表格数据。 data.txt I would like to replace the id column of the second file with the values from the lookup_file.txt. data.txt我想用lookup_file.txt中的值替换第二个文件的id列 .

How can I do this? I would prefer using awk in a bash script. Also, is there a hashmap data structure I can use in bash for storing the 50 key/values rather than another file?
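As an aside on the hashmap part of the question: bash 4+ does have associative arrays, which behave like a hashmap. A minimal sketch, assuming the lookup file holds comma-separated key/value pairs (the delimiter wasn't specified in the question):

declare -A map                      # bash 4+ associative array
while IFS=, read -r key value; do   # split each line on the first comma
   map["$key"]=$value
done < lookup_file.txt
echo "${map[3]}"                    # look up the value for id 3

That said, for rewriting millions of rows you still want the per-line work done in awk rather than in a bash loop.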

Assuming your files have comma-separated fields and the "id column" is field 3:

awk '
BEGIN{ FS=OFS="," }              # comma-separated input and output
NR==FNR { map[$1] = $2; next }   # first file only: build the id -> value map
{ $3 = map[$3]; print }          # second file: swap field 3 for its mapped value
' lookup_file.txt data.txt

If any of those assumptions are wrong, clue us in if the fix isn't obvious...
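One caveat with the sketch above: if an id in data.txt has no entry in lookup_file.txt, $3 = map[$3] replaces it with an empty string. A guarded variant that leaves unmatched ids untouched:

awk '
BEGIN{ FS=OFS="," }
NR==FNR { map[$1] = $2; next }
{ if ($3 in map) $3 = map[$3]; print }   # only replace ids that have a mapping
' lookup_file.txt data.txt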

EDIT: and if you want to avoid the (IMHO negligible) performance impact of the NR==FNR test, this would be one of those very rare cases when use of getline is appropriate:

awk '
BEGIN{
   FS=OFS=","
   # slurp the lookup file up front, before any data lines are read
   while ( (getline line < "lookup_file.txt") > 0 ) {
      split(line,f)             # split on FS (the comma)
      map[f[1]] = f[2]
   }
   close("lookup_file.txt")     # good practice for files opened via getline
}
{ $3 = map[$3]; print }
' data.txt

You could use a mix of "sort" and "join" via bash instead of having to write it in awk/sed, and it is likely to be even faster:

key.csv (id, name)

1,homer
2,marge
3,bart
4,lisa
5,maggie

data.csv (name, animal, owner, age)

snowball,dog,3,1
frosty,yeti,1,245
cujo,dog,5,4

Now, you need to sort both files first on the user id columns:

sort -t, -k1,1 key.csv  > sorted_keys.csv
sort -t, -k3,3 data.csv > sorted_data.csv

Now join the 2 files:

join -1 1 -2 3 -o "2.1 2.2 1.2 2.4" -t , sorted_keys.csv sorted_data.csv > replaced_data.csv

This should produce (note that the rows come out ordered by the join key, since both inputs were sorted on it):

frosty,yeti,homer,245
snowball,dog,bart,1
cujo,dog,maggie,4

This:

-o "2.1 2.2 1.2 2.4"

specifies which columns from the 2 files you want in the final output.

It is pretty fast for finding and replacing across multiple gigs of data compared to other scripting approaches. I haven't done a direct comparison to sed/awk, but it is much easier (for me at least) to write a bash script wrapping this than to write the equivalent in sed/awk.
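A minimal sketch of such a wrapper, assuming the example file names used above:

#!/bin/bash
# sort both files on their id columns, then join and keep the wanted fields
sort -t, -k1,1 key.csv  > sorted_keys.csv
sort -t, -k3,3 data.csv > sorted_data.csv
join -1 1 -2 3 -o "2.1 2.2 1.2 2.4" -t , sorted_keys.csv sorted_data.csv > replaced_data.csv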

Also, you can speed up the sort by using an upgraded version of GNU coreutils, so that you can do the sort in parallel:

sort --parallel=4 -t, -k3,3 data.csv > sorted_data.csv

4 being how many threads you want to run it in. I was told that 2 threads per machine core will usually max out the machine, but if it is dedicated just to this, that is fine.
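If you would rather derive the thread count from the machine than hard-code it, GNU coreutils also provides nproc, which reports the number of available cores:

sort --parallel="$(nproc)" -t, -k3,3 data.csv > sorted_data.csv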

There are several ways to do this. But if you want an easy one-liner, without much in the way of validation, I would go with an awk/sed solution.

Assume the following:

  1. the files are tab delimited

  2. you are using bash shell

  3. the id in the data file is in the first column

  4. your files look like this:

lookup

1   one
2   two
3   three
4   four
5   five

data

1   col2    col3    col4    col5
2   col2    col3    col4    col5
3   col2    col3    col4    col5
4   col2    col3    col4    col5
5   col2    col3    col4    col5

I would use awk and sed to accomplish this task like this:

awk '{print "sed -i s/^"$1"/"$2"/ data"}' lookup | bash

What this does is go through each line of lookup and write the following to stdout:

sed -i s/^1/one/ data

sed -i s/^2/two/ data

and so on. 等等。

It next pipes each line to the shell ( | bash ), which executes the sed expression. -i is for in-place editing; you may want -i.bak instead, to create a backup file (you can change the extension to whatever you would like). The sed is looking for the id at the start of the line, as indicated by the ^. You don't want to be replacing an 'id' in a column that might not contain an id.

Your output would look like the following:

one     col2    col3    col4    col5
two     col2    col3    col4    col5
three   col2    col3    col4    col5
four    col2    col3    col4    col5
five    col2    col3    col4    col5

Of course, your ids are probably not simply 1 to one, 2 to two, etc., but this might get you started in the right direction. And I use the term 'right' very loosely.

The way I'd do this is to use awk to write an awk program to process the larger file:

awk -f <(awk '
   BEGIN{print " BEGIN{"}
        {printf "      a[\"%s\"]=\"%s\";",$1,$2}
   END  {print "      }";
         print "      {$1=a[$1];print $0}"}
   ' lookup_file.txt
) data.txt

That assumes that the id column is column 1; if not, you need to change both instances of $1 in $1=a[$1].
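For illustration, with a lookup file like the 1 one / 2 two example from the previous answer (an assumption; the actual contents weren't given), the inner awk would emit roughly this program, which the outer awk -f then runs against data.txt. Note that all the map entries land on one line because printf emits no newlines:

 BEGIN{
      a["1"]="one";      a["2"]="two";      a["3"]="three";      a["4"]="four";      a["5"]="five";      }
      {$1=a[$1];print $0}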
