Split field into array in awk, then search each term in another file

I'm trying to decompose a field from a specific file into an array, and then check whether each term appears in a second file (which has already been stored in another array). The goal is to merge information from both files.

The first file, file1 (the one with the field I want to split), looks like this:

data1=data2=data3 some more stuff
data4=data1 this are things
data2=data5 more text here
...

file2 has this structure:

data1 10
data2 20
data3 35
data4 15
data5 60

I want to split the first field of file1 on =, then look up each of the resulting terms in the second file, and print everything in the following format:

Output:

data1=data2=data3 some more stuff 10
data1=data2=data3 some more stuff 20
data1=data2=data3 some more stuff 35
data4=data1 this are things 15
data4=data1 this are things 10
data2=data5 more text here 20
data2=data5 more text here 60

So far, I've got this:

awk 'NR==FNR {
l[$1] = $2; next
} {
la=split($1,a,"=")
for(x=1;x<=la;x++)
  print $0,l[a[$x]]
}' file2 file1 > output

First (when NR==FNR), I store the file2 data in the array l, using the first field as the key.

Then I parse the next file in the following manner: for each record, I split the field $1 into an array a, using = as the separator. The variable la stores the number of terms in the array a.

For each element in array a (the for loop), I look up the corresponding key in array l and print the current record followed by the value from l.
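The NR==FNR test from the first step can be seen in isolation with a small trace. This is only a sketch: the temporary directory and the shortened sample files are assumptions made for illustration, not part of the original setup. NR counts records across all input files, while FNR restarts at 1 for each file, so the two are equal only while the first file is being read.

```shell
# Assumed scratch directory and shortened sample data, for illustration only.
nrdir=$(mktemp -d)
printf 'data1 10\ndata2 20\n' > "$nrdir/file2"
printf 'data4=data1 this are things\n' > "$nrdir/file1"

# Print NR, FNR, and which branch the NR==FNR test would take.
trace=$(awk '{ print NR, FNR, (NR == FNR ? "first file" : "later file") }' \
    "$nrdir/file2" "$nrdir/file1")
printf '%s\n' "$trace"

rm -rf "$nrdir"
```

With these two files the trace reads "1 1 first file", "2 2 first file", "3 1 later file".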

But, for some reason, I only get the content from file1 (current, unwanted output):

data1=data2=data3 some more stuff 
data1=data2=data3 some more stuff 
data1=data2=data3 some more stuff 
data4=data1 this are things 
data4=data1 this are things 
data2=data5 more text here 
data2=data5 more text here 

Any ideas on what might be wrong with my code?

Thanks a lot!

awk to the rescue!

If your tokens are fixed length, you can do a pattern match without splitting the field:

$ awk 'NR==FNR{a[$1]=$2;next}
              {for(k in a) if($1~k) print $0, a[k]}' file2 file1

data1=data2=data3 some more stuff 10
data1=data2=data3 some more stuff 20
data1=data2=data3 some more stuff 35
data4=data1 this are things 10
data4=data1 this are things 15
data2=data5 more text here 20
data2=data5 more text here 60
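One caveat with the pattern match: `$1~k` matches the key anywhere inside the field, so it relies on no key being a prefix or substring of another. The sketch below is not from the original answer. It assumes a hypothetical extra key, data10, and a scratch directory, and it swaps the regex for a whole-token test that wraps both the field and the key in = before calling index().

```shell
# Assumed scratch directory; data10 is a hypothetical key added for illustration.
tokdir=$(mktemp -d)
printf 'data1 10\ndata10 99\n' > "$tokdir/file2"
printf 'data10=data2 some text\n' > "$tokdir/file1"

# Whole-token match: "=data1=" does not occur inside "=data10=data2=",
# so data1 is no longer reported for a record that only contains data10.
result=$(awk 'NR==FNR { a[$1] = $2; next }
     { for (k in a) if (index("=" $1 "=", "=" k "=")) print $0, a[k] }' \
    "$tokdir/file2" "$tokdir/file1")
printf '%s\n' "$result"

rm -rf "$tokdir"
```

Here only "data10=data2 some text 99" is printed, whereas the plain `$1~k` test would also print a line for data1.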

I found the answer myself. It was an issue with variable naming.

This is the correct code:

awk 'NR==FNR {
l[$1] = $2; next
} {
la=split($1,a,"=")
for(x=1;x<=la;x++)
  print $0,l[a[x]]
}' file2 file1 > output

The key is in the print statement. It now reads print $0,l[a[x]] instead of print $0,l[a[$x]]. The loop uses x as its counter, not $x. In awk, $x refers to field number x of the current record, so a[$x] used the text of that field as the index into a, which does not exist there. With the change, the lookup points to the correct key in array l (from file2).
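To make the difference concrete, here is a small sketch (the variable names line and demo are just for this illustration) that prints x, $x, and a[x] side by side for the first sample record of file1:

```shell
# One sample record taken from file1 in the question.
line='data1=data2=data3 some more stuff'

# x is the loop counter; $x is the x-th whitespace-separated field of the record;
# a[x] is the x-th piece of $1 after splitting on "=".
demo=$(printf '%s\n' "$line" | awk '{
    n = split($1, a, "=")
    for (x = 1; x <= n; x++)
        printf "x=%d  $x=%s  a[x]=%s\n", x, $x, a[x]
}')
printf '%s\n' "$demo"
```

The second line of output is "x=2  $x=some  a[x]=data2": $x picks up a word from the record, while a[x] is the term that actually exists as a key in l.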

I'm leaving the post up because it looks like this question hasn't been asked before. Please tell me if you think it's not useful.

Thanks!
