
Match strings in two files using awk and regexp

I have two files.

File 1 includes various types of SeriesDescriptions:

"SeriesDescription": "Type_*"
"SeriesDescription": "OtherType_*"
...

File 2 contains information with only one SeriesDescription:

"Name":"Joe"
"Age":"18"
"SeriesDescription":"Type_(Joe_text)"
...

I want to

  1. compare the two files and find the lines that match for "SeriesDescription", and
  2. print the line number of the matched text from File 1.

Expected output:

"SeriesDescription": "Type_*" 24 11 (the correct line numbers in my files)

"SeriesDescription" will always be found on line 11 of File 2. I am having trouble matching because of the *, and I have also tried changing it to .* without luck.

Code I have tried:

grep -nf File1.txt File2.txt

Successfully matches, but I want the line number from File1.

awk 'FNR==NR{l[$1]=NR; next}; $1 in l{print $0, l[$1], FNR}' File2.txt File1.txt

This finds a match and prints the line number from both files. However, it matches on the first column and prints the last line from File 1 as the match (since every line in File 1 has the same column 1).

awk 'FNR==NR{l[$2]=$3;l[$2]=NR; next}; $2 in l{print $0, l[$2], FNR}' File2.txt File1.txt

Does not produce a match.

I have also tried various settings of FS=":" without luck. I am not sure whether the trouble comes from the regex, the use of "" in the files, or something else. Any help would be greatly appreciated!

With your shown samples, please try the following. Written and tested in GNU awk; it should work in any awk.

awk '
{ val="" }
match($0,/^[^_]*_/){
  val=substr($0,RSTART,RLENGTH)
  gsub(/[[:space:]]+/,"",val)
}
FNR==NR{
  if(val){
    arr[val]=$0 OFS FNR
  }
  next
}
(val in arr){
  print arr[val] OFS FNR
}
' SeriesDescriptions file2

With your shown samples, the output will be:

"SeriesDescription": "Type_*" 1 3

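To check this result, the sample data from the question can be recreated and the command run against it. The following is a minimal sketch: the scratch directory and the `workdir` and `result` variable names are illustrative assumptions, and the literal "..." placeholder lines from the question are left out of the sample files.

```shell
#!/bin/sh
# Recreate the sample data in a scratch directory (names are illustrative).
workdir=$(mktemp -d)

cat > "$workdir/SeriesDescriptions" <<'EOF'
"SeriesDescription": "Type_*"
"SeriesDescription": "OtherType_*"
EOF

cat > "$workdir/file2" <<'EOF'
"Name":"Joe"
"Age":"18"
"SeriesDescription":"Type_(Joe_text)"
EOF

# Same program as above: key each line on its text up to the first "_".
result=$(awk '
{ val="" }
match($0,/^[^_]*_/){
  val=substr($0,RSTART,RLENGTH)
  gsub(/[[:space:]]+/,"",val)
}
FNR==NR{
  if(val){ arr[val]=$0 OFS FNR }
  next
}
(val in arr){ print arr[val] OFS FNR }
' "$workdir/SeriesDescriptions" "$workdir/file2")

printf '%s\n' "$result"   # prints: "SeriesDescription": "Type_*" 1 3
rm -rf "$workdir"
```

The 1 is the line number in SeriesDescriptions and the 3 is the line number in file2. With the real files from the question these would be 24 and 11.
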
Explanation: a detailed, commented version of the above.

awk '                            ##Start of the awk program.
{ val="" }                       ##Reset val for every input line.
match($0,/^[^_]*_/){             ##Match from the start of the line through the first _.
  val=substr($0,RSTART,RLENGTH)  ##Store the matched substring in val.
  gsub(/[[:space:]]+/,"",val)    ##Remove all whitespace from val.
}
FNR==NR{                         ##True only while the first file is being read.
  if(val){                       ##If val is not empty.
    arr[val]=$0 OFS FNR          ##Store the current line, OFS and its line number in arr, keyed by val.
  }
  next                           ##Skip the remaining rules for this line.
}
(val in arr){                    ##While reading the second file: if val is a key in arr.
  print arr[val] OFS FNR         ##Print the stored value, then OFS and the current line number.
}
' SeriesDescriptions file2       ##Input file names.
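
Note that the program above keys only on the text up to the first underscore, so whatever follows the * in File 1 is never compared. If the * should instead act as a wildcard, one possible variation is sketched below. It is not part of the original answer: `glob_match` is a hypothetical helper, it assumes at most one * per pattern, and it ignores whitespace on both sides. It uses plain string functions, so no regex characters in the data need escaping.

```shell
#!/bin/sh
# Sample data from the question, in a scratch directory (names are illustrative).
workdir=$(mktemp -d)

cat > "$workdir/SeriesDescriptions" <<'EOF'
"SeriesDescription": "Type_*"
"SeriesDescription": "OtherType_*"
EOF

cat > "$workdir/file2" <<'EOF'
"Name":"Joe"
"Age":"18"
"SeriesDescription":"Type_(Joe_text)"
EOF

wild=$(awk '
# Hypothetical helper: true if s equals g, where a single "*" in g
# stands for any run of characters (including none).
function glob_match(s, g,    p, pre, suf) {
  p = index(g, "*")
  if (p == 0) return (s "") == (g "")      # no wildcard: plain string equality
  pre = substr(g, 1, p - 1)                # literal text before the *
  suf = substr(g, p + 1)                   # literal text after the *
  if (length(s) < length(pre) + length(suf)) return 0
  return substr(s, 1, length(pre)) == pre &&
         substr(s, length(s) - length(suf) + 1) == suf
}
FNR==NR {                                  # first file: remember every pattern
  raw[FNR] = $0
  pat = $0
  gsub(/[[:space:]]+/, "", pat)            # ignore spacing differences
  glob[FNR] = pat
  n = FNR
  next
}
{                                          # second file: test each line against all patterns
  line = $0
  gsub(/[[:space:]]+/, "", line)
  for (i = 1; i <= n; i++)
    if (glob_match(line, glob[i])) print raw[i], i, FNR
}
' "$workdir/SeriesDescriptions" "$workdir/file2")

printf '%s\n' "$wild"   # prints: "SeriesDescription": "Type_*" 1 3
rm -rf "$workdir"
```

Unlike the array lookup, this loops over every pattern for every line of the second file, which is slower for a large File 1 but also checks the text after the *.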

Bonus solution: If the above works for you and this match occurs only once in file2, you can exit the program after the first match to make it faster. In that case, use the code as follows.

awk '
{ val="" }
match($0,/^[^_]*_/){
  val=substr($0,RSTART,RLENGTH)
  gsub(/[[:space:]]+/,"",val)
}
FNR==NR{
  if(val){
    arr[val]=$0 OFS FNR
  }
  next
}
(val in arr){
  print arr[val] OFS FNR
  exit
}
' SeriesDescriptions file2
