
If the first column of file1 matches any string in file2, then replace it with the second column of file1

I have a problem that I haven't been able to solve. I would like to manipulate these files: if the first column of file1 matches any string in file2, replace that string with the second column of file1. Then collapse the result (I mean, I need only unique values per field or "cell" in the second column of the output file).
It doesn't matter which language solves this (awk, perl, python). The files contain 100000 lines or more. I've been trying one-line awk scripts, but without success.

Any help appreciated.
Regards

file1.txt

ID100000360640  ITEM1;ITEM2  
ID100000360638  ITEM1;ITEM3  
ID100000360644  ITEM1;ITEM4  
ID100000363115  ITEM5;ITEM2;ITEM3  
ID100000363116  ITEM1;ITEM7  
ID100000382126  ITEM8;ITEM1  
ID100000002165  ITEM1;ITEM2;ITEM3;ITEM9  
ID100000002596  ITEM1;ITEM10  
ID100000003084  ITEM1  

file2.txt

ID200000000419  ID100000360638;ID100000360640;ID100000360644;ID100000394921
ID200000000938 ID100000363115;ID100000363116;ID100000363117;ID100000382126  
ID200000001036  ID100000002165;ID100000398119 

output_expected.txt

ID200000000419  ITEM1;ITEM3;ITEM1;ITEM2;ITEM1;ITEM4;ID100000394921  
ID200000000938  ITEM5;ITEM2;ITEM3;ITEM1;ITEM7;ID100000363117;ITEM8;ITEM1;  
ID200000001036  ITEM1;ITEM2;ITEM3;ITEM9;ID100000398119  

processed_output.txt

ID200000000419  ITEM1;ITEM2;ITEM3;ITEM4;ID100000394921  
ID200000000938  ITEM1;ITEM2;ITEM3;ITEM5;ITEM7;ITEM8;ID100000363117;  
ID200000001036  ITEM1;ITEM2;ITEM3;ITEM9;ID100000398119 

Thanks

Using Python 3:

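#!/usr/bin/python3
# Build an ID -> items mapping from file1, then substitute every ID found in file2.
with open('file1.txt') as f, open('file2.txt') as r:
    d = {}
    for line in f:
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            d[fields[0]] = fields[1]
    j = r.read()
    # Plain text replacement of each known ID; unknown IDs are left untouched.
    for k, v in d.items():
        j = j.replace(k, v)
    print(j)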

Output:

ID200000000419  ITEM1;ITEM3;ITEM1;ITEM2;ITEM1;ITEM4;ID100000394921
ID200000000938 ITEM5;ITEM2;ITEM3;ITEM1;ITEM7;ID100000363117;ITEM8;ITEM1  
ID200000001036  ITEM1;ITEM2;ITEM3;ITEM9;ID100000398119 
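
The output above still contains repeated items, so it matches output_expected.txt rather than processed_output.txt. A minimal sketch of the missing collapse step, assuming each line has exactly two whitespace-separated fields and that first-seen order should be kept (the function name `collapse_line` is made up for this example):

```python
def collapse_line(line):
    """Drop repeated ';'-separated items from the second field, keeping first-seen order."""
    key, items = line.split()
    # dict.fromkeys keeps insertion order (Python 3.7+), so it acts as an ordered set.
    # Empty strings produced by a trailing ';' are skipped.
    unique = dict.fromkeys(item for item in items.split(';') if item)
    return key + '  ' + ';'.join(unique)


if __name__ == '__main__':
    print(collapse_line('ID200000000419  ITEM1;ITEM3;ITEM1;ITEM2;ITEM1;ITEM4;ID100000394921'))
```

Applied to every line of the substituted text, this should give one unique item per "cell", although not in the sorted order shown in processed_output.txt.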

This produces the collapsed output:

$ awk 'FNR==NR{a[$1]=$2;next} {c="";delete d;delete e;split($2, b, /;/);for (i in b)c=c";"(a[b[i]]?a[b[i]]:b[i]);split(substr(c,2),d,/;/); for(i in d)e[d[i]]=1; c=""; for (i in e){c=c";"i}; print $1,substr(c,2)}' file1.txt file2.txt
ID200000000419 ID100000394921;ITEM1;ITEM2;ITEM3;ITEM4
ID200000000938 ITEM1;ITEM2;ITEM3;ID100000363117;ITEM5;ITEM7;ITEM8
ID200000001036 ITEM1;ITEM2;ITEM3;ID100000398119;ITEM9

How it works

  • FNR==NR{a[$1]=$2;next}

    While we are reading the first file, this builds an associative array a with the first field as the key and the second field as the value. Thus, the value of a[ID100000360640] is ITEM1;ITEM2. This is done for all lines of file1.txt. The next statement skips all the remaining commands and jumps to the next line.

  • c="";delete d;delete e

    If we have gotten here, we are working on the second file, file2.txt. These three commands reset variable c and arrays d and e for the new line.

  • split($2, b, /;/)

    This splits the second field on semicolons and assigns the result to array b.

  • for (i in b)c=c";"(a[b[i]]?a[b[i]]:b[i])

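    This builds the uncollapsed output in c: each element of b that has a non-empty value in a is replaced by that value, and any other element is kept unchanged.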

  • split(substr(c,2),d,/;/); for(i in d)e[d[i]]=1

    This creates an associative array e whose keys are the fields of the uncollapsed output. Since array keys are unique, duplicates disappear at this point.

  • c=""

    This resets c to an empty string before we add the collapsed output to it.

  • for (i in e)c=c";"i

    For each key in array e, we append the key to string c. This creates the collapsed output. The order in which for (i in e) visits the keys is not specified, which is why the items are not in their original order.

  • print $1,substr(c,2)

    This prints the first field followed by the collapsed list, without its leading semicolon.
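
For readers more comfortable with Python, the steps above can be sketched roughly as follows. This is an illustration under assumptions, not a translation of the author's exact command: the function names are invented here, each file2 line is assumed to have exactly two fields, and the items are sorted as plain strings only to make the order deterministic.

```python
def build_lookup(file1_lines):
    """Map the first field of each file1 line to its second field (awk array `a`)."""
    lookup = {}
    for line in file1_lines:
        fields = line.split()
        if len(fields) >= 2:
            lookup[fields[0]] = fields[1]
    return lookup


def substitute_and_collapse(line, lookup):
    """Replace known IDs in the second field, then keep each item once."""
    key, ids = line.split()
    # Uncollapsed output: each known ID becomes its item list, unknown IDs stay as they are.
    expanded = ';'.join(lookup.get(i, i) for i in ids.split(';'))
    # Collapsed output: a set plays the role of awk array `e`. Sorting (plain string
    # order) is an addition of this sketch; awk's `for (i in e)` order is unspecified.
    return key + ' ' + ';'.join(sorted(set(expanded.split(';'))))


if __name__ == '__main__':
    file1 = ['ID100000002165  ITEM1;ITEM2;ITEM3;ITEM9']
    file2_line = 'ID200000001036  ID100000002165;ID100000398119'
    print(substitute_and_collapse(file2_line, build_lookup(file1)))
```

Because the lookup is a dictionary, each file2 line costs one lookup per ID rather than one pass over all of file1, which matters for files with 100000 lines or more.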

A reasonably short awk way

awk 'FNR==NR{a[$1]=$2;next}
     {for(i in a)gsub(i,a[i])
      x=split($2,b,";")
      for(i=1;i<=x;i++)y!~b[i]";"&&y=y?y";"b[i]:b[i];$2=y;y=""}1' file1.txt file2.txt

Output

ID200000000419 ITEM1;ITEM3;ITEM2;ITEM4;ID100000394921
ID200000000938 ITEM5;ITEM2;ITEM3;ITEM1;ITEM7;ID100000363117;ITEM8
ID200000001036 ITEM1;ITEM2;ITEM3;ITEM9;ID100000398119

How it works

FNR==NR{a[$1]=$2;next}

When the file record number (FNR) equals the total record number (NR), which is only true while reading the first file, store the second field in an array using the first field as the key. next means skip all further instructions and go to the next record.

for(i in a)gsub(i,a[i])

Now we are in the second file, because FNR no longer equals NR.
For each key in the array, gsub replaces everything in the current line that matches the key with the value stored under that key.

x=split($2,b,";")

Split the second field into array b, using ; as the separator.
Assign the number of elements to x.

for(i=1;i<=x;i++)

Loop from 1 to the size of the array.

y!~b[i]";"&&  

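Continue only if y does not already match b[i] followed by a semicolon, i.e. only if the value is not already in the string. Because of the trailing semicolon in the pattern, an item that is currently the last one in y is not detected.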

y=y?y";"b[i]:b[i] 

If y is non-empty, append a semicolon and the value in b[i] to it; otherwise just set y to b[i].

$2=y;y=""

Set the second field to the value in y (our new string) and reset y to an empty string. The trailing 1 is an always-true pattern, so awk prints the modified record.
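
The duplicate test in this answer is a pattern match against the string built so far, not a comparison of whole items. A rough Python sketch contrasting the two approaches is shown here. Both function names are made up for this illustration, and the substring check stands in for awk's regex match, which is equivalent only while the items contain no regex metacharacters.

```python
def dedup_by_pattern(items):
    """Mimic the awk loop: append an item unless item + ';' occurs in the string so far."""
    y = ''
    for item in items:
        if (item + ';') not in y:
            y = y + ';' + item if y else item
    return y


def dedup_exact(items):
    """Append an item only if it has not been seen before as a whole item."""
    seen = set()
    kept = []
    for item in items:
        if item not in seen:
            seen.add(item)
            kept.append(item)
    return ';'.join(kept)


if __name__ == '__main__':
    sample = 'ITEM8;ITEM1;ITEM1'.split(';')
    print(dedup_by_pattern(sample))  # the adjacent repeat at the end is kept
    print(dedup_exact(sample))       # each item appears once
```

With the sample data from the question the two give the same result, since no item is immediately followed by a copy of itself there; the difference only shows up on inputs like the one in this sketch.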


Resources

https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
