If the first column of file1 matches any string in file2, replace it with the second column of file1

Question

I have a problem I haven't been able to solve. I would like to manipulate these files: if the first column of file1 matches any string in file2, replace that string with the second column of file1. Then collapse the result, i.e. keep only unique values per field (or "cell") in the second column of the output file.
It doesn't matter which language solves this (awk, perl, python). The files contain 100000 lines or more. I've been trying one-line awk scripts, but without success.

Any help appreciated.
Regards

file1.txt

ID100000360640  ITEM1;ITEM2  
ID100000360638  ITEM1;ITEM3  
ID100000360644  ITEM1;ITEM4  
ID100000363115  ITEM5;ITEM2;ITEM3  
ID100000363116  ITEM1;ITEM7  
ID100000382126  ITEM8;ITEM1  
ID100000002165  ITEM1;ITEM2;ITEM3;ITEM9  
ID100000002596  ITEM1;ITEM10  
ID100000003084  ITEM1

file2.txt

ID200000000419  ID100000360638;ID100000360640;ID100000360644;ID100000394921
ID200000000938 ID100000363115;ID100000363116;ID100000363117;ID100000382126  
ID200000001036  ID100000002165;ID100000398119

output_expected.txt

ID200000000419  ITEM1;ITEM3;ITEM1;ITEM2;ITEM1;ITEM4;ID100000394921  
ID200000000938  ITEM5;ITEM2;ITEM3;ITEM1;ITEM7;ID100000363117;ITEM8;ITEM1;  
ID200000001036  ITEM1;ITEM2;ITEM3;ITEM9;ID100000398119

processed_output.txt

ID200000000419  ITEM1;ITEM2;ITEM3;ITEM4;ID100000394921  
ID200000000938  ITEM1;ITEM2;ITEM3;ITEM5;ITEM7;ITEM8;ID100000363117;  
ID200000001036  ITEM1;ITEM2;ITEM3;ITEM9;ID100000398119

Thanks

Answer 1

Using Python 3:

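#!/usr/bin/python3
# Build a lookup table from file1: first column -> second column.
d = {}
with open('file1.txt') as f:
    for line in f:
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            d[parts[0]] = parts[1]

# Replace every ID from file1 that occurs in file2 with its item list.
with open('file2.txt') as r:
    j = r.read()
for k in d:
    j = j.replace(k, d[k])
print(j)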

Output:

ID200000000419  ITEM1;ITEM3;ITEM1;ITEM2;ITEM1;ITEM4;ID100000394921
ID200000000938 ITEM5;ITEM2;ITEM3;ITEM1;ITEM7;ID100000363117;ITEM8;ITEM1  
ID200000001036  ITEM1;ITEM2;ITEM3;ITEM9;ID100000398119
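
The output above still contains repeated items, so it corresponds to output_expected.txt rather than processed_output.txt. Below is a minimal sketch of one way to add the collapsing step, assuming every line has exactly two whitespace-separated columns and that first-seen order is acceptable (processed_output.txt in the question uses a different order). `expand_and_collapse` is a hypothetical helper name, not part of the original answer. It looks each token up in the dictionary instead of calling `replace` once per key, which may also matter for files with 100000 lines.

```python
def expand_and_collapse(line, lookup):
    """Swap each ID for its item list, then keep only the first copy of each item."""
    key, ids = line.split()
    items = []
    for token in ids.split(";"):
        # IDs that are not in lookup are kept unchanged.
        items.extend(lookup.get(token, token).split(";"))
    # dict.fromkeys removes duplicates while keeping first-seen order;
    # empty strings (from a trailing ';') are dropped.
    unique = dict.fromkeys(item for item in items if item)
    return key + "  " + ";".join(unique)


lookup = {
    "ID100000360640": "ITEM1;ITEM2",
    "ID100000360638": "ITEM1;ITEM3",
    "ID100000360644": "ITEM1;ITEM4",
}
line = "ID200000000419  ID100000360638;ID100000360640;ID100000360644;ID100000394921"
print(expand_and_collapse(line, lookup))
# ID200000000419  ITEM1;ITEM3;ITEM2;ITEM4;ID100000394921
```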

Answer 2

This produces the collapsed output:

$ awk 'FNR==NR{a[$1]=$2;next} {c="";delete d;delete e;split($2, b, /;/);for (i in b)c=c";"(a[b[i]]?a[b[i]]:b[i]);split(substr(c,2),d,/;/); for(i in d)e[d[i]]=1; c=""; for (i in e){c=c";"i}; print $1,substr(c,2)}' file1.txt file2.txt
ID200000000419 ID100000394921;ITEM1;ITEM2;ITEM3;ITEM4
ID200000000938 ITEM1;ITEM2;ITEM3;ID100000363117;ITEM5;ITEM7;ITEM8
ID200000001036 ITEM1;ITEM2;ITEM3;ID100000398119;ITEM9

How it works

FNR==NR{a[$1]=$2;next}

While we are reading the first file, this creates an associative array a that uses the first field as the key and the second field as the value. Thus, the value of a[ID100000360640] is ITEM1;ITEM2. This is done for all lines of file1.txt. The next statement skips all the remaining commands and jumps to the next line.

c="";delete d;delete e

If we have gotten here, we are working on the second file, file2.txt. These three commands initialize variable c and arrays d and e for the new line.

split($2, b, /;/)

This splits the second field on semicolons and assigns the result to array b.

for (i in b)c=c";"(a[b[i]]?a[b[i]]:b[i])

This creates the uncompressed output: each element of b is replaced by its value in a if there is one, and otherwise kept as is.

split(substr(c,2),d,/;/); for(i in d)e[d[i]]=1

This creates an associative array e whose keys are the fields of the uncompressed output. Because array keys are unique, duplicates disappear.

c=""

This resets c to an empty string before we add the compressed output to it.

for (i in e)c=c";"i

For each key in array e, we append the key to string c. This creates the compressed output.

print $1,substr(c,2)

This prints the complete compressed line. substr(c,2) drops the leading semicolon.
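
The items in the output above come out in a different order from processed_output.txt because awk does not specify the traversal order of `for (i in e)`. Here is a rough Python equivalent of the same steps, sketched with a set and an explicit sort so that the order is at least deterministic. The plain lexicographic sort is an assumption, not something the question asked for, and `collapse_sorted` is a hypothetical name.

```python
def collapse_sorted(ids, lookup):
    """Expand IDs through lookup, then return the unique items in sorted order."""
    expanded = []
    for token in ids.split(";"):
        # Same idea as a[b[i]] ? a[b[i]] : b[i] in the awk one-liner.
        expanded.extend(lookup.get(token, token).split(";"))
    unique = set(expanded)  # plays the role of the keys of array e
    return ";".join(sorted(unique))


lookup = {"ID100000002165": "ITEM1;ITEM2;ITEM3;ITEM9"}
print(collapse_sorted("ID100000002165;ID100000398119", lookup))
# ID100000398119;ITEM1;ITEM2;ITEM3;ITEM9
```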

Answer 3

A reasonably short awk way:

awk 'FNR==NR{a[$1]=$2;next}
     {for(i in a)gsub(i,a[i])
      x=split($2,b,";")
      for(i=1;i<=x;i++)y!~b[i]";"&&y=y?y";"b[i]:b[i];$2=y;y=""}1' file1.txt file2.txt

Output

ID200000000419 ITEM1;ITEM3;ITEM2;ITEM4;ID100000394921
ID200000000938 ITEM5;ITEM2;ITEM3;ITEM1;ITEM7;ID100000363117;ITEM8
ID200000001036 ITEM1;ITEM2;ITEM3;ITEM9;ID100000398119

How it works

FNR==NR{a[$1]=$2;next}

While the file record number (FNR) equals the total record number (NR), which in effect means while reading the first file, store the second field in array a using the first field as the key. next skips all further instructions and moves on to the next record.
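
For comparison, the same first pass might look like this in Python. It is only a sketch: `build_lookup` is a hypothetical name, and skipping lines that do not have exactly two columns is an assumption about how to treat blank or malformed input.

```python
def build_lookup(lines):
    """Map the first column of each line to its second column."""
    lookup = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:  # assumption: ignore blank or malformed lines
            lookup[fields[0]] = fields[1]
    return lookup


sample = ["ID100000360640  ITEM1;ITEM2  ", "", "ID100000003084  ITEM1"]
print(build_lookup(sample))
# {'ID100000360640': 'ITEM1;ITEM2', 'ID100000003084': 'ITEM1'}
```

With real files, the function can be given an open file object, since iterating over one yields its lines.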

for(i in a)gsub(i,a[i])

Now we are in the second file, since FNR no longer equals NR.
For each element in the array, gsub replaces everything in the record that matches the key with the value stored in the array under that key.
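
Note that gsub runs once per key over the whole record and treats the key as a regular expression. Below is a sketch of a single-pass alternative in Python, assuming IDs never contain `;` or whitespace. `substitute_ids` and `TOKEN` are hypothetical names. Unlike gsub, this only replaces whole tokens, not substrings of longer tokens.

```python
import re

TOKEN = re.compile(r"[^;\s]+")  # a run of characters between ';' or whitespace


def substitute_ids(record, lookup):
    """Replace every token that is a key in lookup, in one pass over the record."""
    return TOKEN.sub(lambda m: lookup.get(m.group(), m.group()), record)


lookup = {"ID100000002165": "ITEM1;ITEM2;ITEM3;ITEM9"}
record = "ID200000001036  ID100000002165;ID100000398119"
print(substitute_ids(record, lookup))
# ID200000001036  ITEM1;ITEM2;ITEM3;ITEM9;ID100000398119
```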

x=split($2,b,";")

Split the second field into array b, using ; as the separator.
Assign the size of the array to x.

for(i=1;i<=x;i++)

Loop from 1 to the size of the array.

y!~b[i]";"&&

If variable y already contains the value b[i] followed by a semicolon, skip it.

y=y?y";"b[i]:b[i]

If y is non-empty, append ; and b[i] to it; otherwise just set y to b[i].

$2=y;y=""

Set the second field to the value of y (our new string) and reset y to the empty string.
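
One thing to watch, reading the expression closely: the test looks for b[i] followed by a semicolon, and the most recently appended value has no trailing semicolon yet, so a value that occurs twice in a row would slip through. This does not happen in the sample data. The sketch below compares that check with an exact membership test in Python; both function names are hypothetical, and the awk regex match is approximated by a plain substring test.

```python
def collapse_like_awk(tokens):
    """Mimic y !~ b[i]";" with a substring test instead of a regex match."""
    y = ""
    for tok in tokens:
        if (tok + ";") not in y:
            y = y + ";" + tok if y else tok
    return y


def collapse_exact(tokens):
    """Keep the first occurrence of each token, compared as whole values."""
    return ";".join(dict.fromkeys(tokens))


tokens = ["ITEM1", "ITEM1", "ITEM2"]
print(collapse_like_awk(tokens))  # ITEM1;ITEM1;ITEM2
print(collapse_exact(tokens))     # ITEM1;ITEM2
```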

Resources

https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
