If the first column of file1 matches any string in file2, then replace it with the second column of file1
I have not been able to solve this problem. I would like to manipulate these files: if the first column of file1 matches any string in file2, replace that string with the second column of file1. Then collapse the result, i.e. keep only unique values per field (or "cell") in the second column of the output file.
It doesn't matter which language solves this (awk, perl, python). The files contain 100000 lines or more. I've been trying one-line awk scripts, without success.
Any help appreciated.
Regards
file1.txt
ID100000360640 ITEM1;ITEM2
ID100000360638 ITEM1;ITEM3
ID100000360644 ITEM1;ITEM4
ID100000363115 ITEM5;ITEM2;ITEM3
ID100000363116 ITEM1;ITEM7
ID100000382126 ITEM8;ITEM1
ID100000002165 ITEM1;ITEM2;ITEM3;ITEM9
ID100000002596 ITEM1;ITEM10
ID100000003084 ITEM1
file2.txt
ID200000000419 ID100000360638;ID100000360640;ID100000360644;ID100000394921
ID200000000938 ID100000363115;ID100000363116;ID100000363117;ID100000382126
ID200000001036 ID100000002165;ID100000398119
output_expected.txt
ID200000000419 ITEM1;ITEM3;ITEM1;ITEM2;ITEM1;ITEM4;ID100000394921
ID200000000938 ITEM5;ITEM2;ITEM3;ITEM1;ITEM7;ID100000363117;ITEM8;ITEM1;
ID200000001036 ITEM1;ITEM2;ITEM3;ITEM9;ID100000398119
processed_output.txt
ID200000000419 ITEM1;ITEM2;ITEM3;ITEM4;ID100000394921
ID200000000938 ITEM1;ITEM2;ITEM3;ITEM5;ITEM7;ITEM8;ID100000363117;
ID200000001036 ITEM1;ITEM2;ITEM3;ITEM9;ID100000398119
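Thanks
Through python3. Note that this performs the substitution only; it does not collapse duplicate values.
#!/usr/bin/python3
with open('file1.txt') as f, open('file2.txt') as r:
    d = {}
    for line in f:
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            d[fields[0]] = fields[1]
    j = r.read()
    for k in d:
        # replace every occurrence of an ID from file1 with its item list
        j = j.replace(k, d[k])
    print(j)
Output: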
ID200000000419 ITEM1;ITEM3;ITEM1;ITEM2;ITEM1;ITEM4;ID100000394921
ID200000000938 ITEM5;ITEM2;ITEM3;ITEM1;ITEM7;ID100000363117;ITEM8;ITEM1
ID200000001036 ITEM1;ITEM2;ITEM3;ITEM9;ID100000398119
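The script above stops at the substitution step. As a minimal sketch of how the collapse step could be added in Python, assuming whitespace-separated columns and semicolon-separated values as in the samples: the helper names `load_map` and `collapse` are made up for illustration, and keeping the first occurrence of each value is only one possible ordering (the question's processed_output.txt uses a different one).

```python
def load_map(lines):
    """Build {first column: second column} from file1-style lines."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:          # skip blank or malformed lines
            mapping[fields[0]] = fields[1]
    return mapping


def collapse(line, mapping):
    """Replace known IDs in column 2, then keep each value once."""
    key, values = line.split()
    seen = {}                         # dict keeps insertion order (Python 3.7+)
    for token in values.split(";"):
        for item in mapping.get(token, token).split(";"):
            if item:                  # ignore empty pieces from a stray ';'
                seen[item] = None
    return key + " " + ";".join(seen)


file1 = [
    "ID100000360640 ITEM1;ITEM2",
    "ID100000360638 ITEM1;ITEM3",
    "ID100000360644 ITEM1;ITEM4",
]
file2 = ["ID200000000419 ID100000360638;ID100000360640;ID100000360644;ID100000394921"]

mapping = load_map(file1)
for line in file2:
    print(collapse(line, mapping))
# prints: ID200000000419 ITEM1;ITEM3;ITEM2;ITEM4;ID100000394921
```

For the real files, open file objects can be passed in place of the sample lists, since both helpers only iterate over lines. Each line of file2 then costs one dictionary lookup per ID instead of one replace pass per key of file1.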
This produces the collapsed output:
$ awk 'FNR==NR{a[$1]=$2;next} {c="";delete d;delete e;split($2, b, /;/);for (i in b)c=c";"(a[b[i]]?a[b[i]]:b[i]);split(substr(c,2),d,/;/); for(i in d)e[d[i]]=1; c=""; for (i in e){c=c";"i}; print $1,substr(c,2)}' file1.txt file2.txt
ID200000000419 ID100000394921;ITEM1;ITEM2;ITEM3;ITEM4
ID200000000938 ITEM1;ITEM2;ITEM3;ID100000363117;ITEM5;ITEM7;ITEM8
ID200000001036 ITEM1;ITEM2;ITEM3;ID100000398119;ITEM9
FNR==NR{a[$1]=$2;next}
While we are reading the first file, this creates an associative array a which associates the first field as a key with the second as a value. Thus, the value of a[ID100000360640] is ITEM1;ITEM2. This is done for all lines of file1.txt. The next statement causes all the remaining commands to be skipped and jumps to the next line.
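To see why FNR==NR selects only the first file, here is a small Python sketch that imitates awk's two record counters. The generator name `records` and the in-memory sample lists are assumptions made for illustration; awk itself maintains NR and FNR internally.

```python
def records(files):
    """Yield (NR, FNR, line) the way awk counts records across input files."""
    nr = 0
    for lines in files:
        for fnr, line in enumerate(lines, start=1):   # FNR restarts per file
            nr += 1                                   # NR keeps counting
            yield nr, fnr, line


file1 = ["ID100000360640 ITEM1;ITEM2", "ID100000360638 ITEM1;ITEM3"]
file2 = ["ID200000000419 ID100000360638;ID100000360640"]

a = {}
second_file_lines = []
for nr, fnr, line in records([file1, file2]):
    if fnr == nr:                 # true only while reading the first file
        key, value = line.split()
        a[key] = value
        continue                  # plays the role of awk's `next`
    second_file_lines.append(line)

print(a["ID100000360640"])        # ITEM1;ITEM2
```

One known caveat of the idiom carries over to the sketch: if the first file is empty, the counters are also equal on the first line of the second file.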
c="";delete d;delete e
If we have gotten here, that means that we are working on the second file, file2.txt. These three commands initialize variable c and arrays d and e for the new line.
split($2, b, /;/)
This splits the second field on semicolons and assigns the result to array b.
for (i in b)c=c";"(a[b[i]]?a[b[i]]:b[i])
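This creates the uncompressed output: for each element of b, it appends a[b[i]] if that ID was seen in file1.txt, and the original ID b[i] otherwise.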
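The lookup-with-fallback in this step can be sketched in Python as follows; the variable names mirror the awk ones where possible, and the sample values are taken from the question. Unlike awk's `for (i in b)`, whose traversal order is unspecified, the Python version keeps the fields in their original order.

```python
a = {"ID100000002165": "ITEM1;ITEM2;ITEM3;ITEM9"}
second_field = "ID100000002165;ID100000398119"

# Look each ID up in `a`; fall back to the ID itself when it is unknown.
uncompressed = ";".join(a.get(token, token) for token in second_field.split(";"))
print(uncompressed)   # ITEM1;ITEM2;ITEM3;ITEM9;ID100000398119
```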
split(substr(c,2),d,/;/); for(i in d)e[d[i]]=1
This creates an associative array e whose keys are each of the fields in the uncompressed output. Because array keys are unique, duplicate fields collapse into a single key.
c=""
This resets c to an empty string before we add the compressed output to it.
for (i in e)c=c";"i
For each key in array e, we add the key to string c. This creates the compressed output.
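The same keys-as-a-set trick can be sketched in Python with `dict.fromkeys`, which keeps first-seen order. The second half is an assumption about the ordering intended in the question's processed_output.txt (ITEM values first, then leftover IDs, each sorted by number); `order_key` is a hypothetical helper, not something from the answers.

```python
import re

uncompressed = "ITEM1;ITEM3;ITEM1;ITEM2;ITEM1;ITEM4;ID100000394921"

# Keys of a dict are unique, so duplicates disappear; insertion order is kept.
e = dict.fromkeys(uncompressed.split(";"))
compressed = ";".join(e)
print(compressed)                 # ITEM1;ITEM3;ITEM2;ITEM4;ID100000394921


def order_key(token):
    # Assumed ordering: ITEM* values first, then IDs, each by their number.
    number = int(re.sub(r"\D", "", token) or 0)
    return (token.startswith("ID"), number)


ordered = ";".join(sorted(e, key=order_key))
print(ordered)                    # ITEM1;ITEM2;ITEM3;ITEM4;ID100000394921
```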
print $1,substr(c,2)
This prints the complete compressed line; substr(c,2) drops the leading semicolon.
Reasonably short awk way
awk 'FNR==NR{a[$1]=$2;next}
{for(i in a)gsub(i,a[i])
x=split($2,b,";")
for(i=1;i<=x;i++)y!~b[i]";"&&y=y?y";"b[i]:b[i];$2=y;y=""}1' file1.txt file2.txt
ID200000000419 ITEM1;ITEM3;ITEM2;ITEM4;ID100000394921
ID200000000938 ITEM5;ITEM2;ITEM3;ITEM1;ITEM7;ID100000363117;ITEM8
ID200000001036 ITEM1;ITEM2;ITEM3;ITEM9;ID100000398119
FNR==NR{a[$1]=$2;next}
When the file record number (FNR) equals the total record number (NR), which is effectively only while reading the first file, assign the second field to an array using the first field as a key. next means skip all further instructions and go to the next record.
for(i in a)gsub(i,a[i])
Now we are in the second file, since FNR no longer equals NR.
For each element in the array, gsub swaps everything that matches the key with what is contained in the array.
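The loop above calls gsub once per key of a for every line of file2, and gsub without a third argument works on the whole record, so the first column is searched too. As a hedged alternative sketch in Python, one regex pass with a dictionary lookup does the same substitution. It assumes every ID has the form "ID" followed by digits, as in the samples; `id_pattern` and `lookup` are names invented for this example.

```python
import re

a = {"ID100000363115": "ITEM5;ITEM2;ITEM3", "ID100000363116": "ITEM1;ITEM7"}
line = "ID200000000938 ID100000363115;ID100000363116;ID100000363117"

# Assumption: every ID looks like "ID" followed by digits, as in the samples.
id_pattern = re.compile(r"ID\d+")


def lookup(match):
    """Return the mapped value for a matched ID, or the ID itself if unknown."""
    token = match.group(0)
    return a.get(token, token)


replaced = id_pattern.sub(lookup, line)
print(replaced)
# ID200000000938 ITEM5;ITEM2;ITEM3;ITEM1;ITEM7;ID100000363117
```

Matching whole IDs also means that a key can never replace just part of a longer ID.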
x=split($2,b,";")
Split the second field into array b, using ; as the separator.
Assign the size of the array to x.
for(i=1;i<=x;i++)
Loop from 1 to the size of the array.
y!~b[i]";"&&
If variable y already contains the split value in b[i] (followed by a semicolon), then don't continue.
y=y?y";"b[i]:b[i]
If y exists, add the value in b[i] to the end, or else just set y to b[i].
$2=y;y=""
Set the second field to the value in y (our new string) and reset y to nothing.
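One thing worth checking with real data: the test `y!~b[i]";"` looks for the value followed by a semicolon anywhere inside y, and the last value in y has no trailing semicolon yet. The Python sketch below imitates that test with a plain substring check (equivalent here because the sample tokens contain no regex metacharacters) and compares it with an exact-token comparison. Both function names are made up for illustration.

```python
def dedupe_by_substring(tokens):
    """Mimics the awk test: skip a token if 'token;' occurs anywhere in y."""
    y = ""
    for token in tokens:
        if (token + ";") not in y:
            y = y + ";" + token if y else token
    return y


def dedupe_exact(tokens):
    """Compares whole tokens, so only true duplicates are skipped."""
    return ";".join(dict.fromkeys(tokens))


sample = ["ITEM1", "ITEM1", "ITEM2"]
print(dedupe_by_substring(sample))   # ITEM1;ITEM1;ITEM2
print(dedupe_exact(sample))          # ITEM1;ITEM2
```

With the sample files from the question no two equal values end up adjacent, so both approaches give the same result there.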
https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html