awk: merging two files with 2 columns based on string comparison

I am a beginner, and my work is becoming difficult for me. Here is my problem: I have two tables, File1 and File2 (the reference table).

File1
num, Name
1, 1_1_busteni
13, 23_Doicesti
40, 2_AR_Moreni
47, 2_AR_Moreni_SUD
55, Petrolul_Romanesc
62, castor

File2
ID_ref, Name_ref
R_001,  BUSTENI
R_002,  DOICESTI-23
R_003,  MORENI
R_004,  MORENI-SUD
R_005,  ROMANESC
R_006,  CASTOR

File3 (desired output)
num, Name,ID_ref,Name_ref
1, 1_1_busteni, R_001, BUSTENI
13, 23_Doicesti, R_002, DOICESTI-23
40, 2_AR_Moreni, R_003, MORENI
47, 2_AR_Moreni_SUD, R_004, MORENI-SUD
55, Petrolul_Romanesc, R_005, ROMANESC
62, castor, R_006, CASTOR

I don't have any identical column, but there is some similarity between the second column ($2) of File1 and the second column ($2) of File2. File1 comes from users and we want to standardize everything, so I have a lot of different cases. I don't know how to start. My idea was to remove all the "_" in my first file and all the "-" in my second file, and then compare them. I managed to do that with

awk 'BEGIN {FS=OFS=","} {gsub(/_/,"",$2)}1' file1.txt

and

awk 'BEGIN {FS=OFS=","} {gsub(/-/,"",$2)}1' file2.txt

separately, but I don't know how to combine and compare my two files.

I also know that I have to think about lowercase. A nice guy gave me the code below. It works for CASTOR, but how can I combine it with my gsub?

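$ awk '
BEGIN { FS=OFS="," }
NR==FNR {                                # first file on the command line (file2): reference table
    a[tolower($2)]=$0                    # index each reference row by its lowercased name
    next
}
{                                        # second file on the command line (file1)
    split($2,b,"[^[:alpha:]]")           # split the name on every non-letter character
    print $0 (tolower(b[1]) in a?OFS a[tolower(b[1])]:"")  # append the reference row if the first piece is a key
}' file2 file1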

Maybe there is a better way; I am open to suggestions!

Here is one shot at it in awk:

$ awk 'BEGIN { FS=", *"; OFS="," }
NR==FNR {
    a[tolower($2)]=$0
    next
}
{
    for(i in a)               # for every city in file2
        if(tolower($2)~i) {   # compare it to a record from file1
            print $0,a[i]     # print it if there is a match
            next
        }
}1' file2 file1
num, Name
1, 1_1_busteni,R_001,  BUSTENI
13, 23_Doicesti
40, 2_AR_Moreni,R_003,  MORENI
47, 2_AR_Moreni_SUD,R_003,  MORENI
55, Petrolul_Romanesc,R_005,  ROMANESC
62, castor,R_006,  CASTOR

Doing any better than that would require rules for processing the underscores and dashes in the names, or approximate pattern matching with appropriate algorithms (see, for example, Levenshtein distance).
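
As one possible example of such a rule, here is a minimal sketch, not a definitive solution. In the output above, 23_Doicesti stays unmatched because the pattern doicesti-23 never occurs in 23_doicesti, and 2_AR_Moreni_SUD is paired with MORENI because moreni-sud (with a dash) does not occur in 2_ar_moreni_sud. The sketch assumes that a reference name should match when every one of its alphanumeric tokens appears among the tokens of the user name, in any order, and that the reference with the most tokens should win. The file names file1.txt, file2.txt and file3.txt and the array names ref, has, t and r are made up for the illustration; the sample data is copied from the question.

```shell
#!/bin/sh
# Sample data copied from the question.
cat > file1.txt <<'EOF'
num, Name
1, 1_1_busteni
13, 23_Doicesti
40, 2_AR_Moreni
47, 2_AR_Moreni_SUD
55, Petrolul_Romanesc
62, castor
EOF
cat > file2.txt <<'EOF'
ID_ref, Name_ref
R_001,  BUSTENI
R_002,  DOICESTI-23
R_003,  MORENI
R_004,  MORENI-SUD
R_005,  ROMANESC
R_006,  CASTOR
EOF

awk '
BEGIN { FS = ", *"; OFS = ", " }
FNR == 1 {                                # header line of either file
    if (NR == FNR) { refhdr = $1 OFS $2 } # file2 header: remember it
    else           { print $1, $2, refhdr } # file1 header: print the merged header
    next
}
NR == FNR {                               # file2: store each reference row
    ref[tolower($2)] = $1 OFS $2          # key is the lowercased reference name
    next
}
{                                         # file1: look for the best reference
    split("", has)                        # empty the token set from the previous row
    n = split(tolower($2), t, "[^a-z0-9]+")
    for (i = 1; i <= n; i++)
        if (t[i] != "") has[t[i]] = 1     # tokens of the user name

    best = ""; bestn = 0
    for (key in ref) {
        m = split(key, r, "[^a-z0-9]+")
        want = 0; hits = 0
        for (j = 1; j <= m; j++) {
            if (r[j] == "") continue      # skip empty pieces
            want++
            if (r[j] in has) hits++
        }
        # keep the reference only if all of its tokens were found
        # and it has more tokens than the best candidate so far
        if (want > 0 && hits == want && want > bestn) {
            best = key; bestn = want
        }
    }
    line = $1 OFS $2
    if (best != "") line = line OFS ref[best]
    print line
}' file2.txt file1.txt > file3.txt

cat file3.txt
```

With the sample data this pairs 23_Doicesti with R_002 and 2_AR_Moreni_SUD with R_004, and the fields are joined with ", ", so the spacing differs slightly from File3 in the question. Two references that match with the same number of tokens are not disambiguated (the first one met in the for-in loop wins, and that order is unspecified), and misspelled names would still need approximate matching.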
