Compare columns in two different files and produce three outputs

Question

I have multiple paired files named xxx_1.txt and xxx_2.txt, yyy_1.txt and yyy_2.txt, and so on. They are single-column files with the following format:

xxx_1.txt:

#CHROM_POSREFALT
MSHR1153_annotated_1_9107CA
MSHR1153_annotated_1_9197CT
MSHR1153_annotated_1_9303TC
MSHR1153_annotated_1_10635GA
MSHR1153_annotated_1_10836AG
MSHR1153_annotated_1_11108AG
MSHR1153_annotated_1_11121GA
MSHR1153_annotated_1_11123CT
MSHR1153_annotated_1_11131CT
MSHR1153_annotated_1_11155AG
MSHR1153_annotated_1_11166CT
MSHR1153_annotated_1_11186TC
MSHR1153_annotated_1_11233TG
MSHR1153_annotated_1_11274GT
MSHR1153_annotated_1_11472CG
MSHR1153_annotated_1_11814GA
MSHR1153_annotated_1_11815CT

xxx_2.txt:

LocationMSHR1153_annotatedMSHR0491_Australasia
MSHR1153_annotated_1_56TC
MSHR1153_annotated_1_226AG
MSHR1153_annotated_1_670AG
MSHR1153_annotated_1_817CT
MSHR1153_annotated_1_1147TC
MSHR1153_annotated_1_1660TC
MSHR1153_annotated_1_2488AG
MSHR1153_annotated_1_2571GA
MSHR1153_annotated_1_2572TC
MSHR1153_annotated_1_2698TC
MSHR1153_annotated_1_2718TG
MSHR1153_annotated_1_3018TC
MSHR1153_annotated_1_3424TC
MSHR1153_annotated_1_3912CT
MSHR1153_annotated_1_4013GA
MSHR1153_annotated_1_4087GC
MSHR1153_annotated_1_4878CT
MSHR1153_annotated_1_5896GA
MSHR1153_annotated_1_7833TG
MSHR1153_annotated_1_7941CT
MSHR1153_annotated_1_8033GA
MSHR1153_annotated_1_8888AC
MSHR1153_annotated_1_9107CA
MSHR1153_annotated_1_9197CT

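They are actually much longer than this. My goal is to compare the two files line by line and produce several outputs, so that I can create Venn diagrams later. So I need one file that lists all the lines the two files have in common, which would look like this (in this case there is only one):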

MSHR1153_annotated_1_9107CA

I also need one file that lists everything specific to xxx_1, and one file that lists everything specific to xxx_2.

So far I have come up with this:

awk ' FNR==NR { position[$1]=$1; next} {if ( $1 in position ) {print $1 > "foundinboth"} else {print $1 > "uniquetofile1"}} ' FILE2 FILE1

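The problem is that I have more than 300 paired files to run through, and with this command I would have to change the file names by hand every time. It also does not generate all of the output files at once. Is there a way to loop over the files and change everything automatically? The files are paired, so the only difference is the suffix at the end, "_1" versus "_2". I need it to go through every pair and produce everything I need in one go.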

Answer 1

Would you please try the following:

for f in *_1.txt; do                    # find files such as "xxx_1.txt"
    basename=${f%_*}                    # extract "xxx" portion
    if [[ -f ${basename}_2.txt ]]; then # make sure "xxx_2.txt" exists
        file1="${basename}_1.txt"       # assign bash variable file1
        file2="${basename}_2.txt"       # assign bash variable file2
        both="${basename}_foundinboth.txt"
        uniq1="${basename}_uniquetofile1.txt"
        uniq2="${basename}_uniquetofile2.txt"

        awk -v both="$both" -v uniq1="$uniq1" -v uniq2="$uniq2" '
                                        # pass the variables to AWK with -v option
            FNR==NR { b[$1]=$1; next }
            {
                if ($1 in b) {
                    print $1 > both
                    seen[$1]++          # mark if the line is found in file1
                } else {
                    print $1 > uniq1
                }
            }
            END {
                for (i in b) {
                    if (! seen[i]) {    # the line is not found in file1
                        print i > uniq2 # then it is unique to file2
                    }
                }
        }' "$file2" "$file1"
    fi
done
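
The pairing in the loop above relies on the ${f%_*} parameter expansion, which removes the shortest trailing piece that starts with an underscore. A minimal sketch of just that step follows; partner_of and the sample file names are hypothetical and only serve the illustration:

```shell
# Sketch: derive the name of the "_2" partner from a "_1" file name.
# partner_of is a hypothetical helper, not part of the answer above.
partner_of() {
    prefix=${1%_*}                  # "xxx_1.txt" -> "xxx" (shortest "_..." suffix removed)
    printf '%s_2.txt\n' "$prefix"   # rebuild the partner file name
}

partner_of xxx_1.txt          # prints xxx_2.txt
partner_of sample_A_1.txt     # prints sample_A_2.txt
```

Because only the shortest suffix is removed, underscores earlier in the name are left alone, so names such as sample_A_1.txt still pair up correctly.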

Note that the lines in *_uniquetofile2.txt do not keep their original order. If you need them in order, try sorting them yourself or let me know.
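
If the original order matters, one possible variation is to remember the order in which the lines of file2 were first seen and replay it in the END block, instead of iterating with for (i in b). The sketch below is only one way to do this, under a few assumptions: compare_pair is a hypothetical helper name, the output names follow the answer above, and header lines are treated like any other line, as in the original script.

```shell
# Sketch: same three outputs as above, but every output keeps input order.
# compare_pair FILE1 FILE2 PREFIX   (hypothetical helper name)
compare_pair() {
    file1="$1"; file2="$2"; prefix="$3"
    both="${prefix}_foundinboth.txt"
    uniq1="${prefix}_uniquetofile1.txt"
    uniq2="${prefix}_uniquetofile2.txt"

    # Create empty outputs first so results of an earlier run never linger.
    : > "$both"; : > "$uniq1"; : > "$uniq2"

    awk -v both="$both" -v uniq1="$uniq1" -v uniq2="$uniq2" '
        FNR == NR {                             # first file read is file2
            if (!($1 in b)) order[++n] = $1     # remember first-seen order
            b[$1] = 1
            next
        }
        ($1 in b) { print $1 > both; seen[$1] = 1; next }   # line is in both files
        { print $1 > uniq1 }                                # line is only in file1
        END {
            for (k = 1; k <= n; k++)            # replay file2 in its own order
                if (!(order[k] in seen)) print order[k] > uniq2
        }
    ' "$file2" "$file1"
}
```

Inside the loop from the answer it could be called as compare_pair "$file1" "$file2" "$basename". The extra order array costs some memory for very long files, which is the trade-off for not having to sort afterwards.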

Answer 1: 3, accepted, 2020-10-06 02:27:58
