
improve bash loop with awk split

The awk below, improved by @hek2mgl, runs, but it takes ~15 hours to complete. It is basically matching input files of 21 - 259 records against a file of 11,137,660 records. That is a lot, but hopefully it can be made faster. Maybe splitting $5 on the hyphen (AGRN-6|gc=75 into AGRN - 6|gc=75) could speed up the process; I am not sure if the code below is a start or not. Essentially, it uses the input files, of which there are 4, to search for matches in the large 11,000,000-record file. Thank you :).

input

AGRN
CCDC39 
CCDC40 
CFTR

file that is searched in

chr1    955543  955763  chr1:955543 AGRN-6|gc=75    1   0
chr1    955543  955763  chr1:955543 AGRN-6|gc=75    2   2
chr1    955543  955763  chr1:955543 AGRN-6|gc=75    3   2

output ($4, $5, average of $7)

chr1:955543 AGRN-6|gc=75 1.3

awk

BEGIN{FS="[\t| -]+"}

# Read search terms from file1 into 's'
FNR==NR {
s[$0=1]
next
}
{

# Check if $5 matches one of the search terms
for(i in s) {
    if($5 ~ i) {

# check for match
  if s[$5] exists 
  s[$5] {

        # Store first two fields for later usage
        a[$5]=$1
        b[$5]=$2

        # Add $9 to total of $9 per $5
        t[$5]+=$8
        # Increment count of occurences of $5
        c[$5]++

        next
    }
  }
  }
  END {

# Calculate average and print output for all search terms
# that has been found
for( i in t ) {
    avg = t[i] / c[i]
    printf "%s:%s\t%s\t%s\n", a[i], b[i], i, avg | "sort -k3,3n"
}
}

Simplify:

awk '
    NR == FNR {input[$0]; next}
    {
        split($5, a, "-")
        if (a[1] in input) {
            key = $4 OFS $5
            n[key]++
            sum[key] += $7
        }
    }
    END {
        for (key in n) 
            printf "%s %.1f\n", key, sum[key]/n[key]
    }
' input file
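To sanity-check the simplified script, it can be run against the sample data from the question; the file names input and file below are just the placeholders already used above:

```shell
# Recreate the sample gene list and a three-record slice of the large file.
cat > input <<'EOF'
AGRN
CCDC39
CCDC40
CFTR
EOF

cat > file <<'EOF'
chr1    955543  955763  chr1:955543 AGRN-6|gc=75    1   0
chr1    955543  955763  chr1:955543 AGRN-6|gc=75    2   2
chr1    955543  955763  chr1:955543 AGRN-6|gc=75    3   2
EOF

# One pass: load the genes, split $5 on "-", accumulate sum/count per $4+$5 key.
awk '
    NR == FNR {input[$0]; next}
    {
        split($5, a, "-")
        if (a[1] in input) {
            key = $4 OFS $5
            n[key]++
            sum[key] += $7
        }
    }
    END {
        for (key in n)
            printf "%s %.1f\n", key, sum[key]/n[key]
    }
' input file
# -> chr1:955543 AGRN-6|gc=75 1.3    ((0+2+2)/3, printed with %.1f)
```

Because the gene list is loaded into the associative array input, the 11-million-record file is read exactly once, instead of once per search term.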

Your code is broken because of the overuse of arrays, but mainly because of this:

FNR==NR {
s[$0=1]
# ^^^^^
next
}

Array s will only ever have a single key, the string "1", because for each line you assign 1 to $0, and the value of that assignment (1) is then used as the array index. You should write

s[$0] = 1
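A quick standalone way to see the difference (works with any POSIX awk):

```shell
# s[$0=1]: the assignment $0=1 evaluates to 1, so every line is stored
# under the single key "1" and the actual line content is lost.
printf 'AGRN\nCFTR\n' | awk '{ s[$0=1] } END { for (k in s) print k }'
# -> 1

# s[$0]: merely referencing s[$0] creates one key per input line.
printf 'AGRN\nCFTR\n' | awk '{ s[$0] } END { for (k in s) print k }' | sort
# -> AGRN
#    CFTR
```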

I'd be interested to hear what the speed of the following is. I'm not sure it will be much slower, since it doesn't require awk to do anything clumsy, but it still needs one pass over the search file per input term. If you want to optimize it, I think you need to use associative arrays and hash each input term into its own array; that way the matching can be done in one pass over the file. You still face the same number of potential comparisons per line, though, so unless you can skip further searching after the first match you may only be slightly quicker.

Input file: select.txt
Search file: search_file.txt

while IFS= read -r a; do
    awk "BEGIN {cnt=0; var=0}
         { if (\$5 ~ \"${a}\") { var=var+\$7; field4=\$4; cnt+=1; field5=\$5 } }
         END {if (cnt) print field4\" \"field5\" \"var/cnt}" search_file.txt
done < select.txt
