Detect if a series of numbers is sequential in bash/awk

So I have a series of scripts that generate intermediary text files along the way as a means of storing information across different scripts. Essentially the scripts detect rows within data that have been approved by the user for removal. The line numbers that are to be removed from the source file are stored in a file.

For example, say I have a source data file like this:

    a1,b1,c1,d1
    a2,b2,c2,d2
    a3,b3,c3,d3
    a4,b4,c4,d4
    a5,b5,c5,d5
    a6,b6,c6,d6
    a7,b7,c7,d7

And the intermediary file would contain something like this:

    1 3 4 5 6

Which would result, when the script is run, in an output data file as follows:

    a2,b2,c2,d2
    a7,b7,c7,d7

This all works fine; there is nothing to fix in this code. The problem is that with real data files there are sometimes literally thousands of line numbers stored in the intermediary file for removal. This means I can't use a loop, because it would take a massive amount of time, and my current method of using sed fails with an error: too many arguments. Many of the line numbers are consecutive, which brings me to my question:

Is there a way in bash or awk to detect whether a series of space-separated numbers are consecutive?

I can sort out everything beyond that; I'm just stumped on how to do this in one step or a short series of steps. My plan, if I can detect consecutive values, is to change the intermediary file from:

    1 3 4 5 6

To:

    1 3-6

And then I'll be able to write code that will run on each range of values in a more manageable way.

If possible I'd like to avoid looping through each value and checking individually whether or not it's one step above the previous value, since I'm dealing with tens of thousands of numbers in a list.

If this is not possible in bash/awk, is there another way to accomplish this task to reduce the overall number of arguments passed to my script and greatly reduce the chances of encountering an error for too many arguments?

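What about this? It assumes intermediate.txt holds the numbers on a single line, in ascending order:

    BEGIN {
        # Read the list of line numbers to skip into an array.
        getline < "intermediate.txt"
        split($0, skippedlines, " ")
        skipindex = 1
    }
    {
        # Skip the line if it is the next listed number, otherwise print it.
        if (skippedlines[skipindex] == NR)
            ++skipindex
        else
            print
    }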
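If the numbers may arrive unsorted or spread over several lines, a lookup table is one alternative. The following is only a minimal sketch: the file names source.txt and intermediate.txt and the sample data are made up for illustration, and it assumes the list of numbers is not empty.

```shell
#!/bin/sh
# Sketch: delete listed line numbers with an awk lookup table.
# File names and sample data are illustrative assumptions.
dir=$(mktemp -d)
printf '%s\n' a1,b1,c1,d1 a2,b2,c2,d2 a3,b3,c3,d3 a4,b4,c4,d4 \
              a5,b5,c5,d5 a6,b6,c6,d6 a7,b7,c7,d7 > "$dir/source.txt"
# Unsorted and split over two lines on purpose.
printf '6 4 1\n3 5\n' > "$dir/intermediate.txt"

kept=$(awk '
    # First file: remember every number as an array key.
    NR == FNR { for (i = 1; i <= NF; i++) skip[$i]; next }
    # Second file: print only lines whose number is not a key.
    !(FNR in skip)
' "$dir/intermediate.txt" "$dir/source.txt")

printf '%s\n' "$kept"
rm -rf "$dir"
```

Because the numbers are passed in a file rather than on the command line, the length of the list does not run into argument limits.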

Use cat, join, and cut.

The files infile (left column) and ids (right column):

    a1,b1,c1,d1         1
    a2,b2,c2,d2         3
    a3,b3,c3,d3         4
    a4,b4,c4,d4         5
    a5,b5,c5,d5         6
    a6,b6,c6,d6
    a7,b7,c7,d7

Removal of selected lines:

    $ join -v 2 ids <(cat -n infile) | cut -f 2 -d ' '
    a2,b2,c2,d2
    a7,b7,c7,d7

What's going on:

  • First, each line of the initial file receives an id, with cat -n infile;
  • then, the resulting file is joined on the first column with the file holding the ids;
  • only non-matching lines from the second file are printed -- join -v 2;
  • the first column, with the ids, is removed;
  • and, it's a neat shell one-liner (:
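One caveat worth keeping in mind: join expects both inputs to be sorted on the join field as text, and line numbers stop being in that order once a file reaches ten lines (10 sorts before 2). Below is a hedged sketch of a variant that sorts both sides first and restores the original order afterwards. The 12-line sample and the file name ids.sorted are made up for illustration, and it assumes the data lines contain no runs of spaces that would need to be preserved.

```shell
#!/bin/sh
# Sketch: the same join idea, made safe for files with ten or more lines.
# Sample data and temp file names are illustrative assumptions.
LC_ALL=C; export LC_ALL          # sort and join must agree on collation
dir=$(mktemp -d)
for n in 1 2 3 4 5 6 7 8 9 10 11 12; do echo "a$n,b$n"; done > "$dir/infile"
printf '1 3 4 5 6 10\n' > "$dir/ids"

# One id per line, sorted as text (the order join expects).
tr -s ' ' '\n' < "$dir/ids" | sort > "$dir/ids.sorted"

# Number the lines, sort them as text, keep the ones whose number is not
# listed, put them back in numeric order, then drop the number column.
kept=$(awk '{ print NR, $0 }' "$dir/infile" |
    sort -k 1,1 |
    join -v 2 "$dir/ids.sorted" - |
    sort -n |
    cut -d ' ' -f 2-)

printf '%s\n' "$kept"
rm -rf "$dir"
```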

In case your file with ids is written as a single line, you can still use the above one-liner by first translating the spaces in the ids file into newlines. Note that tr reads standard input, so the file has to be redirected into it:

    $ join -v 2 <(tr ' ' '\n' < ids) <(cat -n infile) | cut -f 2 -d ' '

@jmihalicza's answer nicely uses awk to solve the whole problem of selecting the lines from source file that match those in the intermediate file. For completeness, the following awk program reduces the list of individual line numbers to ranges, where possible, which I think answers the original question:

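    # Collect every number from the input into lin[0..i-1].
    {
        for (j = 1; j <= NF; j++) {
            lin[i++] = $j
        }
    }

    END {
        start = lin[0]
        j = 1
        while (j <= i) {
            end = start
            # Extend the current range while the next number is consecutive.
            while (lin[j] == (lin[j-1] + 1)) {
                end = lin[j++]
            }
            if ((end + 0) > (start + 0)) {
                printf "%d-%d ", start, end
            } else {
                printf "%d ", start
            }
            start = lin[j++]
        }
        print ""    # finish the output line
    }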

Given this script, which I've called merge.awk, and a file testlin.txt as follows:

    1 3 4 5 6 9 10 11 13 15

... we can do this:

    $ awk -f merge.awk <testlin.txt
    1 3-6 9-11 13 15
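A merged list like this can also drive the removal directly. What follows is a minimal sketch rather than a definitive implementation: it assumes the ranges are ascending and non-overlapping, as in the output above, and the file names ranges.txt and source.txt and the 16-line sample are made up for illustration.

```shell
#!/bin/sh
# Sketch: remove the lines named by a merged list such as "1 3-6 9-11 13 15".
# File names and sample data are illustrative assumptions.
dir=$(mktemp -d)
for n in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16; do
    echo "row$n"
done > "$dir/source.txt"
printf '1 3-6 9-11 13 15\n' > "$dir/ranges.txt"

kept=$(awk '
    BEGIN { cur = 1 }
    # First file: split each token into the bounds of a range.
    NR == FNR {
        for (i = 1; i <= NF; i++) {
            n = split($i, part, "-")
            lo[++count] = part[1] + 0
            hi[count] = (n > 1 ? part[2] : part[1]) + 0
        }
        next
    }
    # Second file: move past ranges that end before this line,
    # then print the line unless it falls inside the current range.
    {
        while (cur <= count && FNR > hi[cur]) cur++
        if (cur > count || FNR < lo[cur]) print
    }
' "$dir/ranges.txt" "$dir/source.txt")

printf '%s\n' "$kept"
rm -rf "$dir"
```

Each source line is compared against at most one range at a time, so the work grows with the number of lines plus the number of ranges.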

This might work for you (GNU sed):

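    sed -r 's/\S+/&d/g;s/\s+/\n/g' intermediate_file | sed -f - source_file

This turns the intermediate file into a sed script: the first sed appends d to every number and puts each one on its own line (1d, 3d, ...), and the second sed runs that script against the source file.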
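sed also accepts address ranges such as 3,6d, so the same trick extends to the 1 3-6 notation from the question. The following is a sketch under stated assumptions, using only tr and basic sed substitutions; the sample data and the script name delete.sed are made up for illustration.

```shell
#!/bin/sh
# Sketch: build a sed script from a list that may contain ranges.
# File contents and the name delete.sed are illustrative assumptions.
dir=$(mktemp -d)
printf '%s\n' a1,b1,c1,d1 a2,b2,c2,d2 a3,b3,c3,d3 a4,b4,c4,d4 \
              a5,b5,c5,d5 a6,b6,c6,d6 a7,b7,c7,d7 > "$dir/source_file"
printf '1 3-6\n' > "$dir/intermediate_file"

# "1 3-6" becomes the two commands "1d" and "3,6d", one per line.
tr -s ' ' '\n' < "$dir/intermediate_file" |
    sed -e '/^$/d' -e 's/-/,/' -e 's/$/d/' > "$dir/delete.sed"

# Run the generated script against the source file.
kept=$(sed -f "$dir/delete.sed" "$dir/source_file")

printf '%s\n' "$kept"
rm -rf "$dir"
```

Writing the commands to a script file keeps them off the command line, which is what avoids the too many arguments error.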
