compare the length of multiple files using awk or sed

I want to compare the number of lines in each file and, for each group, choose the one that contains the maximum number of lines. For example, given:

filename_V1 |wc -l =100
filename_V2 |wc -l =19
filename_V3  |wc -l =10

myFile_V1  |wc -l =1
myFile_V2  |wc -l =10
myFile_V3  |wc -l =15

I should get as a result:

filename_V1
myFile_V3

Here's one approach that considers files grouped by their base part (what comes before the _Vn) and prints the one with the most lines from each group.

EDIT: Just to point out that the awk script is not suitable if some file names include whitespace (it assumes the second field in the wc output is the entire file name); see the whitespace-tolerant variant sketched after the demo below.

$ cat bf.awk
# For each wc output line whose name looks like <base>_Vn, remember
# the file with the highest line count per base.
$2 ~ /_V[0-9]+/ {
    lines = $1;
    file = $2;
    base = file;
    sub("_.*", "", base);    # base = everything before the first "_"
    if (lines > max[base]) {
        max[base] = lines;
        best[base] = file;
    }
}

END { for (base in best) print best[base] }


$ wc -l *_V*
       3 a_V1
       1 a_V2
       4 a_V3
       4 b_V1
       3 b_V2
       1 b_V3
       2 b_V4
      18 total


$ wc -l *_V* | awk -f bf.awk
a_V3
b_V1
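If file names may contain whitespace, here's a minimal sketch of a variant (not part of the original answer; bf_spaces.awk is a hypothetical name) that rebuilds the name from the whole line instead of trusting $2, and splits the base at _Vn so underscores inside base names survive too:

$ cat bf_spaces.awk
{
    lines = $1 + 0;
    file = $0;
    sub(/^[ \t]*[0-9]+[ \t]/, "", file);   # drop the leading count, keep the full name
    if (file !~ /_V[0-9]+/) next;          # also skips the "total" line
    base = file;
    sub(/_V[0-9]+.*$/, "", base);          # base = everything before _Vn
    if (!(base in best) || lines > max[base]) {
        max[base] = lines;
        best[base] = file;
    }
}

END { for (base in best) print best[base] }

$ wc -l *_V* | awk -f bf_spaces.awk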
wc -l filename_V1 filename_V2 filename_V3 myFile_V1 myFile_V2 myFile_V3 | \
sort -rg

"count lines for each file " | "sort them by number" | "inverse order"

It also prints the total (which sorts first, since it's the sum and therefore the largest number), but this should be enough for you.
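For example, with the line counts from the question, the output would look something like this (the order of ties may vary):

$ wc -l filename_V1 filename_V2 filename_V3 myFile_V1 myFile_V2 myFile_V3 | sort -rg
  155 total
  100 filename_V1
   19 filename_V2
   15 myFile_V3
   10 filename_V3
   10 myFile_V2
    1 myFile_V1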

This is a one-liner; I used the \ line-continuation character to make it more readable.

Some variations:

command_that_spits_files | xargs wc -l | sort -rg
find . \( -name 'filename_V[0-9]' -o -name 'filename_V[0-9][0-9]' \) | xargs -L1 wc -l | sort -rg | cut -d ' ' -f 2

(Note that -name matches shell globs, not regexes, so a one- or two-digit version number is spelled as two patterns.)

You can use awk to find the max if you don't want sort, since sort also reports the number of lines, not just the name of the longest file (by line count).

wc -l filename_v1 filename_v2 filename_v3 | awk '$2 != "total" {if ($1 > max_val) {max_val=$1; max_file=$2}} END {print max_file}'

So we do a wc -l to get the line counts for whatever set of files we're interested in. Then, in awk, we keep track of the biggest count seen so far by looking at the first field and storing it, and at the end we print just the filename associated with the maximum number of lines we saw.

And for good measure we don't count the "total" line.

The safest way to find all the files and do this would be to do (with GNU wc):

find -type f -name '*_V*' -print0 | wc -l --files0-from=- | awk '$2 != "total" {if ($1 > max_val) {max_val=$1; max_file=$2}} END {print max_file}'

or without GNU's wc:

find -type f -name '*_V*' -print0 | xargs -0 wc -l | awk '$2 != "total" {if ($1 > max_val) {max_val=$1; max_file=$2}} END {print max_file}'

Use the appropriate file glob for -name in find. Also, if you don't want to descend into subdirectories, add -maxdepth 1.
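For example, restricted to the current directory only (a sketch based on the GNU wc form above):

find . -maxdepth 1 -type f -name '*_V*' -print0 | wc -l --files0-from=- | awk '$2 != "total" {if ($1 > max_val) {max_val=$1; max_file=$2}} END {print max_file}'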

LARGESTFILE=
MAXLINECOUNT=0
for file in *; do
    CURRENTCOUNT=$(wc -l < "$file")
    if [ "$CURRENTCOUNT" -gt "$MAXLINECOUNT" ]; then
        LARGESTFILE=$file
        MAXLINECOUNT=$CURRENTCOUNT
    fi
done
echo "$LARGESTFILE"

Another "one-liner":

# generate a tab-separated table: name|basename|lines
for f in *_V[0-9]*; do printf "$f\t${f%V*}\t%d\n" $(wc -l < "$f"); done |
    sort -t$'\t' -k3rn    |  # sort rows by line count, descending
    sort -t$'\t' -u -k2,2 |  # take the first row for each unique basename
    cut -f1                  # take the name column

This assumes there are no tab or newline characters in your file names, and it uses a number of bashisms ($'...' quoting, string manipulation). Unlike most answers here, it doesn't choke on spaces or multiple occurrences of _ in a file name.


More efficient version

The original was robust, but generating the table invoked printf and wc N times. This one's a bit uglier, but much faster (200x on my machine):

# table is now basename|lines|name
printf "%s\n" *_V[0-9]*        |  # print every file on a new line
    rev | cut -d_ -f2- | rev   |  # extract the base name (faster than sed)
    paste - <(wc -l *_V[0-9]*  |
        sed 's/^ *//;s/ /\t/;$d') |  # paste base names beside tabulated wc output
    sort -t$'\t' -k2rn,2       |  # sort as above
    sort -t$'\t' -u -k1,1      |
    cut -f3

Alternative with lots of piping

wc -l *_V* |               # generate the list
sed 's/_V/ _V/;$d' |       # separate base name from version, delete the total line
sort -k 2,2 -k 1,1nr |     # sort by name, then by line count (descending)
sort -k 2,2 -u |           # take the first entry per name (the max, by construction)
sed 's/ _V/_V/' |          # stitch the base name back to the original
awk '{print $2}'           # extract the filename

This script assumes the file names are under your control and that neither spaces nor _V appear in the base names. Otherwise, check out @Qualia's version.

With GNU awk for BEGINFILE:

awk '
BEGINFILE { base=FILENAME; sub(/_[^_]+$/,"",base); if (!(base in fname)) fname[base]=FILENAME }
FNR > max[base] { max[base]=FNR; fname[base]=FILENAME }
END { for (base in fname) print fname[base] }
' *

You can approximate that in a non-GNU awk with FNR==1 instead of BEGINFILE, but then you'd need extra code to handle the case where all the files for a given base are empty (if that's possible), since FNR==1 never fires for an empty file.
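A minimal sketch of that approximation (with the caveat just mentioned: an all-empty group never triggers either rule, so it silently drops out of the output):

awk '
FNR == 1 { base=FILENAME; sub(/_[^_]+$/,"",base) }
FNR > max[base] { max[base]=FNR; fname[base]=FILENAME }
END { for (base in fname) print fname[base] }
' *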
