I want to compare the number of lines in each file and choose, for each group, the one that contains the maximum number of lines. For example, given:
filename_V1 |wc -l =100
filename_V2 |wc -l =19
filename_V3 |wc -l =10
myFile_V1 |wc -l =1
myFile_V2 |wc -l =10
myFile_V3 |wc -l =15
the result should be:
filename_V1
myFile_V3
Here's one that considers files grouped by their base parts (what comes before the _Vn) and prints the one with the most lines from each group.
EDIT: Note that the awk script is not suitable if some file names include whitespace (it assumes the second field in the wc output is the entire file name).
$ cat bf.awk
$2 ~ /_V[0-9]+/ {
    lines = $1;
    file = $2;
    base = file;
    sub("_.*", "", base);
    if (lines > max[base]) {
        max[base] = lines;
        best[base] = file;
    }
}
END { for (base in best) print best[base] }
$ wc -l *_V*
3 a_V1
1 a_V2
4 a_V3
4 b_V1
3 b_V2
1 b_V3
2 b_V4
18 total
$ wc -l *_V* | awk -f bf.awk
a_V3
b_V1
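If file names may contain whitespace, a variant of the same idea can rebuild the name from the whole wc output line instead of trusting $2. A minimal sketch under that assumption (the sample files and the grouping on the trailing _Vn suffix are invented here for illustration):

```shell
# Sketch: same grouping logic, but strip the leading count from the whole
# line so file names containing spaces survive intact.
cd "$(mktemp -d)"
printf 'a\nb\n' > 'my file_V1'   # 2 lines
printf 'a\n'    > 'my file_V2'   # 1 line
wc -l *_V* | awk '
    /_V[0-9]+/ {
        lines = $1
        file = $0
        sub(/^ *[0-9]+ /, "", file)   # drop the count, keep the full name
        base = file
        sub(/_V[0-9]+$/, "", base)    # group on everything before _Vn
        if (lines > max[base]) { max[base] = lines; best[base] = file }
    }
    END { for (base in best) print best[base] }
'
# prints: my file_V1
```

The /_V[0-9]+/ guard also skips the "total" line, since it contains no version suffix.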
wc -l filename_V1 filename_V2 filename_V3 myFile_V1 myFile_V2 myFile_V3 | \
sort -rg
"count lines for each file " |
"sort them by number" |
"inverse order"
It also prints the total (which will sort first, being the largest number), but this should be enough for you.
This is a one-liner; I used the \ line continuation to make it clearer.
Variations (note that -name takes a shell glob, not a regex):
command_that_spits_files | xargs wc -l | sort -rg
find . -name 'filename_V[0-9]*' | xargs -L1 wc -l | sort -rg | cut -d ' ' -f 2
You can use awk to find the max if you don't want sort, since sort will also report the number of lines, not just the name of the longest file (by line count).
wc -l filename_V1 filename_V2 filename_V3 | awk '$2 != "total" {if ($1 > max_val) {max_val = $1; max_file = $2}} END {print max_file}'
So we do a wc -l to get the line counts of whatever set of files we're interested in; then in awk we track the biggest count seen so far (the first field), and at the end print just the file name associated with it. For good measure, we skip the "total" line.
The safest way to find all the files and do this would be to do (with GNU wc):
find -type f -name '*_V*' -print0 | wc -l --files0-from=- | awk '$2 != "total" {if ($1 > max_val) {max_val = $1; max_file = $2}} END {print max_file}'
or without GNU's wc:
find -type f -name '*_V*' -print0 | xargs -0 wc -l | awk '$2 != "total" {if ($1 > max_val) {max_val = $1; max_file = $2}} END {print max_file}'
and use the appropriate file glob for -name in find. Also, if you don't want to descend into subdirectories, add -maxdepth 1.
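For instance, restricting the search to the current directory could look like this. A small sketch; the sample files and directory layout are invented for illustration:

```shell
# Hypothetical demo: with -maxdepth 1, only files in the top-level
# directory are considered, so sub/filename_V9 is ignored.
cd "$(mktemp -d)" && mkdir sub
printf 'a\nb\nc\n'    > filename_V1       # 3 lines
printf 'a\n'          > filename_V2       # 1 line
printf 'a\nb\nc\nd\n' > sub/filename_V9   # 4 lines, but in a subdirectory
find . -maxdepth 1 -type f -name '*_V[0-9]*' -print0 |
    xargs -0 wc -l |
    awk '$2 != "total" {if ($1 > max_val) {max_val = $1; max_file = $2}} END {print max_file}'
# prints: ./filename_V1
```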
LARGESTFILE=
MAXLINECOUNT=0
for file in *; do
    CURRENTCOUNT=$(wc -l < "$file")
    if [ "$CURRENTCOUNT" -gt "$MAXLINECOUNT" ]; then
        LARGESTFILE=$file
        MAXLINECOUNT=$CURRENTCOUNT
    fi
done
echo "$LARGESTFILE"
Another "one liner":
# generate a tab seperated table: name|basename|lines
for f in *_V[0-9]*;do printf "$f\t${f%V*}\t%d\n" $(wc -l < "$f");done |\
sort -t$'\t' -k3rn |\ # sort rows by line number descending
sort -t$'\t' -u -k2,2 |\ # take rows with unique basename in sorted order
cut -f1 # take name column
This assumes there are no tab or newline characters in your file names, and uses a number of bashisms ($'\t' quoting, string manipulation). Unlike most answers here, it doesn't choke on spaces or on multiple occurrences of _ in a file name.
The original was robust, but generating the table invoked printf and wc N times. This one's a bit uglier, but much faster (200x on my machine):
# table is now basename|lines|name
printf "%s\n" *_V[0-9]* |    # print every file name on its own line
rev | cut -d_ -f2- | rev |   # extract the base name (faster than sed)
paste - <(wc -l *_V[0-9]* |
    sed 's/^ *//;s/ /\t/;$d') |  # combine base names with tabulated wc output
sort -t$'\t' -k2rn,2 |       # sort as above
sort -t$'\t' -u -k1,1 |
cut -f3
Alternative with lots of piping:
wc -l *_V* |             # generate the list
sed 's/_V/ _V/;$d' |     # separate basename from version, delete the total line
sort -k 2,2 -k 1,1nr |   # sort by name, then by size (reverse)
sort -k 2,2 -u |         # keep the first entry per name (the max, by design)
sed 's/ _V/_V/' |        # join the basename back to its original form
awk '{print $2}'         # extract the filename
This script assumes the file names are under your control and that neither spaces nor _V appear in the base names. Otherwise, check out @Qualia's version.
With GNU awk for BEGINFILE:
awk '
    BEGINFILE { base=FILENAME; sub(/_[^_]+$/,"",base); fname[base] }
    FNR > max[base] { max[base]=FNR; fname[base]=FILENAME }
    END { for (base in fname) print fname[base] }
' *
You can approximate that in non-gawk awks with FNR==1 instead of BEGINFILE, but then you'd need extra code to handle the case where all the files with a given base are empty, if that's possible.
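A minimal sketch of that FNR==1 approximation, assuming no group consists entirely of empty files (the sample files are invented for illustration):

```shell
# Portable variant: FNR==1 fires on the first line of each file, so the
# base is never recorded for an empty file (hence the caveat above).
cd "$(mktemp -d)"
printf 'a\nb\nc\n' > a_V1   # 3 lines
printf 'a\n'       > a_V2   # 1 line
printf 'x\ny\n'    > b_V1   # 2 lines
awk '
    FNR == 1 { base = FILENAME; sub(/_[^_]+$/, "", base) }
    FNR > max[base] { max[base] = FNR; fname[base] = FILENAME }
    END { for (base in fname) print fname[base] }
' *_V* | sort
# prints: a_V1, then b_V1 (sorted, since "for (base in fname)" order is unspecified)
```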