I have a file:
To jest długi string z wieloma polskimi literami ąółżęś kodowany w UTF8,
żeby
było śmieszniej, haha.
ą
a
Example gawk:
gawk '{printf "%-80s %-s\n", $0, length}' file
In gawk, I get the correct result:
To jest długi string z wieloma polskimi literami ąółżęś kodowany w UTF8, 73
żeby 5
było śmieszniej, haha. 22
ą 1
a 1
Example mawk:
mawk '{printf "%-80s %-s\n", $0, length}' file
In mawk, I get the incorrect result:
To jest długi string z wieloma polskimi literami ąółżęś kodowany w UTF8, 80
żeby 6
było śmieszniej, haha. 24
ą 2
a 1
How can I get mawk to produce the same result as gawk?
mawk is a minimal-featured awk designed for speed of execution over functionality. You should not expect it to behave exactly the same as gawk or a POSIX awk. If you're going to use mawk, you need to get a mawk manual describing how IT behaves, don't rely on any other documentation describing how other awks behave.
IMHO there is no correct result for the formatting string %-s, as it is meaningless to left-align a string without specifying a width within which to align it. There are also different interpretations of what length means on its own: it could be shorthand for length($0), or it could be something else in a non-POSIX awk. There might not even be a length function in some non-POSIX awk, in which case it might be treated as an undefined variable name. How does any given awk handle non-English characters?
As I said - if you're going to use a non-POSIX awk, you need to check the manual for THAT awk for all of the gory details...
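To illustrate the two points above, here is a minimal sketch of the same command with an explicit conversion (%d instead of %-s) and an explicit argument (length($0) instead of bare length). The function name show_lengths is hypothetical. Whether length counts bytes or characters still depends on the awk in use and on the locale, so this removes the ambiguity in the program text only, not the difference between gawk and mawk.

```shell
# Hypothetical helper: print each line left-aligned in an 80-wide field,
# followed by whatever this awk reports as the line's length.
show_lengths() {
    # %-80s: left-align within 80; %d: plain integer, no alignment flag needed.
    # length($0) is spelled out rather than relying on bare "length".
    awk '{ printf "%-80s %d\n", $0, length($0) }' "$@"
}
```

Usage: show_lengths file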
I assume you are using different systems, because the awk installed on a system is usually a symlink to either gawk or mawk.
All awk versions are compatible as long as the versions coincide.
I therefore assume that the issue you are facing is due to the use of an older and a newer version of the programs.
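One way to check which implementation a plain awk resolves to on a given system is to follow the symlink. This is only a sketch: the function name which_awk is hypothetical, and it assumes a readlink that supports -f (GNU coreutils and recent BSDs do).

```shell
# Hypothetical helper: print the real path behind the "awk" command,
# e.g. a path ending in gawk or mawk when awk is a symlink.
which_awk() {
    # command -v prints the path the shell would run;
    # readlink -f follows every symlink in that path.
    readlink -f "$(command -v awk)"
}
```

Usage: which_awk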
UPDATE 1: I realized I could massively streamline it. The only thing needed is to add the count of UTF-8 continuation bytes back into the total width. With FS defined as the continuation-byte range, that count is always NF - 1 for non-empty lines, and the count at the tail end of the line reflects the UTF-8 character count (strictly speaking, it's a code-point count).
Caveat: this code takes a leap of faith and assumes the input is valid UTF-8 to begin with, without performing any data validation checks.
=
mawk[1/2]|gawk -b '
$!NF = sprintf("%-*s %s",(__=NF-!_)+80,$_,length($_)-__)' FS='[\\200-\\277]'
=
To jest długi string z wieloma polskimi literami ąółżęś kodowany w UTF8, 73
żeby 5
było śmieszniej, haha. 22
ą 1
a 1
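A more readable sketch of the same idea, not the one-liner above: it counts the continuation bytes with gsub instead of FS, and builds the format string by hand instead of using %-*s. The function name pad_utf8 and the variable names are hypothetical. It assumes valid UTF-8 input and forces byte-oriented matching with LC_ALL=C, so that length returns bytes in gawk as well as in mawk.

```shell
# Hypothetical helper: pad each line to WIDTH columns (default 80), then
# append the code-point count. Assumes the input is valid UTF-8.
pad_utf8() {
    LC_ALL=C awk -v width="${1:-80}" '
    {
        line = $0
        # UTF-8 continuation bytes are 0x80-0xBF (octal 200-277). Replacing
        # each one with itself changes nothing and returns the match count.
        cont = gsub(/[\200-\277]/, "&", line)
        # Every byte that is not a continuation byte starts one code point.
        chars = length(line) - cont
        # Widen the field by the continuation bytes, which printf counts
        # toward the width but which take up no column on screen.
        fmt = "%-" (width + cont) "s %d\n"
        printf fmt, $0, chars
    }'
}
```

Usage: pad_utf8 80 < file

Like the original, this equates one code point with one column, so combining marks or double-width characters would still misalign.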