
Adding line number and file name to 3 column files using awk

I have a couple of data sets containing x, y, z coordinates in plain text format (no column headers, whitespace delimited, CRLF line breaks). The data looks like this:

10168522 21059480 -86
10169988 21058886 -86
10171457 21058291 -86
10172926 21057706 -86
10174428 21057114 -85
10175927 21056531 -85
10177434 21055952 -85
10178966 21055370 -84
10180473 21054773 -85
10181992 21054164 -85
10183517 21053557 -85

In this example, the filename is "fileA.xyz". I'd like to add the line number and the filename to the file and print it out using awk (actually gawk 4.0.2 with no option of upgrading / installing additional tools). I have come up with the following

awk -F ' ' -v OFS=' ' -v ORS=' ' '{$1; $2; $3; $(NF+1)=++i FS "fileA.xyz"}1' better_fileA.xyz

which kind of works, but leaves the first line untouched:

10168522 21059480 -86
 1 fileA.xyz 10169988 21058886 -86
 2 fileA.xyz 10171457 21058291 -86
 3 fileA.xyz 10172926 21057706 -86
 4 fileA.xyz 10174428 21057114 -85
 5 fileA.xyz 10175927 21056531 -85
 6 fileA.xyz 10177434 21055952 -85
 7 fileA.xyz 10178966 21055370 -84
 8 fileA.xyz 10180473 21054773 -85
 9 fileA.xyz 10181992 21054164 -85
 10 fileA.xyz 10183517 21053557 -85

I've also noticed an extra white space in front of the line number (first column). I do understand awk is very complex, but I am a bit lost amidst the syntax options. For starters, I'm wondering why the order of columns I provided is apparently not passed on to the output?

Since all files are very large (couple of gigs), I'd like to use awk / sed. Also note that files need to be consistent, hence solutions involving cat -n are not really an option (due to the padding of line numbers in the file, which is not reasonable in this scenario where the range of line numbers is not known a priori and also because this would not take care of the filename).

Any suggestions or pointers towards a solution would be very welcome!

If the filename is hard-coded, this short one-liner should help:

$ awk '$0=NR FS "fileA.xyz" FS $0' YourFile

Basically, it prepends the line number and the filename to each record and prints the result to stdout.
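A slightly more defensive variant, as a sketch only: it assumes the input may carry CRLF line endings (as the question states) and passes the label in through an awk variable. The variable name `name` and the sample file `sample_a.xyz` are made up for illustration.

```shell
# Create a small CRLF-terminated sample (hypothetical file name).
tmpdir=$(mktemp -d)
printf '10168522 21059480 -86\r\n10169988 21058886 -86\r\n' > "$tmpdir/sample_a.xyz"

# Strip a trailing CR, then prepend the line number and the label.
out1=$(awk -v name="fileA.xyz" '{ sub(/\r$/, ""); print NR, name, $0 }' "$tmpdir/sample_a.xyz")
printf '%s\n' "$out1"

rm -rf "$tmpdir"
```

Using an explicit `print` instead of an assignment as the pattern keeps the intent readable, and `-v name=...` lets the same command be reused for other files.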

$ awk '{printf "%3d %s %s\n", NR, "fileA.xyz", $0}' better_fileA.xyz 
  1 fileA.xyz 10168522 21059480 -86
  2 fileA.xyz 10169988 21058886 -86
  3 fileA.xyz 10171457 21058291 -86
  4 fileA.xyz 10172926 21057706 -86
  5 fileA.xyz 10174428 21057114 -85
  6 fileA.xyz 10175927 21056531 -85
  7 fileA.xyz 10177434 21055952 -85
  8 fileA.xyz 10178966 21055370 -84
  9 fileA.xyz 10180473 21054773 -85
 10 fileA.xyz 10181992 21054164 -85
 11 fileA.xyz 10183517 21053557 -85

The first line of your input appears to be skipped, but it isn't. Your input file has DOS line endings (<CR><LF>), so the CR at the end of each line makes the text you add to one line appear at the start of the next line instead of at the end of the current one. See "Why does my tool output overwrite itself and how do I fix it?".
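One way to confirm the diagnosis before changing anything is to check whether the records really end in a carriage return. The following is a minimal sketch on a throwaway two-line file (the temporary file is created only for this demonstration); it assumes a Unix-like system where awk splits records on LF alone.

```shell
# Build a two-line CRLF sample in a temporary file.
demo=$(mktemp)
printf '1 2 3\r\n4 5 6\r\n' > "$demo"

# Count records that still end in a carriage return after awk splits on LF.
cr_lines=$(awk '/\r$/ { n++ } END { print n + 0 }' "$demo")
echo "lines ending in CR: $cr_lines"

# Make the hidden CR visible: od -c shows it as \r before the \n.
head -n 1 "$demo" | od -c

rm -f "$demo"
```

A non-zero count means the last field of every record carries a trailing CR, which is why anything appended after it looks as if it belongs to the following line.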

Regarding your script:

  1. -F ' ' -v OFS=' ' doesn't make sense as you're setting those to the default values they already have.
  2. -v ORS=' ' doesn't make sense as you don't want your output crammed all onto 1 line.
  3. $1; $2; $3; doesn't make sense as it doesn't do anything except reference the field for no reason.
  4. $(NF+1)=++i doesn't make sense as i will always have the same value as NR , which awk already maintains for you.
  5. Populating/modifying a field (e.g. $(NF+1) ) or $0 unnecessarily is inefficient since it then causes awk to re-construct and/or re-split $0 .
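The last point can be seen on a one-line input. This is an illustrative sketch, not taken from the question's data: the input deliberately uses runs of three spaces so that the rebuild of $0 becomes visible.

```shell
# Assigning to a field makes awk rebuild $0 from its fields, joined by OFS,
# so the original spacing between columns is lost.
rebuilt=$(printf 'a   b   c\n' | awk '{ $(NF+1) = NR; print }')
echo "$rebuilt"

# print with comma-separated arguments leaves $0 exactly as it was read
# and simply writes the extra value after it.
untouched=$(printf 'a   b   c\n' | awk '{ print $0, NR }')
echo "$untouched"
```

The first command collapses the triple spaces to single ones, while the second keeps them. On multi-gigabyte files the second form also spares awk the work of reassembling every record.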

It's also not clear from your code (which tries to append fields) and your current output (which shows prepended fields that you seem happy with) what you are really trying to do: append or prepend fields. Please also clarify whether the "fileA.xyz" you're adding to the output can be derived from the input file name better_fileA.xyz.

One guess at what you might want is:

$ awk '{sub(/\r$/,"")} {print $0, NR, "fileA.xyz"}' better_fileA.xyz
10168522 21059480 -86 1 fileA.xyz
10169988 21058886 -86 2 fileA.xyz
10171457 21058291 -86 3 fileA.xyz
10172926 21057706 -86 4 fileA.xyz
10174428 21057114 -85 5 fileA.xyz
10175927 21056531 -85 6 fileA.xyz
10177434 21055952 -85 7 fileA.xyz
10178966 21055370 -84 8 fileA.xyz
10180473 21054773 -85 9 fileA.xyz
10181992 21054164 -85 10 fileA.xyz
10183517 21053557 -85 11 fileA.xyz

or maybe:

$ awk '{sub(/\r$/,"")} NR==1{fname=FILENAME; sub(/[^_]*_/,"",fname)} {print $0, NR, fname}' better_fileA.xyz
10168522 21059480 -86 1 fileA.xyz
10169988 21058886 -86 2 fileA.xyz
10171457 21058291 -86 3 fileA.xyz
10172926 21057706 -86 4 fileA.xyz
10174428 21057114 -85 5 fileA.xyz
10175927 21056531 -85 6 fileA.xyz
10177434 21055952 -85 7 fileA.xyz
10178966 21055370 -84 8 fileA.xyz
10180473 21054773 -85 9 fileA.xyz
10181992 21054164 -85 10 fileA.xyz
10183517 21053557 -85 11 fileA.xyz

or maybe:

$ awk '{sub(/\r$/,"")} NR==1{fname=FILENAME; sub(/[^_]*_/,"",fname)} {print NR, fname, $0}' better_fileA.xyz
1 fileA.xyz 10168522 21059480 -86
2 fileA.xyz 10169988 21058886 -86
3 fileA.xyz 10171457 21058291 -86
4 fileA.xyz 10172926 21057706 -86
5 fileA.xyz 10174428 21057114 -85
6 fileA.xyz 10175927 21056531 -85
7 fileA.xyz 10177434 21055952 -85
8 fileA.xyz 10178966 21055370 -84
9 fileA.xyz 10180473 21054773 -85
10 fileA.xyz 10181992 21054164 -85
11 fileA.xyz 10183517 21053557 -85
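Since the question mentions several data sets, the same idea might be extended to many files in one run. The sketch below assumes every input follows the better_<name>.xyz naming pattern and that numbering should restart for each file. The file better_fileB.xyz and the sample contents are invented for the demonstration.

```shell
# Two small CRLF samples in a scratch directory (hypothetical contents).
work=$(mktemp -d)
printf '1 2 3\r\n4 5 6\r\n' > "$work/better_fileA.xyz"
printf '7 8 9\r\n' > "$work/better_fileB.xyz"

multi=$(cd "$work" && awk '
    FNR == 1 { fname = FILENAME; sub(/[^_]*_/, "", fname) }  # new file: derive label
    { sub(/\r$/, ""); print FNR, fname, $0 }                  # FNR restarts per file
' better_fileA.xyz better_fileB.xyz)
printf '%s\n' "$multi"

rm -rf "$work"
```

FNR is the per-file record number, whereas NR keeps counting across all files. Use NR in the print statement instead if one continuous numbering is wanted.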

Get the book Effective AWK Programming, 5th Edition, by Arnold Robbins.

If you don't need the line numbers in perfectly aligned columns:

mawk 'NF~FNR{__=_ FILENAME _} $!NF=NR __$_' RS='\r?\n' FS='^$' \_=' ' better_fileA.xyz
