
Using awk to print a new column without apostrophes or spaces

I'm processing a text file and adding a column built from parts of other columns. A new requirement came in to remove spaces and apostrophes from that column, and I'm not sure of the most efficient way to do it.

The file's content can be created by the following script:

content=(
  john    smith          thomas       blank    123    123456    10  
  jane    smith          elizabeth    blank    456    456123    12  
  erin    "o'brien"      margaret     blank    789    789123    9  
  juan    "de la cruz"   carlos       blank    1011   378943    4
)
# put this into a tab-separated file, with the syntactic (double) quotes above removed
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "${content[@]}" >infile

This is what I have now, but it fails to remove spaces and apostrophes:

awk -F "\t" '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$6 tolower(substr($2,0,3)); }' infile > outfile

The following attempt throws the error "sub third parameter is not a changeable object", which makes sense, I guess, since I'm trying to process output instead of input:

awk -F "\t" '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$6 sub("'\''", "",tolower(substr($2,0,3))); }' infile > outfile

Is there a way I can print a combination of column 6 and part of column 2 in lower case, while removing spaces and apostrophes from the value written to the new column? In the worst case I can create a new file with my first command and process that output with a second awk command, but I'd like to do it in one pass if possible.

Your second approach was close, apart from the order of operations:

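awk -F "\t" '
  BEGIN { OFS="\t"; }
  {
    var=$2;
    gsub("['\''[:space:]]", "", var);   # gsub, not sub: remove every match, not just the first
    var=tolower(substr(var, 1, 3));     # awk strings are indexed from 1
    print $1,$2,$3,$5,$6,$7,$6 var;
  }
' infile > outfile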
  • Assigning the contents you want to modify to a variable lets that variable be modified in-place.
  • Characters you want to remove should be removed before taking the substring, since otherwise you shorten your 3-character substring.
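
The second point can be checked on a single value. This is a minimal sketch, not part of the original answer: the function name compare_order is made up for illustration, and \047 is the octal escape for an apostrophe.

```shell
# compare_order is a hypothetical helper: it prints the 3-character
# abbreviation computed both ways for one input string.
compare_order() {
    printf '%s\n' "$1" | awk '{
        early = $0
        gsub(/[\047[:space:]]/, "", early)   # strip first ...
        early = substr(early, 1, 3)          # ... then truncate

        late = substr($0, 1, 3)              # truncate first ...
        gsub(/[\047[:space:]]/, "", late)    # ... then strip

        print early, late
    }'
}

compare_order "o'brien"      # prints: obr ob
compare_order "de la cruz"   # prints: del de
```

Stripping after truncating leaves only two characters whenever an apostrophe or space falls inside the first three.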

It's a guess since you didn't provide the expected output but is this what you're trying to do?

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    abbr = $2
    gsub(/[\047[:space:]]/,"",abbr)
    abbr = tolower(substr(abbr,1,3))
    print $1,$2,$3,$5,$6,$7,$6 abbr
}

$ awk -f tst.awk infile
john    smith   thomas  123     123456  10      123456smi
jane    smith   elizabeth       456     456123  12      456123smi
erin    o'brien margaret        789     789123  9       789123obr
juan    de la cruz      carlos  1011    378943  4       378943del

Note that the way to represent a ' in a '-enclosed awk script is with the octal escape \047 (which will continue to work if/when you move your script to a file, unlike '\'' which only works from the command line). Also, strings, arrays, and fields in awk start at 1, not 0, so your substr(..,0,3) is wrong: some awks treat the invalid start position of 0 as if you had used the first valid start position, 1, while others return one character fewer, so always start at 1.
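
As a small illustration of both points (a sketch only; strip_quote is a hypothetical name, not something from the question), the awk program below contains no shell quoting tricks and starts its substring at position 1:

```shell
# strip_quote is a hypothetical helper name. The awk program uses \047
# for the apostrophe, so the same text works unchanged in a script file.
strip_quote() {
    printf '%s\n' "$1" | awk '{ gsub(/\047/, ""); print substr($0, 1, 3) }'
}

strip_quote "o'brien"   # prints: obr
```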

The "sub third parameter is not a changeable object" error you were getting is because sub() modifies the object you pass as its 3rd argument, but you're calling it with a string value (the output of tolower(substr(...))) rather than a variable or field, and you can't modify such a value in place. Try sub(/o/,"","foo") and you'll get the same error, whereas var="foo"; sub(/o/,"",var) is valid since you can modify the content of variables.
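
A related detail worth keeping in mind: sub() returns the number of substitutions it made, not the modified string, so concatenating its return value (as in $6 sub(...)) would append 0 or 1 even with a valid third argument. The sketch below, with a hypothetical helper name abbreviate, stores the expression in a variable first and prints both the edited variable and the return value:

```shell
# abbreviate is a hypothetical helper name; it mirrors the fix on one value.
abbreviate() {
    printf '%s\n' "$1" | awk '{
        s = tolower(substr($0, 1, 3))   # keep the expression result in a variable
        n = sub(/\047/, "", s)          # sub() edits s in place; n is the match count
        print s, n
    }'
}

abbreviate "O'Brien"   # prints: ob 1
abbreviate "Smith"     # prints: smi 0
```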
