
Slight error when using awk to remove spaces from a CSV column

I have used the following awk command in my bash script to delete spaces in the 26th column of my CSV:

awk 'BEGIN{FS=OFS="|"} {gsub(/ /,"",$26)}1' original.csv > final.csv

Out of 400 rows, there are about 5 random rows this doesn't work on, even if I rerun the script on final.csv. Can anyone assist me with a method to take care of this? Thank you in advance.

EDIT: Here is a sample of the 26th column in original.csv vs final.csv respectively:

2212026837                         2212026837
2256  41688  6                     2256416886
2076113566                         2076113566
2009  84517  7                     2009845177
2067950476                         2067950476
2057  90531  5                     2057  90531  5  
2085271676                         2085271676
2095183426                         2095183426
2347366235                         2347366235
2200160434                         2200160434
2229359595                         2229359595
2045373466                         2045373466
2053849895                         2053849895
2300  81552  3                     2300  81552  3

You can use the string function split, and iterate over the resulting array to reassign the 26th field:

awk 'BEGIN{FS=OFS="|"} {
    n = split($26, a, /[[:space:]]+/)
    $26 = a[1]
    for (i = 2; i <= n; i++)
        $26 = $26 a[i]
}1' original.csv > final.csv
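To see the difference, here is a sketch on a hypothetical two-row sample (with $2 standing in for the real file's $26) whose second row contains tab characters, which a plain gsub(/ /,...) would leave untouched:

```shell
# Hypothetical sample: the second row's digits are separated by
# tabs, so gsub(/ /,"",...) alone would not remove them.
printf 'a|%s|b\n' '2212026837' "$(printf '2256\t41688\t6')" > sample.csv

# Split on any run of whitespace and rejoin the pieces.
awk 'BEGIN{FS=OFS="|"} {
    n = split($2, a, /[[:space:]]+/)
    $2 = a[1]
    for (i = 2; i <= n; i++)
        $2 = $2 a[i]
}1' sample.csv
```

The second line comes out as a|2256416886|b.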

I see two possibilities.

  1. The simplest is that you have some whitespace character other than a plain space. You can fix that by using a more general regex in your gsub: instead of / /, use /[[:space:]]/.

If that solves your problem, great! You got lucky, move on. :)
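As a sketch of that fix (on a hypothetical one-row sample, with $2 standing in for $26 and tabs as the stray whitespace):

```shell
# Hypothetical sample row whose digits are separated by tabs.
printf 'x|%s|y\n' "$(printf '2057\t90531\t5')" > rows.csv

# [[:space:]] matches spaces, tabs, and other whitespace, so the
# otherwise-unchanged command now cleans the field.
awk 'BEGIN{FS=OFS="|"} {gsub(/[[:space:]]/,"",$2)}1' rows.csv
```

This prints x|2057905315|y.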

  2. The other possible problem is trickier. The CSV (or, in this case, pipe-SV) format is not as simple as it appears, since you can have quoted delimiters inside fields. This, for instance, is a perfectly valid 4-field line in a pipe-delimited file:

     field 1|"field 2 contains some |pipe| characters"|field 3|field 4 

    If the first 4 fields on a line in your file looked like that, your gsub on $26 would actually operate on $24 instead, leaving $26 alone. If you have data like that, the only real solution is to use a scripting language with an actual CSV parsing library. Perl has Text::CSV, but it's not installed by default; Python's csv module is, so you could use a program like so:

     import csv, fileinput as fi, re
     for row in csv.reader(fi.input(), delimiter='|'):
         row[25] = re.sub(r'\s+', '', row[25])  # fields start at 0 instead of 1
         print('|'.join(row))

    Save the above in a file like colfixer.py and run it with python colfixer.py original.csv >final.csv .

    (If you tried hard enough, you could get that shoved into a -c option string and run it from the command line without creating a script file, but Python's not really built for that and it gets ugly fast.)
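One caveat with the script above: rejoining with '|'.join(row) drops any quoting the original file had, so a field that contains quoted pipes would break on the next read. A sketch (Python 3, with a hypothetical sample line) that writes the rows back through csv.writer instead, which re-quotes fields containing the delimiter:

```python
import csv
import io
import re

# Hypothetical one-line sample with a quoted pipe inside field 2.
data = 'field 1|"contains a | pipe"|2256  41688  6\n'

out = io.StringIO()
writer = csv.writer(out, delimiter='|', quoting=csv.QUOTE_MINIMAL)
for row in csv.reader(io.StringIO(data), delimiter='|'):
    row[2] = re.sub(r'\s+', '', row[2])  # clean the 3rd field (index 2)
    writer.writerow(row)

# The quoted field survives the round trip, and the target field is cleaned.
print(out.getvalue(), end='')
```

For a real file you would read with fileinput as above and write to stdout via csv.writer(sys.stdout, ...); the in-memory buffers here just keep the sketch self-contained.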
