简体   繁体   中英

Remove quotes “…” from all columns with AWK

I've currently been trying to create an awk script that will take my.csv and convert it into a.tsv. I have managed to achieve this with the below script:

$ awk 'BEGIN { FS=","; OFS="\t" } {$1=$1; print}' file.csv > file.tsv

Example csv data

"visitor_date","country","browser","browser_version","id1","id2","id3","id4","true or false column"
"05/10/2019 00:55","United States","App","","692467","2163702","4117512086","format","0"
"05/10/2019 00:56","United States","SamsungBrowser","8","692467","2163702","4117512083","format","1"
"05/10/2019 01:08","United States","safari","11","692467","2163704","4117511299","format","1"
"05/10/2019 03:25","United States","SamsungBrowser","8","692467","2163704","4117511299","format",""
"05/10/2019 07:50","United States","chrome","58","692467","2163704","4117511299","format",""
"05/10/2019 09:18","United States","internet explorer","11","692467","2163702","4117512086","format","0"

However I've found that my data file tends to have all of the columns defined as a sting with the "", even when we have things that might be considered an integer.

My initial test was to try and remove the {","} from everything and then define things as strings on certain columns but I get an error when I try to quote it.

$ awk 'BEGIN { FS="",""; OFS="\t" } {$1=$1; print}' file.csv > file.tsv

Error

awk: cmd. line:1: BEGIN { FS="",""; OFS="\t" } {$1=$1; print}
awk: cmd. line:1:              ^ syntax error

From the research I've done online all I seem to find is how you can change a string value in a cell to a different integer value. Is there a way that I can transform certain columns (not cells) to be treated as "integers" and "dates" instead of all as strings? Is what I'm trying to do with my initial test even the right direction I should be taking or is there a different way with AWK to decide what my columns are?

Desired tsv Output

"visitor_date"  "country    browser"    "browser_version"   "client_transaction_id" "id1"   "id2"   "id3"   "id4"   "true or false column"
05/10/2019 00:55    "United States" "App"       null    692467  "2163702"   4117512086  "format"    0
05/10/2019 00:56    "United States" "SamsungBrowser"    8   "null"  692467  2163702 4117512083  "format"    1
05/10/2019 01:08    "United States" "safari"    11  null    "692467"    2163704 4117511299  "format"    1
05/10/2019 03:25    "United States" "SamsungBrowser"    8   "null"  692467  2163704 4117511299  "format"    
05/10/2019 07:50    "United States" "chrome"    58  null    "692467"    2163704 4117511299  "format"    
05/10/2019 09:18    "United States" "internet explorer" 11  "null"  692467  2163702 4117512086  "format"    0

Note: The "visitor_date" column would need to be a "date" format.

Apologies if this is something super obvious or basic. I've currently only got 2 days experience using an ubuntu interface on windows 10 and AWK so there is a lot I'm trying to take in.

If there are no double quotes you want to preserve, simply

awk 'BEGIN { FS=","; OFS="\t" } { gsub("\"", "") } { $1=$1 } 1' file.csv >file.tsv

Incidentally, this also demonstrates how to quote a literal double-quote character inside a double-quoted string; backslash-escape it.

If you only want to remove double quotes around some columns, you can limit the scope of the gsub . For example,

... { gsub("\"", "", $1); gsub("\"", "", $3) } 1

to only manipulate columns 1 and 3. Then you can drop the { $1 = $1 } because we are forcing Awk to recalculate its columns anyway (so the FS gets replaced with OFS ).

To not manipulate the first line (which commonly contains the header, not data), add a condition:

... NR>1 { gsub(...) }1

The "replace everywhere" can be expressed quite succinctly in sed , though it tends to be more of a write-only language.

sed -e 's/,/\t/g' -e 's/"//g' file.csv >file.tsv

Not all sed dialects interpret \t as a literal tab, and some are even allergic to multiple -e options; but this should work fine on Ubuntu and generally Linux (other than possibly Busybox and other similar attempts to relive the constraints of the jolly 1970s).

It's not awk, but it's very easy using Miller ( https://github.com/johnkerl/miller ). Running

mlr --c2t cat ./input.csv >./output.csv

You will have

visitor_date    country browser browser_version id1 id2 id3 id4 true or false column
05/10/2019 00:55    United States   App     692467  2163702 4117512086  format  0
05/10/2019 00:56    United States   SamsungBrowser  8   692467  2163702 4117512083  format  1
05/10/2019 01:08    United States   safari  11  692467  2163704 4117511299  format  1
05/10/2019 03:25    United States   SamsungBrowser  8   692467  2163704 4117511299  format  
05/10/2019 07:50    United States   chrome  58  692467  2163704 4117511299  format  
05/10/2019 09:18    United States   internet explorer   11  692467  2163702 4117512086  format  0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM