
Should I use AWK or SED to remove commas between quotation marks from a CSV file? (BASH)

I have a bunch of daily printer logs in CSV format and I'm writing a script to keep track of how much paper is being used and save the info to a database, but I've come across a small problem.

Essentially, some of the document names in the logs include commas (these names are always enclosed in double quotes), and since the file is comma-separated, my code is messing up and pushing everything one column to the right for those records.

From what I've been reading, it seems like the best way to fix this would be with awk or sed, but I'm unsure which is the better option for my situation, and how exactly I'm supposed to implement it.

Here's a sample of my input data:

 2015-03-23 08:50:22,Jogn.Doe,1,1,Ineo 4000p,"MicrosoftWordDocument1",COMSYRWS14,A4,PCL6,,,NOT DUPLEX,GRAYSCALE,35kb,
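
For illustration, here is how a plain comma split goes wrong. The line below is hypothetical (I've put a comma into the document name; it's not taken from the real logs):

# Hypothetical line with a comma inside the quoted document field
line='2015-03-23 08:50:22,John.Doe,1,1,Ineo 4000p,"Budget, final",COMSYRWS14,A4,PCL6,,,NOT DUPLEX,GRAYSCALE,35kb,'

# Splitting on IFS=, treats the comma inside the quotes as a separator,
# so "document" only gets the first half and every later field shifts right.
IFS=, read -r time user pages copies printer document client rest <<< "$line"
echo "document=$document   client=$client"
# prints: document="Budget   client= final"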

And here's what I have so far:

#!/bin/bash

#Get today's file name
yearprefix="20"
currentdate=$(date +"%m-%d-%y");
year=${currentdate:6};
year="$yearprefix$year"
month=${currentdate:0:2};
day=${currentdate:3:2};
filename="papercut-print-log-$year-$month-$day.csv"
echo "The filename is: $filename"

# Remove commas in between quotes.

#Loop through CSV file

OLDIFS=$IFS
IFS=,
[ ! -f "$filename" ] && { echo "Input file $filename not found"; exit 99; }
while read time user pages copies printer document client size pcl blank1 blank2 duplex greyscale filesize blank3
do
        #Remove headers
        if [  "$user" != "" ] && [ "$user" != "User" ]
        then
                #Remove any file name with an apostrophe

                if [[ "$document" =~ "'" ]];
                then
                        document="REDACTED"; # Lazy. Need to figure out a proper solution later.
                fi

                echo "$time"
                #Save results to database
                mysql -u username -p -h localhost -e "USE printerusage; INSERT INTO printerlogs (time, username, pages, copies, printer, document, client, size, pcl, duplex, greyscale, filesize) VALUES ('$time', '$user', '$pages', '$copies', '$printer', '$document', '$client', '$size', '$pcl', '$duplex', '$greyscale', '$filesize');"
        fi
done < "$filename"
IFS=$OLDIFS

Which option is more suitable for this task? Will I have to create a second temporary file to get this done?

Thanks in advance!

As I wrote in another answer:

Rather than interfere with what is evidently source data, i.e. the stuff inside the quotes, you might consider replacing the field-separator commas (with, say, |) instead:

s/,([^,"]*|"[^"]*")(?=(,|$))/|$1/g

And then splitting on | (assuming none of your data has | in it).
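
That substitution uses a lookahead ((?=...)) and $1 in the replacement, which are Perl regex features rather than sed ones, so the simplest way to apply it is a perl one-liner. A minimal sketch, assuming perl is available and reusing the $filename variable from the script above:

perl -pe 's/,([^,"]*|"[^"]*")(?=(,|$))/|$1/g' "${filename}" > "${filename}.pipe"

# then split on | instead of , in the loop:
# IFS='|'
# while read -r time user pages copies printer document client size pcl blank1 blank2 duplex greyscale filesize blank3
# do ... done < "${filename}.pipe"

If you'd rather not create a second file at all, process substitution avoids it: while read ... ; do ... ; done < <(perl -pe '...' "$filename").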

See also: Is it possible to write a regular expression that matches a particular pattern and then does a replace with a part of the pattern?

There is probably an easier way using sed alone, but this should work. Loop over the file; for each line, match the quoted substrings with grep -o, then replace the commas inside them with spaces (or whatever you would like to use in place of the commas - if you want to preserve the data you can use a non-printable character and convert it back to commas afterwards).

# Loop over the file line by line; for each line, pull out the quoted
# substrings with grep -o and replace the commas inside them with spaces.
while IFS= read -r line; do
 var="$line"
 while IFS= read -r b; do
  repl="$(sed 's/,/ /g' <<< "$b")"
  var="$(sed "s#${b}#${repl}#" <<< "$var")"
 done < <(grep -o '"[^"]*"' <<< "$line")
 echo "$var"
done < test.txt
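
A quick way to sanity-check it (the test line below is hypothetical, with a comma added inside the document name):

echo '2015-03-23 08:50:22,John.Doe,1,1,Ineo 4000p,"Budget, final",COMSYRWS14' > test.txt
# running the loop above then prints:
# 2015-03-23 08:50:22,John.Doe,1,1,Ineo 4000p,"Budget  final",COMSYRWS14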
