I have a set of large data files that I need to bulk load into a database. The data lines are pipe | delimited but also use inverted commas " to delineate text entries. The problem is that many of the blocks of text also include one or more additional " characters other than the ones at the beginning / end of the field, which is breaking the import.
I'm looking for a regular expression that will find the lines in the file that contain more than two " characters between any pair of | delimiters, i.e. within a single field.
For example
123|"Mr Smith"|456|"No extra inverted commas, This line is fine"|789
123|"Mr Jones"|456|"This one has "extra inverted commas", not so good"|789
123|"Mr Jones"|456|"Even one additional " is a bit of an issue"|789
I need to find lines which are like the second and third one above.
Any assistance appreciated!
Thanks
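This can be done by piping the output to awk, as below. The script splits each line on | and prints the line if any field contains more than two " characters. It works whether or not a line starts and ends with a pipe, but it does assume that | never appears inside a quoted text entry.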
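| awk -F'|' '{
    for (i = 1; i <= NF; i++) {
        field = $i                       # work on a copy so the line is left unchanged
        if (gsub(/"/, "", field) > 2) {  # gsub returns how many quotes it replaced
            print                        # more than two quotes in one field
            break                        # one report per line is enough
        }
    }
}'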
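Since the question asked for a regular expression, here is a minimal sketch of the same check with plain grep. It makes the same assumption that | never occurs inside a quoted entry, and the temporary sample file is only there for illustration (it holds the three example lines from the question). The idea is that [^|]* cannot cross a delimiter, so the pattern can only match when three " characters sit inside one field.

```shell
# Hypothetical sample file holding the three example lines from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
123|"Mr Smith"|456|"No extra inverted commas, This line is fine"|789
123|"Mr Jones"|456|"This one has "extra inverted commas", not so good"|789
123|"Mr Jones"|456|"Even one additional " is a bit of an issue"|789
EOF

# Match three quote characters with no | between them, i.e. at least
# three quotes inside a single field. -n prefixes each hit with its line number.
matches=$(grep -n '"[^|]*"[^|]*"' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

On the sample this reports the second and third lines only. For the real data, the same pattern can be run directly against the file, with -v added to list the clean lines instead.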