
Regular Expression to find more than two occurrences of a character between delimiters

I have a set of large data files that I need to bulk load into a database. The data lines are pipe (|) delimited, and text entries are enclosed in inverted commas ("). The problem is that many of the text entries also contain one or more " characters besides the ones at the beginning and end of the field, which breaks the import.

I'm looking for a regular expression that will find the lines in a file that contain more than two " characters between any pair of | delimiters.

For example:

123|"Mr Smith"|456|"No extra inverted commas, This line is fine"|789

123|"Mr Jones"|456|"This one has "extra inverted commas", not so good"|789

123|"Mr Jones"|456|"Even one additional " is a bit of an issue"|789

I need to find lines which are like the second and third one above.

Any assistance appreciated!

Thanks

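It can be done by piping the data into awk as below. The script splits each line on | and prints the line if any field contains more than two " characters. A pipe at the start or end of a line only produces an empty field, so it does not affect the result; a pipe inside a quoted text entry, however, would split that entry and could hide the extra quotes.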

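| awk -F'|' '{
  for (i = 1; i <= NF; i++) {
    field = $i                        # work on a copy so the line is printed unchanged
    if (gsub(/"/, "", field) > 2) {   # gsub returns the number of quotes it replaced
      print; break                    # report the line once, then stop scanning it
    }
  }
}'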
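
Since the question asks for a regular expression, here is a minimal sketch of the same check with grep -E. It is not part of the original answer. The pattern looks for three " characters with no | between them, which can only happen when a single field holds more than two quotes. It assumes a literal | never occurs inside a quoted text entry; the function name find_extra_quotes is hypothetical and used only for illustration.

```shell
# Hypothetical helper: print lines in which some pipe-delimited field
# contains three or more double quotes.
# Assumption: a literal | never appears inside a quoted text entry.
find_extra_quotes() {
  # "  [^|]*  "  [^|]*  "   = three quotes with no pipe in between
  grep -E '"[^|]*"[^|]*"' "$@"
}

# Try it on the three sample lines from the question (read from stdin).
find_extra_quotes <<'EOF'
123|"Mr Smith"|456|"No extra inverted commas, This line is fine"|789
123|"Mr Jones"|456|"This one has "extra inverted commas", not so good"|789
123|"Mr Jones"|456|"Even one additional " is a bit of an issue"|789
EOF
```

On the sample data this prints the second and third lines only. To scan a real file, pass its name as an argument to the function; adding grep's -n option to the call would also show line numbers.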
