
How to append lines that match two patterns to the previous line in a file?

I have a CSV file in which records that should each occupy a single line are split across several lines. I need a way to join the split lines back together. Note that the number of comma-separated fields per record is not fixed.

A correct line has the following pattern:

X,X,X,"() ",Y,H where X stands for any number of fields. The end of the string ("() ",Y,H), however, is fixed. Y and H are each a single word.

The issue is that this line can appear as (or any variant of this):

X,X,

X, "()"

,Y,H

What I need is a way (awk, sed) to append every line that does not have 24 or more commas and does not end with ",Y,H" to the previous line.

Please bear in mind that it's a large file, although I have 256 GB of RAM.

Example

  • Correct lines

a, b, c, "()", h, k

a, b, c, d, "()", h, k

  • Same lines in the file

First line

a, b, c,

"()", h, k

Second line

a, b, c, d, "()"

, h

, k

So far I've tried this (not working):

awk '/"[:space:]*,[:space:]*[:alpha:]+[:space:]*,[:space:]*[:alpha:]+$/{print}' check.csv

to try to find the lines ending with ", X, Y where X and Y are words.

Also, as the minimum number of correct fields is 24, I've used:

awk 'NF<24{print}' check.csv

to find lines with fewer than 24 fields.

My idea is to detect lines that match both regular expressions and append them to the previous line.

Thank you!

This might work for you (GNU sed):

sed '/"()", *[^,]\+, *[^,]\+$/b;:a;N;s/\n//;/"()", *[^,]\+, *[^,]\+$/!ba;P;D' file

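If the line is already complete (it ends with "()", followed by two more comma-separated fields), branch to the end of the script, which prints it unchanged.

Otherwise, append the next line (N), remove the newline that N introduced, and test the pattern again.

Repeat until the pattern matches, then print the joined line (P), delete it (D), and start over with the next input line.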
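
Since the question also asks about awk, the same accumulate-until-complete loop can be sketched there. This is only a sketch and not part of the answer above: the function name join_split_lines is hypothetical, the regex additionally tolerates whitespace around the commas, and it assumes a record is complete as soon as the buffer ends with "()" followed by two comma-separated fields.

```sh
# Hypothetical helper: join physical lines until the accumulated text
# ends with  "()" , field , field  (whitespace around commas allowed).
join_split_lines() {
  awk '
    { buf = buf $0 }   # append the current line without its newline
    buf ~ /"\(\)"[[:space:]]*,[[:space:]]*[^,]+,[[:space:]]*[^,]+$/ {
      print buf        # the record is complete: emit it
      buf = ""         # and start collecting the next one
    }
    END { if (buf != "") print buf }   # flush an incomplete trailing record
  ' "$@"
}
```

It reads the named files or standard input, for example join_split_lines check.csv > fixed.csv (fixed.csv is a placeholder name). Like the sed command, it works as a stream and keeps only the current record in memory, and it cannot detect a break that falls inside the final word.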

perl -lanF, -e 'push @L, grep length, @F; if ($L[-3] eq q/"()"/) { print join ",", @L; @L=() }' file

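  • -l -n -e - loop over the input lines without printing them automatically; -l strips the line ending on input and appends a newline to each print
  • -a -F, - split each input line on commas into the array @F
  • push @L, grep length, @F - push the nonempty fields onto @L
  • if ($L[-3] eq q/"()"/) - if the third-to-last accumulated field is the magic marker:
    • print join ",", @L - print all of @L joined with commas
    • @L=() - reset @L
  • the comparison is exact, so the marker field must have no surrounding whitespace: in a line such as a, "()", h, k the field is ' "()"' (with a leading space) and does not match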
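
The field-based idea can also be sketched in awk with whitespace trimmed from each field, so that input written as a, b, "()", h, k is handled too. This is an illustration under stated assumptions rather than a translation of the Perl command: the function name join_by_fields is hypothetical, empty fields are dropped (as with grep length above, so legitimately empty CSV fields would be lost), commas inside quoted fields are not supported, and the output is normalized to plain commas without spaces.

```sh
# Hypothetical helper: collect trimmed, nonempty fields across lines and
# emit a record once the third-to-last collected field is the "()" marker.
join_by_fields() {
  awk -F',' '
    function emit(   i, out) {            # print collected fields, then reset
      out = field[1]
      for (i = 2; i <= n; i++) out = out "," field[i]
      print out
      n = 0
    }
    {
      for (i = 1; i <= NF; i++) {
        f = $i
        gsub(/^[[:space:]]+|[[:space:]]+$/, "", f)   # trim surrounding whitespace
        if (f != "") field[++n] = f                  # keep nonempty fields only
      }
      if (n >= 3 && field[n-2] == "\"()\"") emit()   # marker is third from last
    }
    END { if (n > 0) emit() }             # flush an incomplete trailing record
  ' "$@"
}
```

With the example from the question, the two broken records come out as a,b,c,"()",h,k and a,b,c,d,"()",h,k.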
