
regex - Notepad++ search and replace losing lines

I'm very new to regex, and I'm trying to use Notepad++ to clean up some CSV files. I am running version 7.8.2 (64-bit), as my files are too large for the 32-bit version to open.

Within the data, most of the fields are standardized and automatically generated by the system. There are exactly 30 fields in each row. There is one field where users can enter comments, however, and in a few cases, users have entered a line break within this field. When this happens, Notepad++ creates a new line for this data.

For example, the third line below should be a continuation of the second line (edited from condensed example in the original post) :

"39901","0002286898","88","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 2016 PER ATTACHED SCHEDULE. FOR 39901, IU journal  2297455 CONTACT: [NAME PHONE NUMBER] / [NAME PHONE NUMBER]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","191313.130","07/28/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002290128","7","ACTUALS","To record accrued liabilities for goods or services received at June 30, 2016 per the attached schedule.  
Contact [NAME PHONE NUMBER EMAIL] or [NAME PHONE NUMBER EMAIL]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","2556242.170","07/31/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291224","37","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 PER THE ATTACHED SCHEDULE.  FOR 34530, CONTACT: [NAME PHONE NUMBER EMAIL]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","3010262.140","07/27/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291259","2","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 PER THE ATTACHED SCHEDULE.  FOR 34571, CONTACT: [NAME PHONE NUMBER] / [NAME PHONE NUMBER]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","38140.260","07/27/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291336","12","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30  PER ATTACHED SCHEDULE. FOR 345.20","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","2768000.000","08/01/2016","07/01/2016","","Accrued Liabilities",""

I am trying to remove the extra line feed in the second row so that the data instead looks like:

"39901","0002286898","88","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 2016 PER ATTACHED SCHEDULE. FOR 39901, IU journal  2297455 CONTACT: [NAME PHONE NUMBER] / [NAME PHONE NUMBER]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","191313.130","07/28/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002290128","7","ACTUALS","To record accrued liabilities for goods or services received at June 30, 2016 per the attached schedule.  Contact [NAME PHONE NUMBER EMAIL] or [NAME PHONE NUMBER EMAIL]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","2556242.170","07/31/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291224","37","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 PER THE ATTACHED SCHEDULE.  FOR 34530, CONTACT: [NAME PHONE NUMBER EMAIL]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","3010262.140","07/27/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291259","2","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 PER THE ATTACHED SCHEDULE.  FOR 34571, CONTACT: [NAME PHONE NUMBER] / [NAME PHONE NUMBER]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","38140.260","07/27/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291336","12","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30  PER ATTACHED SCHEDULE. FOR 345.20","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","2768000.000","08/01/2016","07/01/2016","","Accrued Liabilities",""

There is no carriage return, only a line feed, so searching for \n also flags all of the line feeds that should legitimately end the line.

In this case, the data is structured so that the last column is always blank ("") . As such, I have tried to search for lines where the end is not blank – the line ends with a letter, number, period, space, etc. My plan is to replace these instances with a uniquely odd word, and then do a second, expanded search and replace to get rid of the new expression and the line feed.

Although unwieldy, I've been doing it in steps:

  • \d{1}$ to find lines where the last character is a number;
  • \w{1}$ to find lines where the last character is a letter;
  • \s{1}$ to find lines where the last character is whitespace; and
  • \.$ to find lines that end with a period.

I will then do a last search to find any stragglers that don't start with 39901 .

I run these searches as a regular search, and then replace with REPLACEHERE999_ , which I assume no one else has entered into the data. I understand that this will remove and replace the last character in the line (the final number, letter, space, etc.), but I can live with that. After these replacements have been made, I plan to then do a second, expanded search to swap out REPLACEHERE999_\n with a space, getting rid of both REPLACEHERE999_ and the line feed.

When I do the first searches, they make a reasonable number of substitutions based on the number of errors I initially got in Power Query: 377 for \d{1}$ , for example. Once I make these replacements, however, the number of lines drops significantly. Originally, I had 3,919,186 lines, but after the first search and replace ( \d{1}$ ) I only had 1,543,818 lines, less than half of what I started with. When I work through the first few replacements one at a time, I don't lose lines, but when I use “Replace All,” they disappear.

Again, I just started with regex/Notepad++, so I may be missing some basic thing. But if I am only making a limited number of replacements, why are so many of my lines vanishing?

Comments and suggestions on my searches or thinking are welcome, but the disappearing lines are the crucial issue here.

Thanks!

  • Ctrl + H
  • Find what: \R(?!")
  • Replace with: LEAVE EMPTY
  • CHECK Wrap around
  • CHECK Regular expression
  • Replace all

Explanation:

\R          # any kind of line break
(?!")       # negative lookahead: the line break must not be followed by a double quote
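
Outside Notepad++, the same idea can be tried on a small sample first. The following is a minimal Python sketch, not part of the original answer: Python's re module has no \R, so it assumes LF-only line endings (as the question describes) and uses \n instead. The function name and the three-field sample rows are made up for illustration.

```python
import re


def join_broken_rows(text: str) -> str:
    """Remove every line feed that is not followed by a double quote.

    Assumption: each real row starts with a double quote and the file
    uses LF-only line endings.
    """
    return re.sub(r'\n(?!")', "", text)


# Hypothetical sample: the first record is split across two lines.
sample = (
    '"1","first part  \n'
    'second part","" \n'
    '"2","intact","" \n'
    '"3","last",""'
)

fixed = join_broken_rows(sample)
for row in fixed.splitlines():
    print(row)
```

As with the Notepad++ version, a continuation line that happens to start with a double quote would not be joined, so it is worth checking the row count afterwards.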


Assume each row contains exactly 30 columns, and that no column contains a double quote character.

With Wrap around checked, you can do it in two steps. Notepad++ allows only one search mode at a time, so Step 1 can use Extended mode, while Step 2 needs Regular expression mode:

  1. Remove all the newlines, for example by replacing \n with nothing. [Step 1]

  2. Search for the regex (("[^"]*",){29}("[^"]*")\s?)
    and put $1\n in the "Replace with:" field. [Step 2] [Results]

Explanation:

  • Each field has the form "[^"]*" . In your case there are 30 fields per row, and the first 29 are followed by commas.
  • Inside a field, the regex allows every character except the double quote.
  • Let's write \x as shorthand for [^"] (it is not real regex syntax here). Then each field has the form "\x*" , and a whole row has the form ("\x*",){29}"\x*" . After Step 1 the file is a sequence of such rows, and the replacement adds a newline after each one.
  • The \s? deals with the residual space after every 30 fields.

NOTE: The links use the previous, less inclusive regex.
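
The two steps above can be sketched in Python to check the logic on a small sample. This is only an illustration under the same assumption (no field contains a double quote); the function name, the n_fields parameter, and the three-field sample rows are hypothetical, with n_fields defaulting to the 30 fields from the question.

```python
import re


def rebuild_rows(text: str, n_fields: int = 30) -> str:
    """Flatten the text, then put a newline after every n_fields fields."""
    # Step 1: remove every line break.
    flat = text.replace("\r", "").replace("\n", "")
    # Step 2: (n_fields - 1) quoted fields followed by commas, one final
    # quoted field, and an optional trailing whitespace character.
    row = re.compile(r'((?:"[^"]*",){%d}"[^"]*"\s?)' % (n_fields - 1))
    return row.sub(r"\1\n", flat)


# Hypothetical sample with 3 fields per row; the first row is broken.
sample = (
    '"1","first part  \n'
    'second part","" \n'
    '"2","intact","" \n'
    '"3","last",""'
)

rebuilt = rebuild_rows(sample, n_fields=3)
for line in rebuilt.splitlines():
    print(line)
```

Because the row boundary is found by counting fields, this version does not care what character a continuation line starts with, but a single stray double quote inside a comment would shift every following row.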

Hacks

Other hack answers exist and they're all viable; which one to use depends on how you want or need to go about it. I'm tackling hacks based on the end of the current line rather than the start of the next line, which other answers address (e.g. \R(?!") as proposed by Toto in his answer).

Reset hack: \K

This particular method is a hack based on the current line's ending. Most other hacks here take the next line into account instead.


[^" ] *\K\R

Alternatively, you could use ([^" ] *)\R and replace with $1 .

This matches any character that is neither a space nor " , followed by any number of spaces, then resets the match (the characters matched so far are no longer part of the final match), then matches the line break. Only the line break is replaced.
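
For a quick check outside Notepad++, the capture-group variant translates to Python almost directly. This is a sketch rather than a drop-in tool: Python's re module supports neither \K nor \R, so it uses the $1-style alternative and assumes LF-only line endings; the function name and sample rows are invented.

```python
import re


def join_on_line_ending(text: str) -> str:
    """Drop a line feed when the line ends in something other than a quote.

    The group captures the last non-quote, non-space character plus any
    trailing spaces, and the replacement puts them back without the "\n".
    """
    return re.sub(r'([^" ] *)\n', r"\1", text)


# Hypothetical sample: the first record is split across two lines.
sample = (
    '"1","first part  \n'
    'second part","" \n'
    '"2","intact","" \n'
    '"3","last",""'
)

joined = join_on_line_ending(sample)
for row in joined.splitlines():
    print(row)
```

Unlike the first-character check, this keeps the trailing characters of the broken line intact, so nothing is lost from the comment field.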

Skip/Fail hack: (*SKIP)(*FAIL)

Similar to previous, just using control verbs rather than reset token. Speed advantage over reset method.


" *\R(*SKIP)(*FAIL)|\R

This matches every line break that is preceded by " (optionally followed by any number of spaces), then skips those matches with a forced failure. The second alternative, \R , matches a line break and, because of the skip, only matches where the first alternative doesn't.

Make sure you have Backward direction selected:

[Image: Notepad++ Replace dialog set up with the regex defined above and Backward direction enabled]
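
Python's re module has no (*SKIP)(*FAIL), but the same "match what you want to keep first, then replace whatever is left" idea can be approximated with an alternation and a replacement callback. The sketch below is an emulation under that assumption, not the control-verb mechanism itself; it also assumes LF-only line endings, and the names and sample rows are illustrative.

```python
import re

# First alternative: a line break right after a closing quote (plus spaces).
# Second alternative: any other line break.
PATTERN = re.compile(r'(" *\n)|\n')


def drop_unquoted_line_breaks(text: str) -> str:
    """Keep line breaks that follow a quote; delete all other line breaks."""
    # When group 1 matched, re-emit it unchanged; otherwise emit nothing.
    return PATTERN.sub(lambda m: m.group(1) or "", text)


# Hypothetical sample: the first record is split across two lines.
sample = (
    '"1","first part  \n'
    'second part","" \n'
    '"2","intact","" \n'
    '"3","last",""'
)

cleaned = drop_unquoted_line_breaks(sample)
for row in cleaned.splitlines():
    print(row)
```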


Other answers here tackle checking the next line, and they're all great answers, so I won't provide any of those in mine.


Balancing "

Unfortunately, matching balanced " is difficult in regex (not impossible, just not the best tool).


("((?<!\\)\\(?:\\{2})*"|[^"\n\r])*"|^[^"\r\n]*"),? *(*SKIP)(*FAIL)|"[^"\r\n]*\K\R+

The first alternative matches either an opening " followed by any characters other than " or a line break (or an escaped \" ), up to the closing " ; or, from the start of a line, any characters other than " or a line break, followed by " (the end of a field that was opened on an earlier line). It then optionally matches a , and any number of spaces. We skip/fail these matches because they are all balanced " pairs or the ends of unbalanced ones. The second alternative then matches an unbalanced " (one that opens on a line but doesn't close on that line) and everything up to the end of the line, resets the match, and matches the line break(s). The result is any line break that breaks the balance of " .

This regex pattern is correct, but, unfortunately, this only works for matching or for the Replace function in Notepad++. I don't know why, but Replace All replaces 2 instances rather than 1.

Using Replace button (yields the message Replace: 1 occurrence was replaced. The next occurrence found ):

[Image: using the Replace function with the above pattern in Notepad++]

After clicking on Replace again, nothing happens:

[Image: clicking Replace a second time does nothing, although the message says the next occurrence was found]

As mentioned, Replace All replaces too much:

[Image: Replace All changes two locations instead of one]

My suggestion? Use one of the hack patterns I describe above or one from another answer if you can. It's quick and dirty, but works. If you need to check balanced " , use the last pattern, just know that you'll have to click Replace for every single match.

PS I haven't been able to identify the Replace vs Replace All issue, but I'm on version 7.8.1 of Notepad++; this may be a version(s)-specific issue.
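
If clicking Replace for every match is impractical, the balance check can also be sketched outside Notepad++ as a small state machine instead of a regex. This is a swapped-in technique, not the pattern above: it toggles an "inside quotes" flag on every " and drops line breaks seen while the flag is set. It assumes quotes are never backslash-escaped (a doubled "" toggles twice, so it is harmless); the function name, the replacement parameter, and the sample rows are hypothetical.

```python
def strip_line_breaks_inside_quotes(text: str, replacement: str = "") -> str:
    """Replace CR/LF characters that occur between an opening and closing quote."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            # Every double quote flips the state: open -> closed -> open ...
            in_quotes = not in_quotes
        if in_quotes and ch in "\r\n":
            out.append(replacement)
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical sample: the first record is split across two lines.
sample = (
    '"1","first part  \n'
    'second part","" \n'
    '"2","intact","" \n'
    '"3","last",""'
)

balanced = strip_line_breaks_inside_quotes(sample)
for row in balanced.splitlines():
    print(row)
```

Passing replacement=" " would insert a space where the line break was, which may read better when the comment text would otherwise run together.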


Result for each of the patterns described above in Notepad++:

"39901","0002286898","88","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 2016 PER ATTACHED SCHEDULE. FOR 39901, IU journal  2297455 CONTACT: [NAME PHONE NUMBER] / [NAME PHONE NUMBER]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","191313.130","07/28/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002290128","7","ACTUALS","To record accrued liabilities for goods or services received at June 30, 2016 per the attached schedule.  Contact [NAME PHONE NUMBER EMAIL] or [NAME PHONE NUMBER EMAIL]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","2556242.170","07/31/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291224","37","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 PER THE ATTACHED SCHEDULE.  FOR 34530, CONTACT: [NAME PHONE NUMBER EMAIL]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","3010262.140","07/27/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291259","2","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 PER THE ATTACHED SCHEDULE.  FOR 34571, CONTACT: [NAME PHONE NUMBER] / [NAME PHONE NUMBER]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","38140.260","07/27/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291336","12","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30  PER ATTACHED SCHEDULE. FOR 345.20","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","2768000.000","08/01/2016","07/01/2016","","Accrued Liabilities",""
