regex - Notepad++ search and replace losing lines

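I'm new to regular expressions and I'm trying to use Notepad++ to clean up some CSV files. I'm running version 7.8.2 (64-bit) because my files are too large to open in the 32-bit version.

Most fields in the data are standardized and generated automatically by the system. Each line has exactly 30 fields. However, users can type a comment into one field, and in a few cases a user entered a line break in that field. When that happens, Notepad++ shows the rest of the record on a new line.

For example, the third line below should be a continuation of the second line (edited down from the sample in the original post):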

"39901","0002286898","88","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 2016 PER ATTACHED SCHEDULE. FOR 39901, IU journal  2297455 CONTACT: [NAME PHONE NUMBER] / [NAME PHONE NUMBER]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","191313.130","07/28/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002290128","7","ACTUALS","To record accrued liabilities for goods or services received at June 30, 2016 per the attached schedule.  
Contact [NAME PHONE NUMBER EMAIL] or [NAME PHONE NUMBER EMAIL]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","2556242.170","07/31/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291224","37","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 PER THE ATTACHED SCHEDULE.  FOR 34530, CONTACT: [NAME PHONE NUMBER EMAIL]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","3010262.140","07/27/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291259","2","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 PER THE ATTACHED SCHEDULE.  FOR 34571, CONTACT: [NAME PHONE NUMBER] / [NAME PHONE NUMBER]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","38140.260","07/27/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291336","12","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30  PER ATTACHED SCHEDULE. FOR 345.20","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","2768000.000","08/01/2016","07/01/2016","","Accrued Liabilities",""

I'm trying to remove the extra line break in the second line so that the data looks like this:

"39901","0002286898","88","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 2016 PER ATTACHED SCHEDULE. FOR 39901, IU journal  2297455 CONTACT: [NAME PHONE NUMBER] / [NAME PHONE NUMBER]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","191313.130","07/28/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002290128","7","ACTUALS","To record accrued liabilities for goods or services received at June 30, 2016 per the attached schedule.  Contact [NAME PHONE NUMBER EMAIL] or [NAME PHONE NUMBER EMAIL]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","2556242.170","07/31/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291224","37","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 PER THE ATTACHED SCHEDULE.  FOR 34530, CONTACT: [NAME PHONE NUMBER EMAIL]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","3010262.140","07/27/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291259","2","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 PER THE ATTACHED SCHEDULE.  FOR 34571, CONTACT: [NAME PHONE NUMBER] / [NAME PHONE NUMBER]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","38140.260","07/27/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291336","12","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30  PER ATTACHED SCHEDULE. FOR 345.20","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","2768000.000","08/01/2016","07/01/2016","","Accrued Liabilities",""

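There are no carriage returns, only line feeds, so searching for \n also flags every line feed that legitimately ends a line.

In this case, the data is structured so that the last column is always blank (""). So I tried searching for lines that do not end in that blank field: lines ending in a letter, digit, period, space, and so on. My plan was to replace those instances with a unique, unusual word, and then run a second, extended search and replace to strip out the new word together with the line feed.

Clunky as it is, I've been doing this in steps:

  • \d{1}$ to find lines whose last character is a digit;
  • \w{1}$ to find lines whose last character is a letter;
  • \s{1}$ to find lines whose last character is a space;
  • \.$ to find lines ending in a period.

I would then do one final search for any lines that do not start with "39901".

I'm running these as regular expression searches and replacing with REPLACEHERE999_, which I'm assuming nobody else has typed into the data. I know this removes and replaces the last character of the line (the last digit, letter, space, etc.), but I can live with that. Once those replacements are done, I plan to run a second, extended search that replaces REPLACEHERE999_\n with a space, getting rid of both REPLACEHERE999_ and the line feed.

When I run the first searches, they make a reasonable number of replacements given the number of errors I originally hit in Power Query, for example 377 for \d{1}$. However, once I've made those replacements, the line count drops significantly. Originally I had 3,919,186 lines, but after the first search and replace (\d{1}$) I have only 1,543,818 lines, less than half of what I started with. When I do the first few replacements one at a time I don't lose lines, but when I use "Replace All" they disappear.

Again, I'm just getting started with regex and Notepad++, so I may be missing something basic. But if I'm only making a limited number of replacements, why are so many of my lines disappearing?

Comments and suggestions on my searches or approach are welcome, but the disappearing lines are the key question here.

Thanks!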

  • Ctrl + H
  • Find what: \R(?!")
  • Replace with: LEAVE EMPTY
  • Check Wrap around
  • Check Regular expression
  • Replace all

Explanation:

\R          # any kind of linebreak
(?!")       # negative lookahead, make sure we haven't " after
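For anyone who wants to try the same idea outside Notepad++, here is a minimal Python sketch. It is an adaptation, not the exact pattern above: Python's re module has no \R, so the sketch assumes the file uses only LF or CRLF line breaks and spells them out as \r?\n. The function name and the shortened sample rows are made up for illustration.

```python
import re

# Assumption: line breaks are LF or CRLF only, written as \r?\n because
# Python's re module does not support \R.
LINEBREAK_NOT_BEFORE_QUOTE = re.compile(r'\r?\n(?!")')


def join_continuation_lines(text: str) -> str:
    """Remove every line break that is not immediately followed by a double quote."""
    return LINEBREAK_NOT_BEFORE_QUOTE.sub("", text)


# Shortened, hypothetical rows in the same shape as the question's data.
sample = (
    '"39901","0002290128","To record accrued liabilities.  \n'
    'Contact [NAME]","LA",""\n'
    '"39901","0002291224","TO RECORD ACCRUED LIABILITIES.","LA",""\n'
)

fixed = join_continuation_lines(sample)
print(fixed)
```

Note that the line break at the very end of the text is also removed, because nothing (and therefore no double quote) follows it.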

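Assuming each line contains exactly 30 columns, and each column can contain any character except a double quote:

With Regular expression search mode and Wrap around turned on, this can be done in two steps:

  1. Remove all line breaks. [Step 1]

  2. Use this regex, (("[^"]*",){29}("[^"]*")\s?)
    and replace it with $1\n in the "Replace with:" field [Step 2] [Result]

Explanation:

  • Each field has the form "[^"]*". In your case there are 30 fields, and the first 29 are followed by a comma.
  • In my regex, the allowed characters are all characters except the double quote.
  • Let's abbreviate [^"] as \x (a shorthand for this explanation only). Each field then has the form "\x*", and one record has the form ("\x*",){29}"\x*", which repeats over and over in the flattened text. We add a new line after each piece of that form.
  • \s? takes care of the leftover space after every 30 entries.

Note: the links use a previous, less inclusive regex.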
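The two steps can be sketched in Python roughly as follows. This is only an illustration under the same assumptions as above (exactly 30 quoted fields per record, no double quotes inside a field); the helper names and the tiny generated rows are hypothetical, and the inner groups are written as non-capturing so that only the whole record is captured.

```python
import re

# One record: 29 quoted fields each followed by a comma, then a 30th quoted
# field, then an optional trailing whitespace character.
RECORD = re.compile(r'((?:"[^"]*",){29}"[^"]*")\s?')


def rebuild_records(text: str) -> str:
    # Step 1: remove all line breaks.
    flat = text.replace("\r", "").replace("\n", "")
    # Step 2: put a line break back after every complete 30-field record.
    return RECORD.sub(r"\1\n", flat)


def make_row(note: str) -> str:
    """Build a hypothetical 30-field row whose second field is a free-text note."""
    fields = ["39901", note] + [""] * 28
    return ",".join(f'"{field}"' for field in fields)


broken = make_row("first half  \nsecond half") + " \n" + make_row("intact") + "\n"
fixed = rebuild_records(broken)
print(fixed)
```

Because step 1 flattens the whole file into one string, this sketch holds the entire file in memory, which may matter for files with millions of lines.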

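Hacks

Other hack answers exist, and they are all viable; it just depends on how you want or need to go about it. I'm approaching the hack from the end of the current line rather than from the start of the next line, which is what the other answers address (for example \R(?!"), as Toto proposes in his answer).

Reset hack: \K

This particular method is a hack based on the end of the current line. Most other hacks here look at the next line.

Regex used: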

[^" ] *\K\R

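Alternatively, you can use ([^" ] *)\R and replace with $1

This matches every line that ends in a character that is neither a space nor a ", followed by any number of spaces. It then resets the match (the previously matched characters are no longer part of the final match) and matches the line break.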
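Python's re module supports neither \K nor \R, so a sketch in Python has to use the capture-group alternative. The version below makes two adjustments that are assumptions of this sketch, not part of the answer: the line break is written as \r?\n, and \r and \n are excluded from the character class so that the class cannot swallow part of a CRLF pair or an empty line. The names and sample rows are illustrative only.

```python
import re

# Capture-group form of the reset hack: keep the trailing text, drop the break.
# \r and \n are excluded from the class (an adjustment made for this sketch).
TEXT_THEN_BREAK = re.compile(r'([^" \r\n] *)\r?\n')


def join_on_line_end(text: str) -> str:
    """Drop line breaks that follow a line ending in something other than a quote."""
    return TEXT_THEN_BREAK.sub(r"\1", text)


# Hypothetical three-field rows; the first record is split by a stray line feed.
sample = '"a","note.  \nmore","" \n"b","x",""\n'
fixed = join_on_line_end(sample)
print(fixed)
```

Lines that end in a double quote, with or without trailing spaces, keep their line break because the character class cannot match the quote.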

Skip/fail hack: (*SKIP)(*FAIL)

Similar to the previous one, but it uses control verbs instead of the reset token. It is faster than the reset method.

Regex used:

" *\R(*SKIP)(*FAIL)|\R

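This matches every line that ends in " (followed by any number of spaces) and then skips those matches by forcing them to fail. The second alternative, \R, matches any line break sequence, and in this case it only matches where the first alternative does not.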

In the Notepad++ search dialog, make sure the Backward direction option is selected when using the regex defined above.
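Python's re module has no (*SKIP)(*FAIL) control verbs, so the following sketch only emulates the idea with a different, commonly used technique: the line breaks to keep are matched in a capture group and put back by a replacement callback, while any other line break is dropped. The names, the \r?\n spelling of the line break, and the sample rows are assumptions made for illustration.

```python
import re

# Group 1: a closing quote, optional spaces, then a line break (keep these).
# Bare alternative: any other line break (drop these).
KEEP_OR_DROP = re.compile(r'(" *\r?\n)|\r?\n')


def drop_unquoted_line_breaks(text: str) -> str:
    """Keep line breaks that follow a closing quote; remove all others."""
    return KEEP_OR_DROP.sub(lambda match: match.group(1) or "", text)


# Hypothetical three-field rows; the first record is split by a stray line feed.
sample = '"a","note.  \nmore","" \n"b","x",""\n'
fixed = drop_unquoted_line_breaks(sample)
print(fixed)
```

The callback returns the captured text unchanged when the first alternative matched and an empty string otherwise, which plays the role of "skip these, then match the rest".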


The other answers here deal with checking the next line. They are all good answers, so I won't repeat any of that in mine.


Balanced "

Unfortunately, matching balanced " in regex is hard (not impossible, regex is just not the best tool for it).

Regex used:

("((?<!\\)\\(?:\\{2})*"|[^"\n\r])*"|^[^"\r\n]*"),? *(*SKIP)(*FAIL)|"[^"\r\n]*\K\R+

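This pattern matches a " followed by any non-" characters or escaped \", then the closing "; or it matches any characters other than " or a line break, followed by a ". It then optionally matches a comma and/or any number of spaces. We skip/fail these matches, since they are all either a balanced " pair or the tail end of an unbalanced one. Then we match every unbalanced " (where the " opens on one line but does not close on that same line), match up to the line break, reset the match, and match the line break. The result is every line break that breaks up a balanced " pair.

This regex pattern is correct, but unfortunately it only works for matching, or with the Replace function in Notepad++. I don't know why, but Replace All replaces 2 instances instead of 1.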

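Using the Replace button with the pattern above produces the message "Replace: 1 occurrence was replaced. The next occurrence found". Clicking Replace a second time does nothing, even though the message says the next occurrence was found. As mentioned, Replace All replaces too much: it replaces two locations instead of one.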

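My suggestion? If you can, use one of the hack patterns I described above or one from another answer. It's quick and dirty, but it works. If you need to check for balanced ", use the last pattern, but be aware that you have to click Replace for each match.

P.S. I couldn't pin down the cause of the Replace versus Replace All issue, but I'm using Notepad++ version 7.8.1; this may be a version-specific problem.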


Result of each of the patterns above in Notepad++:

"39901","0002286898","88","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 2016 PER ATTACHED SCHEDULE. FOR 39901, IU journal  2297455 CONTACT: [NAME PHONE NUMBER] / [NAME PHONE NUMBER]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","191313.130","07/28/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002290128","7","ACTUALS","To record accrued liabilities for goods or services received at June 30, 2016 per the attached schedule.  Contact [NAME PHONE NUMBER EMAIL] or [NAME PHONE NUMBER EMAIL]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","2556242.170","07/31/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291224","37","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 PER THE ATTACHED SCHEDULE.  FOR 34530, CONTACT: [NAME PHONE NUMBER EMAIL]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","3010262.140","07/27/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291259","2","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 PER THE ATTACHED SCHEDULE.  FOR 34571, CONTACT: [NAME PHONE NUMBER] / [NAME PHONE NUMBER]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","38140.260","07/27/2016","07/01/2016","","Accrued Liabilities","" 
"39901","0002291336","12","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30  PER ATTACHED SCHEDULE. FOR 345.20","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","2768000.000","08/01/2016","07/01/2016","","Accrued Liabilities",""
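
If Notepad++ is not a hard requirement, the balanced " check can also be sketched without any regex. The Python below is a different technique from the pattern above: it walks the text character by character, tracks whether it is inside a quoted field, and drops line breaks only while inside one. It assumes, like the balanced pattern, that a literal quote inside a field is escaped with a backslash; the function name and sample rows are hypothetical.

```python
def join_quoted_line_breaks(text: str) -> str:
    """Remove line breaks that occur inside a double-quoted field."""
    out = []
    in_quotes = False
    escaped = False  # True when the previous character was an unescaped backslash
    for ch in text:
        if ch in "\r\n" and in_quotes:
            escaped = False
            continue  # line break inside a quoted field: drop it
        if ch == '"' and not escaped:
            in_quotes = not in_quotes  # an unescaped quote opens or closes a field
        escaped = ch == "\\" and not escaped
        out.append(ch)
    return "".join(out)


# Hypothetical rows: one plain broken field, one with backslash-escaped quotes.
sample = '"a","line one\nline two",""\n"b","say \\"hi\\"\nthere",""\n'
fixed = join_quoted_line_breaks(sample)
print(fixed)
```

Line breaks between records are left alone because they occur while no quoted field is open.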
