Notepad++: replace text between two strings spanning multiple lines while retaining part of the text in between

I have extracted the DDL of many Hive tables using the show create table command.

The output is like this:

CREATE EXTERNAL TABLE MYSCHEMA.MyTABLE(
  `col1` string, 
  `col2` string)
PARTITIONED BY ( 
  `data_as_of_date` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'input.regex'='^(.*?)~}\\|(.*?)~}\\|(.*?)$') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '/mnt/data/schema/layer/domain/MYTABLE'
TBLPROPERTIES (
  'DO_NOT_UPDATE_STATS'='true', 
  'STATS_GENERATED_VIA_STATS_TASK'='true', 
  'last_modified_by'='user', 
  'last_modified_time'='1603077305', 
  'numRows'='23483974', 
  'parquet.compression'='SNAPPY', 
  'transient_lastDdlTime'='1608243340');

I want to replace the text between...

ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'input.regex'='^(.*?)~}\\|(.*?)~}\\|(.*?)$') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '/mnt/data/schema/layer/domain/MYTABLE'
TBLPROPERTIES (
  'DO_NOT_UPDATE_STATS'='true', 
  'STATS_GENERATED_VIA_STATS_TASK'='true', 
  'last_modified_by'='user', 
  'last_modified_time'='1603077305', 
  'numRows'='23483974', 
  'parquet.compression'='SNAPPY', 
  'transient_lastDdlTime'='1608243340');

...to...

STORED AS PARQUET 
LOCATION '/mnt/data/schema/layer/domain/MYTABLE'
TBLPROPERTIES('parquet.compression'='SNAPPY');

...using Notepad++.

Note that the LOCATION parameter must stay the same as in the original; everything else should be replaced as shown above. In other words, the replacement spans multiple lines and must retain part of the matched text. Could someone suggest a regex that I can use in Notepad++ (v7.8.2)?

The final result should look like this:

CREATE EXTERNAL TABLE MYSCHEMA.MyTABLE(
  `col1` string, 
  `col2` string)
PARTITIONED BY ( 
  `data_as_of_date` string)
STORED AS PARQUET 
LOCATION '/mnt/data/schema/layer/domain/MYTABLE'
TBLPROPERTIES('parquet.compression'='SNAPPY');

There are many tables, and each one has a different LOCATION value, so LOCATION must not be replaced with a fixed string.

It is also fine to do this in two steps if it cannot be done with a single regex: first replace everything above LOCATION, then replace the TBLPROPERTIES block.

  • Ctrl + H
  • Find what: ROW FORMAT SERDE[\s\S]+?(LOCATION\s+.+\R)[\s\S]*?TBLPROPERTIES[^)]+?\);
  • Replace with: STORED AS PARQUET \n$1TBLPROPERTIES\('parquet.compression'='SNAPPY'\);
  • CHECK Match case
  • CHECK Wrap around
  • CHECK Regular expression
  • UNCHECK . matches newline
  • Replace all

Explanation:

ROW FORMAT SERDE        # literally
[\s\S]+?                # 1 or more any character, including newline, not greedy
(                       # group 1
LOCATION                # literally
\s+                     # 1 or more whitespace characters (including line breaks)
.+                      # 1 or more any character but newline
\R                      # any kind of linebreak
)                       # end group
[\s\S]*?                # 0 or more any character, including newline, not greedy
TBLPROPERTIES           # literally
[^)]+?                  # 1 or more any character that is not closing parenthesis
\);                     # closing parenthesis and semicolon

Replacement:

STORED AS PARQUET 
\n
$1
TBLPROPERTIES\('parquet.compression'='SNAPPY'\);
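
For anyone who would rather script this than click through Notepad++, the same single-pass pattern can be sketched in Python. This is only an illustration under assumptions that are not part of the original answer: Python's re module has no \R, so the line break is spelled out as (?:\r\n|\n|\r); the names simplify_ddl and SAMPLE_DDL are hypothetical; the sample is a shortened copy of the DDL from the question; and, unlike in Notepad++, parentheses in the replacement string need no escaping.

```python
import re

# Same pattern as the Notepad++ "Find what" field, except that \R
# (unsupported by Python's re module) is written out explicitly.
PATTERN = re.compile(
    r"ROW FORMAT SERDE[\s\S]+?"                # lazily skip up to LOCATION
    r"(LOCATION\s+.+(?:\r\n|\n|\r))"           # group 1: LOCATION plus its path line
    r"[\s\S]*?TBLPROPERTIES[^)]+?\);"          # the whole TBLPROPERTIES (...) block
)

# \g<1> re-inserts the captured LOCATION block unchanged.
REPLACEMENT = "STORED AS PARQUET \n\\g<1>TBLPROPERTIES('parquet.compression'='SNAPPY');"

# Shortened, hypothetical sample based on the DDL in the question.
SAMPLE_DDL = r"""CREATE EXTERNAL TABLE MYSCHEMA.MyTABLE(
  `col1` string,
  `col2` string)
PARTITIONED BY (
  `data_as_of_date` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'input.regex'='^(.*?)~}\\|(.*?)~}\\|(.*?)$')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '/mnt/data/schema/layer/domain/MYTABLE'
TBLPROPERTIES (
  'DO_NOT_UPDATE_STATS'='true',
  'numRows'='23483974',
  'parquet.compression'='SNAPPY',
  'transient_lastDdlTime'='1608243340');
"""


def simplify_ddl(text: str) -> str:
    """Rewrite every matching storage clause in text, keeping each LOCATION."""
    return PATTERN.sub(REPLACEMENT, text)


print(simplify_ddl(SAMPLE_DDL))
```

As with the Notepad++ version, group 1 keeps LOCATION and its path on two separate lines, exactly as they appear in the input; the pattern does not join them onto one line.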

I was able to do the same using two separate find-and-replace passes; I didn't know it would be so simple in two steps. Both passes use Regular expression mode with ". matches newline" checked, so that .*? can cross line breaks.

  1. Replace: ROW FORMAT SERDE.*?LOCATION with: STORED AS PARQUET\r\nLOCATION

  2. Replace: TBLPROPERTIES.*?\) with: TBLPROPERTIES \(\r\n 'parquet.compression'='SNAPPY'\)
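
The two passes above can be sketched in Python as follows. This is a rough equivalent under stated assumptions, not the poster's exact procedure: re.DOTALL stands in for Notepad++'s ". matches newline" option, and the function name simplify_in_two_passes and its eol parameter are hypothetical, added so the line ending can match the input file.

```python
import re


def simplify_in_two_passes(ddl: str, eol: str = "\r\n") -> str:
    """Apply the two replacements one after the other.

    eol is the line ending written into the replacement text
    (hypothetical parameter; the original post hard-codes \\r\\n).
    """
    # Pass 1: drop everything from ROW FORMAT SERDE up to the next LOCATION.
    ddl = re.sub(
        r"ROW FORMAT SERDE.*?LOCATION",
        "STORED AS PARQUET" + eol + "LOCATION",
        ddl,
        flags=re.DOTALL,  # let . match line breaks
    )
    # Pass 2: replace the TBLPROPERTIES block up to its first closing parenthesis.
    ddl = re.sub(
        r"TBLPROPERTIES.*?\)",
        "TBLPROPERTIES (" + eol + " 'parquet.compression'='SNAPPY')",
        ddl,
        flags=re.DOTALL,
    )
    return ddl
```

Because the second pattern stops at the first closing parenthesis, it assumes that no property value inside TBLPROPERTIES contains a ")" of its own.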

I was having a tough time doing this with a single regex. Anyone?
