I have extracted the DDL of many Hive tables using the SHOW CREATE TABLE command.
The output looks like this:
CREATE EXTERNAL TABLE MYSCHEMA.MyTABLE(
`col1` string,
`col2` string)
PARTITIONED BY (
`data_as_of_date` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'input.regex'='^(.*?)~}\\|(.*?)~}\\|(.*?)$')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'/mnt/data/schema/layer/domain/MYTABLE'
TBLPROPERTIES (
'DO_NOT_UPDATE_STATS'='true',
'STATS_GENERATED_VIA_STATS_TASK'='true',
'last_modified_by'='user',
'last_modified_time'='1603077305',
'numRows'='23483974',
'parquet.compression'='SNAPPY',
'transient_lastDdlTime'='1608243340');
I want to replace the text between...
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'input.regex'='^(.*?)~}\\|(.*?)~}\\|(.*?)$')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'/mnt/data/schema/layer/domain/MYTABLE'
TBLPROPERTIES (
'DO_NOT_UPDATE_STATS'='true',
'STATS_GENERATED_VIA_STATS_TASK'='true',
'last_modified_by'='user',
'last_modified_time'='1603077305',
'numRows'='23483974',
'parquet.compression'='SNAPPY',
'transient_lastDdlTime'='1608243340');
...to...
STORED AS PARQUET
LOCATION '/mnt/data/schema/layer/domain/MYTABLE'
TBLPROPERTIES('parquet.compression'='SNAPPY');
...using Notepad++.
Note that the LOCATION value must be kept from the original, while everything else is replaced as shown above. In other words, the replacement spans multiple lines and I also need to retain part of the matched text. Could someone suggest a regex that I can use in Notepad++ (v7.8.2)?
The final result should look like this:
CREATE EXTERNAL TABLE MYSCHEMA.MyTABLE(
`col1` string,
`col2` string)
PARTITIONED BY (
`data_as_of_date` string)
STORED AS PARQUET
LOCATION '/mnt/data/schema/layer/domain/MYTABLE'
TBLPROPERTIES('parquet.compression'='SNAPPY');
There are many tables, and each table has a different LOCATION value, so the LOCATION must not be overwritten by the replacement.
It is also fine if this is done in two parts (if it cannot be done with a single regex): first replacing everything above LOCATION, then replacing the TBLPROPERTIES.
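Find what: ROW FORMAT SERDE[\s\S]+?(LOCATION\s+.+\R)[\s\S]*?TBLPROPERTIES[^)]+?\);
Replace with: STORED AS PARQUET\n$1TBLPROPERTIES\('parquet.compression'='SNAPPY'\);
Search mode: Regular expression
Uncheck ". matches newline" (the .+ after LOCATION must stop at the end of the line)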
Explanation:
ROW FORMAT SERDE # literally
[\s\S]+? # 1 or more any character, including newline, not greedy
( # group 1
LOCATION # literally
\s+ # 1 or more whitespace characters, including line breaks
.+ # 1 or more any character but newline
\R # any kind of linebreak
) # end group
[\s\S]*? # 0 or more any character, including newline, not greedy
TBLPROPERTIES # literally
[^)]+? # 1 or more any character that is not closing parenthesis
\); # closing parenthesis and semicolon
Replacement:
STORED AS PARQUET
\n
$1
TBLPROPERTIES\('parquet.compression'='SNAPPY'\);
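For anyone who would rather script this than use the Notepad++ dialog, here is a minimal Python sketch of the same single-pass replacement. It assumes the DDL is available as a string. The names `PATTERN`, `REPLACEMENT` and `simplify_ddl` are made up for illustration. Python's `re` module does not support `\R`, so the line break is spelled out explicitly.

```python
import re

# Same idea as the Notepad++ pattern. \R is not available in Python's re,
# so (?:\r\n|\n|\r) stands in for "any kind of line break".
PATTERN = re.compile(
    r"ROW FORMAT SERDE[\s\S]+?"          # everything up to the first LOCATION
    r"(LOCATION\s+.+(?:\r\n|\n|\r))"     # group 1: LOCATION keyword and its path line
    r"[\s\S]*?TBLPROPERTIES[^)]+?\);"    # the whole TBLPROPERTIES (...) clause
)

# \g<1> re-inserts the captured LOCATION text unchanged.
REPLACEMENT = "STORED AS PARQUET\n\\g<1>TBLPROPERTIES('parquet.compression'='SNAPPY');"


def simplify_ddl(ddl: str) -> str:
    """Replace every SERDE ... TBLPROPERTIES block, keeping each LOCATION."""
    return PATTERN.sub(REPLACEMENT, ddl)
```

As with the Notepad++ version, the LOCATION keyword and its path stay on separate lines if that is how they appear in the input, because group 1 is copied through as-is.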
I was able to do the same using two separate find-and-replace regexes. I didn't know it would be so simple when done in two passes. Both passes need ". matches newline" checked, because .*? has to cross line breaks.
Replace: ROW FORMAT SERDE.*?LOCATION
with: STORED AS PARQUET\r\nLOCATION
Replace: TBLPROPERTIES.*?\)
with: TBLPROPERTIES \(\r\n 'parquet.compression'='SNAPPY'\)
I had a hard time doing this with a single regex. Anyone?
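The two-pass replacement above can also be sketched in Python, in case the files need to be processed outside Notepad++. This is only an illustration under assumptions: `simplify_ddl_two_steps` is a hypothetical name, `re.DOTALL` plays the role of the ". matches newline" checkbox, and `\n` is used instead of `\r\n` for the inserted line breaks.

```python
import re


def simplify_ddl_two_steps(ddl: str) -> str:
    """Apply the two replacements one after the other."""
    # Pass 1: drop everything from ROW FORMAT SERDE up to the LOCATION keyword.
    ddl = re.sub(
        r"ROW FORMAT SERDE.*?LOCATION",
        "STORED AS PARQUET\nLOCATION",
        ddl,
        flags=re.DOTALL,  # lets .*? match across line breaks
    )
    # Pass 2: rewrite the TBLPROPERTIES clause up to its first closing parenthesis.
    ddl = re.sub(
        r"TBLPROPERTIES.*?\)",
        "TBLPROPERTIES (\n 'parquet.compression'='SNAPPY')",
        ddl,
        flags=re.DOTALL,
    )
    return ddl
```

Pass 2 stops at the first `)` after TBLPROPERTIES, so it relies on no property value containing a closing parenthesis, which holds for the sample DDL in the question.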