Python - Pandas - Remove content between the first occurrence of a character and a fixed string
Imagine I have this DataFrame:
import pandas as pd
data = {'Script': ["create table table_name ( col_1 string , col_2 string , col_3 string ) row format serde 'org.apache.hadoop.hive.serde2.lazy.lazysimpleserde' with properties ( 'field.delim' ='\t' , 'serialization.format' ='\t' , 'serialization.null.format'='' ) stored as inputformat 'org.apache.hadoop.mapred.textinputformat' outputformat 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' location 'hdfs://nameservice1/table_name'tblproperties ( 'parquet.compress'='snappy' );"]}
df = pd.DataFrame(data)
Basically, the content of the column is a DDL statement:
create table table_name
(
col_1 string
, col_2 string
, col_3 string
)
row format serde 'org.apache.hadoop.hive.serde2.lazy.lazysimpleserde' with properties
(
'field.delim' ='\t'
, 'serialization.format' ='\t'
, 'serialization.null.format'=''
)
stored as inputformat 'org.apache.hadoop.mapred.textinputformat' outputformat 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' location 'hdfs://nameservice1/table_name'tblproperties
(
'parquet.compress'='snappy'
)
What I need to do is remove everything between the first ")" and the word "location". Basically, my expected output is the following:
create table table_name
(
col_1 string
, col_2 string
, col_3 string
)
location 'hdfs://nameservice1/table_name'tblproperties
(
'parquet.compress'='snappy'
)
To do this, I am trying a regex approach:
df['DDL'] = df.Script.str.replace(r")", " } ").str.replace(r'<}^>location+>', "")
However, the result is not what I wanted:
create table table_name
(
col_1 string
, col_2 string
, col_3 string
}
row format serde 'org.apache.hadoop.hive.serde2.lazy.lazysimpleserde' with properties
(
'field.delim' ='\t'
, 'serialization.format' ='\t'
, 'serialization.null.format'='' } stored as inputformat 'org.apache.hadoop.mapred.textinputformat' outputformat 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' location 'hdfs://nameservice1/table_name'tblproperties ( 'parquet.compress'='snappy' }
;
What am I doing wrong? With my approach, I was trying to extract what lies between { and location...
You can use
df['DDL'] = df['Script'].str.replace(r"(?s)^([^)]*)\).*?\b(location)\b", r"\1\2", regex=True)
See the regex demo.
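Details
(?s) - an inline re.DOTALL option that makes . match line breaks
^ - start of the string
([^)]*) - Group 1 (\1 in the replacement pattern): any 0+ characters other than )
\) - a ) character
.*? - any 0+ characters, as few as possible (*? is a non-greedy quantifier)
\b(location)\b - Group 2 (\2 in the replacement pattern), capturing the whole word location (\b stands for a word boundary)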
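To see what the pattern does without pandas, here is a minimal sketch using the standard re module on a shortened stand-in for the DDL (the string and the function names are made up for illustration). Note that in the pattern as written, \) sits outside Group 1, so the first ")" is consumed along with the removed text. The second function is a hypothetical variant, not part of the answer above, that moves \) inside Group 1 and adds a space in the replacement so the closing parenthesis of the column list is kept, as in the expected output of the question.

```python
import re

# Shortened stand-in for the DDL from the question (assumed for brevity).
ddl = (
    "create table table_name ( col_1 string , col_2 string ) "
    "row format serde 'x' with properties ( 'serialization.null.format'='' ) "
    "stored as inputformat 'y' "
    "location 'hdfs://nameservice1/table_name'tblproperties "
    "( 'parquet.compress'='snappy' );"
)

# Pattern from the answer: Group 1 stops before the first ")",
# and that ")" is matched outside any group, so it is dropped.
ANSWER_PATTERN = re.compile(r"(?s)^([^)]*)\).*?\b(location)\b")

# Hypothetical variant: the first ")" is captured inside Group 1.
KEEP_PAREN_PATTERN = re.compile(r"(?s)^([^)]*\)).*?\b(location)\b")


def strip_with_answer_pattern(text: str) -> str:
    """Remove the first ')' and everything up to the word 'location'."""
    return ANSWER_PATTERN.sub(r"\1\2", text, count=1)


def strip_keeping_paren(text: str) -> str:
    """Keep the first ')' and remove only what follows it up to 'location'."""
    return KEEP_PAREN_PATTERN.sub(r"\1 \2", text, count=1)


answer_result = strip_with_answer_pattern(ddl)
keep_paren_result = strip_keeping_paren(ddl)

print(answer_result)
print(keep_paren_result)
```

With pandas, the variant would be passed the same way as the answer's pattern, for example df['Script'].str.replace(r"(?s)^([^)]*\)).*?\b(location)\b", r"\1 \2", regex=True).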