Python - Pandas - Remove the content between the first occurrence of a character and a fixed string

Question

Imagine I have this DataFrame:

import pandas as pd
data = {'Script': ["create table table_name ( col_1 string , col_2 string , col_3 string ) row format serde 'org.apache.hadoop.hive.serde2.lazy.lazysimpleserde' with properties ( 'field.delim' ='\t' , 'serialization.format' ='\t' , 'serialization.null.format'='' ) stored as inputformat 'org.apache.hadoop.mapred.textinputformat' outputformat 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' location 'hdfs://nameservice1/table_name'tblproperties ( 'parquet.compress'='snappy' );"]}
df = pd.DataFrame(data)

Basically, the content of the column is a DDL statement:

create table table_name
  (
    col_1 string
  , col_2 string
  , col_3 string
  )
  row format serde 'org.apache.hadoop.hive.serde2.lazy.lazysimpleserde' with properties
  (
    'field.delim'              ='\t'
  , 'serialization.format'     ='\t'
  , 'serialization.null.format'=''
  )
  stored as inputformat 'org.apache.hadoop.mapred.textinputformat' outputformat 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' location 'hdfs://nameservice1/table_name'tblproperties
  (
    'parquet.compress'='snappy'
  )

What I need to do is remove everything between the first ")" and the word "location". Basically, my expected output is the following:

create table table_name
  (
    col_1 string
  , col_2 string
  , col_3 string
  )
  location 'hdfs://nameservice1/table_name'tblproperties
  (
    'parquet.compress'='snappy'
  )

To do this, I am trying a regex approach:

df['DDL'] = df.Script.str.replace(r")", " } ").str.replace(r'<}^>location+>', "")

However, the result is not what I want:

create table table_name
  (
    col_1 string
  , col_2 string
  , col_3 string
  }
  row format serde 'org.apache.hadoop.hive.serde2.lazy.lazysimpleserde' with properties
  (
    'field.delim'              ='\t'
  , 'serialization.format'     ='\t'
  , 'serialization.null.format'='' } stored as inputformat 'org.apache.hadoop.mapred.textinputformat' outputformat 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat' location 'hdfs://nameservice1/table_name'tblproperties ( 'parquet.compress'='snappy' }
;

What am I doing wrong? With my approach, I am trying to extract what lies between } and location...

Answer 1

You can use

df['DDL'] = df['Script'].str.replace(r"(?s)^([^)]*)\).*?\b(location)\b", r"\1) \2", regex=True)

See the regex demo.
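The substitution can also be tried end to end on a small frame. The following is a minimal sketch, not the answerer's own demo: the shortened DDL string and its serde and format names are hypothetical stand-ins for the question's data. It assumes that the `)` consumed by `\)` should be written back in the replacement, since the expected output keeps it, and it passes `regex=True` explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Shortened, hypothetical DDL in the same shape as the question's Script column.
ddl = (
    "create table table_name ( col_1 string , col_2 string ) "
    "row format serde 'some.serde' with properties ( 'field.delim'='|' ) "
    "stored as inputformat 'some.inputformat' "
    "location 'hdfs://nameservice1/table_name'tblproperties "
    "( 'parquet.compress'='snappy' );"
)
df = pd.DataFrame({"Script": [ddl]})

# Group 1 stops before the first ")", which the pattern then consumes,
# so the replacement writes ")" back, followed by a space and Group 2.
pattern = r"(?s)^([^)]*)\).*?\b(location)\b"
df["DDL"] = df["Script"].str.replace(pattern, r"\1) \2", regex=True)

print(df.loc[0, "DDL"])
```

Everything from the first `)` up to the word location is dropped, while the column list and the trailing tblproperties clause are left untouched.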

Details

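(?s) - an inline re.DOTALL flag that makes . match line break characters
^ - start of the string
([^)]*) - Group 1 (\1 in the replacement pattern): any 0+ characters other than )
\) - a ) character
.*? - any 0+ characters, as few as possible (*? is a non-greedy quantifier)
\b(location)\b - Group 2 (\2 in the replacement pattern) capturing the whole word location (\b stands for a word boundary)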
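The role of each piece listed above can be seen by removing it in turn. This is a small sketch using the standard `re` module on a made-up string (the `text` value is an assumption chosen so that every variant behaves differently). The replacement `r"\1) \2"` writes the consumed `)` back, which is a choice of this sketch.

```python
import re

# Hypothetical input: a newline after the first ")", a decoy word
# "relocation", and two whole-word occurrences of "location".
text = "a (x) b\nrelocation one, location two, location three"
repl = r"\1) \2"

# Without (?s), "." cannot cross the newline, so there is no match
# and the string comes back unchanged.
no_dotall = re.sub(r"^([^)]*)\).*?\b(location)\b", repl, text)

# Full pattern: lazy ".*?" stops at the first whole word "location".
lazy = re.sub(r"(?s)^([^)]*)\).*?\b(location)\b", repl, text)

# Greedy ".*" runs on to the last whole word "location".
greedy = re.sub(r"(?s)^([^)]*)\).*\b(location)\b", repl, text)

# Without \b, the match ends inside "relocation".
no_boundary = re.sub(r"(?s)^([^)]*)\).*?(location)", repl, text)

for name, value in [("no_dotall", no_dotall), ("lazy", lazy),
                    ("greedy", greedy), ("no_boundary", no_boundary)]:
    print(f"{name}: {value!r}")
```

Only the full pattern removes exactly the span between the first `)` and the first standalone location.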

Solution 1
1 Accepted 2020-02-26 18:33:34
