Extracting strings from a Pandas df using regular expressions

Question

I need help with a regular expression for a Python Pandas DataFrame. The test strings would be:

s = pd.Series(['xslF345X03/was-form4_163347386959085.xml', 'xslF345X03/wf-form4_163347386959085.xmlasdf', 'xslF345/X03/wf-form4_163347386959085.xml'])

I want to extract so that I get something like this:

xslF345X03/was-form4_163347386959085.xml      Extract: /was-form4_163347386959085.xml
xslF345X03/wf-form4_163347386959085.xmlasdf   Do not extract because the ending is not .xml
xslF345/X03/wf-form4_163347386959085.xml      Extract starting from the last '/' character: /wf-form4_163347386959085.xml

I think I need Pandas code along these lines to extract with a regex:

s.str.extract(...)

Thanks in advance :-)

Answer 1

Use str.extract:

>>> s.str.extract(r'.*/(.*\.xml)$')
                               0
0  was-form4_163347386959085.xml
1                            NaN
2   wf-form4_163347386959085.xml
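For reference, here is a self-contained sketch of the call above that can be run as a script. It assumes pandas is imported as pd; the variable name `extracted` and the `expand=False` argument are additions for illustration, not part of the original answer. The greedy `.*/` consumes everything up to the last slash, so the group captures only the file name, without the leading slash.

```python
import pandas as pd

s = pd.Series([
    'xslF345X03/was-form4_163347386959085.xml',
    'xslF345X03/wf-form4_163347386959085.xmlasdf',
    'xslF345/X03/wf-form4_163347386959085.xml',
])

# expand=False returns a Series rather than a one-column DataFrame
# (this argument is an illustrative addition).
extracted = s.str.extract(r'.*/(.*\.xml)$', expand=False)
print(extracted)
```

Rows that do not end in .xml come back as missing values, so they can be dropped afterwards with `extracted.dropna()` if only the matches are needed.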

Answer 2

To extract from the last '/' character (including the /) through the .xml ending, use str.extract() as follows:

s.str.extract(r'(/(?!.*/).*\.xml)$')

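Regex demo

Regex details:

( - start of the capture group for str.extract()

/ - matches the / character literally

(?!.*/) - negative lookahead asserting that no further / appears after this point (i.e. this / is the last one)

.* - matches zero or more characters

\. - matches a dot literally (escaped so it is not treated as the regex metacharacter)

xml - matches the string xml literally

) - end of the capture group for str.extract()

$ - asserts the end of the string (ensuring .xml is at the very end)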

Result:

                                0
0  /was-form4_163347386959085.xml
1                             NaN
2   /wf-form4_163347386959085.xml
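The breakdown above can also be checked outside pandas with Python's standard re module. This is only a minimal sketch: the names `pattern`, `strings` and `results` are illustrative, and `None` stands in for the NaN that pandas would show.

```python
import re

# Same pattern as in the answer: the last '/' through a trailing '.xml'.
pattern = re.compile(r'(/(?!.*/).*\.xml)$')

strings = [
    'xslF345X03/was-form4_163347386959085.xml',
    'xslF345X03/wf-form4_163347386959085.xmlasdf',
    'xslF345/X03/wf-form4_163347386959085.xml',
]

results = []
for text in strings:
    match = pattern.search(text)
    # Keep the captured group when the pattern matches, otherwise None.
    results.append(match.group(1) if match else None)

print(results)
```

In the third string the first / is rejected by the lookahead because another / follows it, so the match starts at the second one.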

Answer 3

You can check with str.endswith and then pass the result to np.where:

np.where(s.str.endswith('.xml'),s.str.rsplit('/',n=1).str[-1],np.nan)
Out[99]: 
array(['was-form4_163347386959085.xml', nan,
       'wf-form4_163347386959085.xml'], dtype=object)
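Note that np.where returns a NumPy array, so the original index is lost. If a Series is preferred, one possible variant (a swapped-in technique, not what the answer itself uses) is Series.where, which keeps the index and fills non-matching rows with a missing value. The names `mask`, `last_part` and `result` are illustrative.

```python
import pandas as pd

s = pd.Series([
    'xslF345X03/was-form4_163347386959085.xml',
    'xslF345X03/wf-form4_163347386959085.xmlasdf',
    'xslF345/X03/wf-form4_163347386959085.xml',
])

# True where the string ends with '.xml'
mask = s.str.endswith('.xml')

# Split once from the right and keep the part after the last '/'
last_part = s.str.rsplit('/', n=1).str[-1]

# Keep values where the mask is True; other rows become missing
result = last_part.where(mask)
print(result)
```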

Solution 1
1 2021-10-18 19:35:44

Solution 2
1 Accepted 2021-10-18 20:13:56

Solution 3
0 2021-10-18 19:29:23
