Pandas regex: extract url information from column

import pandas as pd    

d = {"Device_Type" : ["AXO145","TRU151","ZOD231","YRT326","LWR245"],
 "Stat_Access_Link" : ["<url>https://xcd32112.smart_meter.com</url>",
                       "<url>http://tXh67.dia_meter.com</url>",
                       "<url>https://yT5495.smart_meter.com</url>",
                       "<url>https://ret323_TRu.crown.com</url>",
                       "<url>https://luwr3243.celcius.com</url>"]}

df = pd.DataFrame(data = d)

I have a dataframe like this, and I need to use a regex to extract the url information from inside the tags. The output must look like this:

Device_Type  Stat_Access_Link
AXO145       xcd32112.smart_meter.com
TRU151       tXh67.dia_meter.com
ZOD231       yT5495.smart_meter.com
YRT326       ret323_TRu.crown.com
LWR245       luwr3243.celcius.com

Any help is appreciated.

Do you really need a regex?

If every value is always wrapped in <url>...</url>, you can simply slice the tags off:

df['Stat_Access_Link'].str[5:-6]
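
The bounds 5 and -6 are just the lengths of "<url>" and "</url>". Here is a minimal sketch that derives them from the tags and checks the assumption that every value really is wrapped in both tags. The short sample Series and the variable names are illustrative, not part of the original answer:

```python
import pandas as pd

# Illustrative sample; assumes every value is wrapped in exactly <url>...</url>.
links = pd.Series([
    "<url>https://xcd32112.smart_meter.com</url>",
    "<url>http://tXh67.dia_meter.com</url>",
])

open_tag, close_tag = "<url>", "</url>"

# Derive the slice bounds from the tag lengths instead of hard-coding 5 and -6.
sliced = links.str[len(open_tag):-len(close_tag)]

# Slicing is only safe if every value starts and ends with the expected tags.
all_wrapped = (links.str.startswith(open_tag) & links.str.endswith(close_tag)).all()

print(sliced.tolist())
print(all_wrapped)
```

If the check fails for some rows, slicing would silently cut into the data, so a regex is the safer choice in that case.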

Otherwise, you can use:

df['Stat_Access_Link'].str.extract(r'<url>(.*)</url>', expand=False)

# OR

df['Stat_Access_Link'].str.extract(r'<url>([^<>]*)</url>', expand=False)

output:

0    https://xcd32112.smart_meter.com
1          http://tXh67.dia_meter.com
2      https://yT5495.smart_meter.com
3        https://ret323_TRu.crown.com
4        https://luwr3243.celcius.com
Name: Stat_Access_Link, dtype: object
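
The two patterns give the same result on the sample data, but they can differ when a cell holds more than one tag. A small sketch of the difference, using a made-up value with two tags that does not appear in the question's data:

```python
import pandas as pd

# Hypothetical cell containing two <url> tags (not in the original data).
s = pd.Series(["<url>https://a.example.com</url><url>https://b.example.com</url>"])

# Greedy .* runs from the first <url> all the way to the last </url>.
greedy = s.str.extract(r"<url>(.*)</url>", expand=False)

# [^<>]* cannot cross a tag boundary, so it stops at the first closing tag.
negated = s.str.extract(r"<url>([^<>]*)</url>", expand=False)

print(greedy[0])
print(negated[0])
```

If multiple tags per cell are possible, the negated character class is probably the safer default.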

str.extract is what you need:

import re
import pandas as pd

d = {"Device_Type" : ["AXO145","TRU151","ZOD231","YRT326","LWR245"],
 "Stat_Access_Link" : ["<url>https://xcd32112.smart_meter.com</url>",
                       "<url>http://tXh67.dia_meter.com</url>",
                       "<url>https://yT5495.smart_meter.com</url>",
                       "<url>https://ret323_TRu.crown.com</url>",
                       "<url>https://luwr3243.celcius.com</url>"]}

df = pd.DataFrame(d)
pattern = re.compile(r"(?<=://)(.*)(?=</url)")
df['Stat_Access_Link'] = df['Stat_Access_Link'].str.extract(pattern, expand=False)
print(df)

Output:

  Device_Type          Stat_Access_Link
0      AXO145  xcd32112.smart_meter.com
1      TRU151       tXh67.dia_meter.com
2      ZOD231    yT5495.smart_meter.com
3      YRT326      ret323_TRu.crown.com
4      LWR245      luwr3243.celcius.com
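
The lookbehind and lookahead are not strictly required here, because str.extract returns only the capture group and ignores the rest of the match. A sketch of an equivalent pattern without lookarounds, assuming every link starts with http:// or https:// (the shortened sample Series is illustrative):

```python
import pandas as pd

links = pd.Series([
    "<url>https://xcd32112.smart_meter.com</url>",
    "<url>http://tXh67.dia_meter.com</url>",
    "<url>https://ret323_TRu.crown.com</url>",
])

# Text outside the parentheses must match but is not returned.
hosts = links.str.extract(r"<url>https?://([^<>]*)</url>", expand=False)

# The lookaround version from the answer, for comparison.
lookaround = links.str.extract(r"(?<=://)(.*)(?=</url)", expand=False)

print(hosts.tolist())
print(hosts.equals(lookaround))
```

Anchoring on the full tag and scheme makes the pattern a bit stricter: rows that do not look like a tagged http(s) link come back as NaN instead of a partial match.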

In my opinion, you should consider the pandas solution pd.Series.str.extract, since it is built into pandas.

reg = r'//([\s\S]*)<'
df['matched'] = df['Stat_Access_Link'].str.extract(reg, expand=False)
print(df)

The result is as follows:

  Device_Type                             Stat_Access_Link                   matched
0      AXO145  <url>https://xcd32112.smart_meter.com</url>  xcd32112.smart_meter.com
1      TRU151        <url>http://tXh67.dia_meter.com</url>       tXh67.dia_meter.com
2      ZOD231    <url>https://yT5495.smart_meter.com</url>    yT5495.smart_meter.com
3      YRT326      <url>https://ret323_TRu.crown.com</url>      ret323_TRu.crown.com
4      LWR245      <url>https://luwr3243.celcius.com</url>      luwr3243.celcius.com
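
One detail worth knowing: with its default expand=True, str.extract returns a DataFrame even for a single capture group, while expand=False returns a Series. A small sketch of the difference, using one row from the question and illustrative variable names:

```python
import pandas as pd

links = pd.Series(["<url>https://luwr3243.celcius.com</url>"], name="Stat_Access_Link")
pattern = r"//([\s\S]*)<"

as_frame = links.str.extract(pattern)                 # expand=True is the default
as_series = links.str.extract(pattern, expand=False)  # one group -> Series

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series
print(as_series[0])
```

Assigning either result to a new column works for a single group, but the Series form is usually easier to chain with further .str methods.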

Resources

  1. pandas Series.str.extract: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html
  2. Test your regex: https://regex101.com/
