Pandas regex: extract url information from column

import pandas as pd    

d = {"Device_Type" : ["AXO145","TRU151","ZOD231","YRT326","LWR245"],
 "Stat_Access_Link" : ["<url>https://xcd32112.smart_meter.com</url>",
                       "<url>http://tXh67.dia_meter.com</url>",
                       "<url>https://yT5495.smart_meter.com</url>",
                       "<url>https://ret323_TRu.crown.com</url>",
                       "<url>https://luwr3243.celcius.com</url>"]}

df = pd.DataFrame(data = d)

I have a dataframe like this, and I need to extract the URL information from the tags using regex. The output has to look like this:

Device_Type  Stat_Access_Link
AXO145       xcd32112.smart_meter.com
TRU151       tXh67.dia_meter.com
ZOD231       yT5495.smart_meter.com
YRT326       ret323_TRu.crown.com
LWR245       luwr3243.celcius.com

Any help is appreciated.

Do you really need a regex?

If you always have <url>...</url>, use:

df['Stat_Access_Link'].str[5:-6]

Otherwise, you could use:

df['Stat_Access_Link'].str.extract(r'<url>(.*)</url>', expand=False)

# OR

df['Stat_Access_Link'].str.extract(r'<url>([^<>]*)</url>', expand=False)

Output:

0    https://xcd32112.smart_meter.com
1          http://tXh67.dia_meter.com
2      https://yT5495.smart_meter.com
3        https://ret323_TRu.crown.com
4        https://luwr3243.celcius.com
Name: Stat_Access_Link, dtype: object
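
The patterns above keep the scheme (https:// or http://), while the question asks for the host only. Here is a minimal sketch that skips the scheme with an optional non-capturing group, assuming it is always http or https. The name `hosts` and the two-row frame are only for illustration:

```python
import pandas as pd

# Two rows from the question's data are enough to show the idea
df = pd.DataFrame({
    "Device_Type": ["AXO145", "TRU151"],
    "Stat_Access_Link": [
        "<url>https://xcd32112.smart_meter.com</url>",
        "<url>http://tXh67.dia_meter.com</url>",
    ],
})

# (?:https?://)? matches an optional scheme without capturing it,
# so only the text between the scheme and </url> lands in the group
hosts = df["Stat_Access_Link"].str.extract(
    r"<url>(?:https?://)?([^<>]*)</url>", expand=False
)
print(hosts.tolist())
```

Because the scheme group is optional, a value with no scheme at all would still be extracted rather than turned into NaN.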

str.extract is what you need:

import re

import pandas as pd

d = {"Device_Type" : ["AXO145","TRU151","ZOD231","YRT326","LWR245"],
 "Stat_Access_Link" : ["<url>https://xcd32112.smart_meter.com</url>",
                       "<url>http://tXh67.dia_meter.com</url>",
                       "<url>https://yT5495.smart_meter.com</url>",
                       "<url>https://ret323_TRu.crown.com</url>",
                       "<url>https://luwr3243.celcius.com</url>"]}

df = pd.DataFrame(d)
pattern = re.compile(r"(?<=://)(.*)(?=</url)")
df['Stat_Access_Link'] = df['Stat_Access_Link'].str.extract(pattern, expand=False)
print(df)

Output:

  Device_Type          Stat_Access_Link
0      AXO145  xcd32112.smart_meter.com
1      TRU151       tXh67.dia_meter.com
2      ZOD231    yT5495.smart_meter.com
3      YRT326      ret323_TRu.crown.com
4      LWR245      luwr3243.celcius.com
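
To see what the lookarounds contribute, the same pattern can be tried with plain `re` on a single string. This is a small sketch, and `sample` is just an illustrative variable:

```python
import re

sample = "<url>https://xcd32112.smart_meter.com</url>"

# (?<=://) requires "://" immediately before the match without consuming it;
# (?=</url) requires "</url" immediately after, also without consuming it.
pattern = re.compile(r"(?<=://)(.*)(?=</url)")
match = pattern.search(sample)

print(match.group(0))  # whole match; the lookarounds are not part of it
print(match.group(1))  # the capture group, which is what str.extract returns
```

Since neither lookaround consumes text, the whole match and the capture group are the same string here. The capture group is still needed because `str.extract` only returns groups.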

In my opinion, you should consider using the pandas solution pd.Series.str.extract, since it is built into pandas.

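# capture everything between "//" and the last "<"; "/" needs no escaping in Python
reg = r'//([\s\S]*)<'
# expand=False returns a Series for a single capture group
df['matched'] = df['Stat_Access_Link'].str.extract(reg, expand=False)
print(df)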

Here is what the results look like:

   Device_Type  Stat_Access_Link                  matched
0  AXO145       https://xcd32112.smart_meter.com  xcd32112.smart_meter.com
1  TRU151       http://tXh67.dia_meter.com        tXh67.dia_meter.com
2  ZOD231       https://yT5495.smart_meter.com    yT5495.smart_meter.com
3  YRT326       https://ret323_TRu.crown.com      ret323_TRu.crown.com
4  LWR245       https://luwr3243.celcius.com      luwr3243.celcius.com
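
With a single capture group, `str.extract` returns a one-column DataFrame by default and a Series when `expand=False` is passed. Rows that do not match become NaN. A short sketch of both behaviours follows; the second row is a hypothetical value that is not in the question's data:

```python
import pandas as pd

links = pd.Series([
    "<url>https://xcd32112.smart_meter.com</url>",
    "no link here",  # hypothetical row without a URL
])

as_frame = links.str.extract(r"//([\s\S]*)<")                 # default expand=True
as_series = links.str.extract(r"//([\s\S]*)<", expand=False)  # one group, so a Series

print(type(as_frame).__name__, type(as_series).__name__)
print(as_series.isna().tolist())  # True marks rows where the pattern did not match
```

Checking `isna()` after extraction is a cheap way to spot rows whose links do not follow the expected format.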

Resources

  1. pandas extract: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html
  2. test your regex: https://regex101.com/
