Python regex to extract the src content of HTML tags?

Question

I tried something like this, but it failed. I don't know regex; can anyone help me with this?

import re

html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""
x = re.search('^src=".*jpg$', html)
print(x)

I'm expecting output like this: ['/pic/earth.jpg','/pic/redrose.jpg']

Answer 1

Good start, but your code has several minor issues:

^ and $ refer to the start and end of the string
- or the start and end of each line when the re.MULTILINE flag is enabled
.search() returns None or a single Match object rather than the matched strings
you probably want the .findall() method
if you have backslashes in your regex (which you don't yet), you may want to use raw r"string" strings for your regex code
also think of all the possible variations in your input data, such as HTML allowing both ' and " for quotes, and that there could be a src= attribute in something that is not an image
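
The first few points can be checked directly. The following is a minimal sketch that reuses the question's sample HTML; the variable names (anchored, per_line, first, every) are only illustrative.

```python
import re

html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""

# ^ and $ anchor to the whole string here, and the string does not
# start with 'src=', so search() finds nothing and returns None.
anchored = re.search(r'^src=".*jpg$', html)
print(anchored)  # None

# With re.MULTILINE the anchors apply per line, but no line starts
# with 'src=' and ends with 'jpg', so this is None as well.
per_line = re.search(r'^src=".*jpg$', html, re.MULTILINE)
print(per_line)  # None

# Without anchors, search() returns a Match object for the first hit only.
first = re.search(r'src="[^"]*jpg', html)
print(first.group(0))  # src="/pic/earth.jpg

# findall() returns every match as a list of strings.
every = re.findall(r'src="[^"]*jpg', html)
print(every)  # ['src="/pic/earth.jpg', 'src="/pic/redrose.jpg']
```

Note that the last result still contains the src=" prefix; a capture group is needed to return only the path.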

Here are the docs: https://docs.python.org/3/library/re.html#re.findall

Try this as a regex:

image_urls = re.findall(r'<img[^<>]+src=["\']([^"\'<>]+\.(?:gif|png|jpe?g))["\']', html, re.I)
print(image_urls)
# prints: ['/pic/earth.jpg', '/pic/redrose.jpg']

To break this down a little:

re.findall() returns a list of strings
<img we want the match to start inside an image tag
[^<>]+ 1 or more chars that don't open/close the html tag
- there might not be a src="" attribute in the current <img>
["\'] the HTML could use either type of quote
[^"\'<>]+ keep reading 1+ chars while the string and the tag are not closed
\. literal dots need to be escaped, else they mean the "match any character" special char
(?:gif|png|jpe?g) a set of possible file extensions, in a non-capturing group (a capturing group here would also be returned in your list)
([^"\'<>]+\.(?:gif|png|jpe?g)) this is the capture group for what actually gets returned for each match
["\'] the closing quote, which comes right after the capture group
re.I makes the regex case insensitive
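
The breakdown above can also be written into the pattern itself. This is a sketch of the same regex using the re.VERBOSE flag, which ignores whitespace in the pattern and allows inline comments; the name IMG_SRC and the sample input are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each piece is labelled.
IMG_SRC = re.compile(
    r"""
    <img                      # start inside an img tag
    [^<>]+                    # 1+ chars that stay inside the tag
    src=["']                  # the attribute and its opening quote
    (                         # capture group: what findall() returns
      [^"'<>]+                #   the path, up to a quote or tag boundary
      \.(?:gif|png|jpe?g)     #   a literal dot and a known extension
    )
    ["']                      # the closing quote
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical input covering the cases mentioned in the answer:
# single quotes, upper case, a non-image src, and an img without src.
sample = """
<IMG SRC='/pic/Moon.PNG' alt="moon">
<img class="hero" src="/pic/earth.jpeg">
<script src="/js/app.js"></script>
<img alt="no source here">
"""

found = IMG_SRC.findall(sample)
print(found)  # ['/pic/Moon.PNG', '/pic/earth.jpeg']
```

The script tag is skipped because the match must begin with <img, and the last tag is skipped because it has no src attribute.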

Answer 2

I'm not good at regex, so my answer may not be the best.

Try this.

x = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)

Then x looks like this:

['/pic/earth.jpg', '/pic/redrose.jpg']

RegEx explanation:

(?=src): positive lookahead --> only matches at positions where the text src follows

src=\": must match the literal text src="

(?P<src>something): groups something and names the group src

[^\"]+: one or more characters, none of which is "
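
As a hedged follow-up sketch (not part of the original answer): the lookahead only checks for the same text that the pattern matches immediately afterwards, so dropping it should give the same result on this input. The named group can also be read by name when iterating with re.finditer(). The variable names and the shortened sample HTML are only illustrative.

```python
import re

html = '''
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
'''

# With a single group in the pattern, findall() returns that group's text.
with_lookahead = re.findall(r'(?=src)src="(?P<src>[^"]+)', html)
without_lookahead = re.findall(r'src="(?P<src>[^"]+)', html)
print(with_lookahead == without_lookahead)  # True

# finditer() yields Match objects, so the group can be read by its name.
by_name = [m.group("src") for m in re.finditer(r'src="(?P<src>[^"]+)', html)]
print(by_name)  # ['/pic/earth.jpg', '/pic/redrose.jpg']
```

Unlike the pattern in Answer 1, this one accepts any src value in double quotes, whatever the tag or file extension.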

Solution 1 (score 2), 2020-06-04 10:43:45

Solution 2 (score 1, accepted), 2020-06-04 10:35:42