Python Regex to extract content of src of an html tag?
I tried something like this, but it failed. I don't know regex; can anyone help me with this?
import re
html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""
x = re.search('^src=".*jpg$', html)
print(x)
I'm expecting output like this: ['/pic/earth.jpg','/pic/redrose.jpg']
Good start, but you have several minor issues with your code:
- ^ and $ refer to the start and end of the whole string, not of each tag or attribute
- .search() returns None or a single Match object rather than the matched strings
- to get every match as a list of strings, use the .findall() method
- use r"string" raw strings for your regex code
- HTML can use either ' or " for quotes, and there could be a src= attribute in something that is not an image
Here are the docs: https://docs.python.org/3/library/re.html#re.findall
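The first three points can be checked directly against the HTML from the question. This is only a minimal sketch: the simplified pattern src="[^"]*" is an assumption used for illustration, not the final regex.

```python
import re

html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""

# The original pattern: ^ anchors at the very start of the string and $ at
# the very end, so it cannot match src="..." in the middle of the document.
anchored = re.search(r'^src=".*jpg$', html)
print(anchored)  # None

# Without the anchors, search() succeeds, but it still returns a single
# Match object for the first hit only, not a list of strings.
first = re.search(r'src="[^"]*"', html)
print(first.group(0))  # src="/pic/earth.jpg"

# findall() returns every non-overlapping match as a list of strings.
every = re.findall(r'src="[^"]*"', html)
print(every)  # ['src="/pic/earth.jpg"', 'src="/pic/redrose.jpg"']
```

Note that findall() here still returns the whole src="..." text, because the simplified pattern has no capture group.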
Try this as a regex:
image_urls = re.findall(r'<img[^<>]+src=["\']([^"\'<>]+\.(?:gif|png|jpe?g))["\']', html, re.I)
print(image_urls)
>>> ['/pic/earth.jpg', '/pic/redrose.jpg']
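To see how the same pattern copes with less tidy markup, here is a small sketch. The messy_html sample is hypothetical (it is not from the question) and was chosen to exercise single quotes, upper-case names, a non-image src, and an image with no src at all.

```python
import re

# Hypothetical test markup, not from the question: mixed quote styles,
# an upper-case tag and extension, a <script src=...>, and an <img>
# without any src attribute.
messy_html = """
<script src="/js/app.js"></script>
<IMG alt='logo' SRC='/pic/Logo.PNG'>
<img alt="no source here">
<img class="hero" src="/pic/earth.jpeg" width="200">
"""

pattern = r'<img[^<>]+src=["\']([^"\'<>]+\.(?:gif|png|jpe?g))["\']'
image_urls = re.findall(pattern, messy_html, re.I)

# The script is skipped (not an <img>), and the image without src is skipped.
print(image_urls)  # ['/pic/Logo.PNG', '/pic/earth.jpeg']
```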
To break this down a little:
- re.findall(): returns a list of strings
- <img: we are looking to start in an image tag
- [^<>]+: 1 or more chars that don't open/close the HTML tag (there might not be a src="" attribute in the current <img>)
- src=["\']: the HTML could use either type of quote
- [^"\'<>]+: keep reading 1+ chars whilst the string and the tag are not closed
- \.: literal dots need to be escaped, else they mean the "match anything" special char
- (?:gif|png|jpe?g): a range of possible file extensions, but don't create a capture group for them (which would return these in your list)
- ([^"\'<>]+\.(?:gif|png|jpe?g)): this is the capture group for what will actually get returned for each match
- ["\']: search for the closing quote to end the capture group
- re.I: make the regex case insensitive
I'm not good at regex, so my answer may not be the best.
Try this.
x = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)
Then you can see that x looks like this:
['/pic/earth.jpg', '/pic/redrose.jpg']
Regex explanation:
(?=src): positive lookahead --> only match at positions where the word src follows
src=\": must include this specific text src="
(?P<src>something): a named group that captures something under the name src
[^\"]+: everything except the " character
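A short sketch of how the named group can be read back, and of what this pattern does and does not filter. The script line in the sample is a hypothetical addition, included only to show that this pattern is not limited to img tags.

```python
import re

# The <script> line is hypothetical, added to show that this pattern
# matches any src="..." attribute, not only those inside <img> tags.
html = """
<script src="/js/app.js"></script>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
"""

pattern = re.compile(r'(?=src)src="(?P<src>[^"]+)')

# With one capture group, findall() returns that group's contents.
all_srcs = pattern.findall(html)
print(all_srcs)  # ['/js/app.js', '/pic/earth.jpg', '/pic/redrose.jpg']

# finditer() yields Match objects, so the group can be read by name.
by_name = [match.group("src") for match in pattern.finditer(html)]

# The lookahead is redundant here: the literal src=" that follows it
# already requires the same text, so a plain group gives the same result.
simpler = re.findall(r'src="([^"]+)', html)
print(simpler == all_srcs)  # True
```

If only image URLs are wanted, the pattern would also need to require the <img prefix, as the first answer does.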