Python Regex to extract content of src of an html tag?

I tried something like this, but it failed. I don't know regex; can anyone help me with this?

import re

html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""
x = re.search('^src=".*jpg$', html)
print(x)

I'm expecting output like this: ['/pic/earth.jpg', '/pic/redrose.jpg']

Good first start, but you have several minor issues with your code:

  • ^ and $ refer to the start and end of the string
    • or to the start and end of each line when the re.MULTILINE flag is enabled
  • .search() returns None or a single Match object rather than the matched strings
  • you probably want the .findall() method
  • if you have backslashes in your regex (which you don't yet), then you may want to use raw r"string" strings for your regex code
  • also think of all the possible permutations of what could be in your input data, such as HTML allowing both ' and " for quotes, and that there could be a src= attribute in something that is not an image
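The first few points can be checked directly. The following is a minimal sketch that reuses the question's HTML (shortened); the unanchored pattern src="[^"]*jpg is only an illustrative stand-in, not the final regex suggested in this answer.

```python
import re

html = """
<body>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""

# ^ and $ anchor to the start and end of the whole string, so nothing matches.
anchored = re.search(r'^src=".*jpg$', html)
print(anchored)  # None

# With re.MULTILINE they anchor to each line, but no line starts with src=.
anchored_multiline = re.search(r'^src=".*jpg$', html, re.MULTILINE)
print(anchored_multiline)  # None

# Without anchors, search() returns a Match object for the first hit only.
first = re.search(r'src="[^"]*jpg', html)
print(first.group(0))  # src="/pic/earth.jpg

# findall() returns every match as a list of strings.
every = re.findall(r'src="[^"]*jpg', html)
print(every)  # ['src="/pic/earth.jpg', 'src="/pic/redrose.jpg']
```

Note that the matches still include the src=" prefix; a capture group is what trims the result down to the path alone.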

Here are the docs: https://docs.python.org/3/library/re.html#re.findall

Try this as a regex:

image_urls = re.findall(r'<img[^<>]+src=["\']([^"\'<>]+\.(?:gif|png|jpe?g))["\']', html, re.I)
print(image_urls)
>>> ['/pic/earth.jpg', '/pic/redrose.jpg']

To break this down a little:

  • re.findall() returns a list of strings
  • <img we are looking to start in an image tag
  • [^<>]+ 1 or more chars that don't open/close the html tag
    • there might not be a src="" attribute in the current <img>, and this stops the match from running on into a later tag
  • ["\'] the HTML could use either type of quote
  • [^"\'<>]+ keep reading 1+ chars whilst the string and the tag are not closed
  • \. literal dots need to be escaped, else they mean the "match anything" special char
  • (?:gif|png|jpe?g) a range of possible file extensions, but don't create a capture bracket for them (which would return these in your list)
  • ([^"\'<>]+\.(?:gif|png|jpe?g)) this is the capture bracket for what will actually get returned for each match
  • ["\'] search for the closing quote after the capture bracket
  • re.I make the regex case insensitive
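The same breakdown can be written into the pattern itself with re.VERBOSE, which ignores whitespace and allows comments inside the regex. This is a sketch of the pattern above, not a different technique; the name IMG_SRC and the sample HTML are made up for illustration.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each part is labelled.
IMG_SRC = re.compile(r"""
    <img                      # start inside an image tag
    [^<>]+                    # 1+ chars that stay inside the tag
    src=["']                  # the attribute and its opening quote
    (                         # capture group: what findall returns
        [^"'<>]+              # the path, up to a quote or tag boundary
        \.(?:gif|png|jpe?g)   # literal dot plus a non-capturing extension
    )
    ["']                      # closing quote
""", re.IGNORECASE | re.VERBOSE)

# Hypothetical input covering single quotes, upper case, a missing src,
# and a src attribute on a tag that is not an image.
sample = """
<IMG SRC='/pic/Moon.PNG' alt="moon">
<img alt="planet" src="/pic/earth.jpeg">
<img alt="no source here">
<script src="/js/app.js"></script>
"""

image_urls = IMG_SRC.findall(sample)
print(image_urls)  # ['/pic/Moon.PNG', '/pic/earth.jpeg']
```

The <img> without a src and the <script> tag are both skipped, because [^<>]+ cannot cross a tag boundary and the pattern only starts at <img.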

I'm not good at regex, so my answer may not be the best.

Try this:

x = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)

Then x looks like this:

['/pic/earth.jpg', '/pic/redrose.jpg']

RegEx explanation:

(?=src): positive lookahead --> only matches at positions that are followed by the word src

src=\": the match must include this literal text, src="

(?P<src>something): a named group that captures something under the name src

[^\"]+: one or more characters, anything except the " character
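A short sketch of how these pieces behave, reusing the question's HTML (shortened). It suggests the lookahead is redundant here, since the literal src= that follows it already requires the same text, and it shows reading the named group through re.finditer(). The extra <script> line is a made-up input to illustrate that this pattern is not tied to <img> tags.

```python
import re

html = """
<body>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""

# The lookahead only asserts what the literal src= then consumes anyway.
with_lookahead = re.findall(r'(?=src)src="(?P<src>[^"]+)', html)
without_lookahead = re.findall(r'src="(?P<src>[^"]+)', html)
print(with_lookahead == without_lookahead)  # True

# The group name is used when working with Match objects.
named = [m.group("src") for m in re.finditer(r'src="(?P<src>[^"]+)', html)]
print(named)  # ['/pic/earth.jpg', '/pic/redrose.jpg']

# Hypothetical extra tag: any src="..." attribute matches, not only images.
extra = '<script src="/js/app.js"></script>' + html
all_srcs = re.findall(r'src="(?P<src>[^"]+)', extra)
print(all_srcs)  # ['/js/app.js', '/pic/earth.jpg', '/pic/redrose.jpg']
```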
