I have a variable like the one below:
var = '<img src="path_1"><p>Words</p><img src="path_2>'
It's a string, but it obviously contains HTML elements. How do I get only the first path (i.e. path_1) using a regex?
I am trying something like this:
match = re.match(r'src=\"[\w-]+\"', var)
print match.group(0)
I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
Any help is appreciated.
You should use an HTML parser like BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> var = '<img src="path_1"><p>Words</p><img src="path_2>'
>>> soup = BeautifulSoup(var, "html.parser")
>>> soup.img["src"]
'path_1'
As for the regex approach, you need to make the following changes to make it work: use re.search(), because re.match() only matches at the beginning of the string, and add a capturing group around the src value. Fixed version:
>>> re.search(r'src="([\w-]+)"', var).group(1)
'path_1'
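The difference between the two calls can be checked directly. This is a minimal sketch using the same string as the question; the names html, pattern, anchored and found are illustrative only:

```python
import re

html = '<img src="path_1"><p>Words</p><img src="path_2>'
pattern = r'src="([\w-]+)"'

# re.match() only tries position 0, where the string starts with "<img",
# so it finds nothing and returns None.
anchored = re.match(pattern, html)

# re.search() scans forward and stops at the first occurrence.
found = re.search(pattern, html)
first_path = found.group(1) if found else None

print(anchored)    # None
print(first_path)  # path_1
```

Checking the result for None before calling group() avoids the AttributeError from the question when nothing matches.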
As suggested in the comments, use search(), since match() only tries to match your regular expression at the beginning of the string. You can also capture a named group to make the code more readable:
import re
var = '<img src="path_1"><p>Words</p><img src="path_2>'
# (?P<path1>...) names the group so it can be read back by name
match = re.search(r'src="(?P<path1>[\w-]+)"', var)
if match:
    print(match.group('path1'))
Output:
path_1
Try:
path1 = re.search(r'<img\s+src="(.*?)"><p>', var).group(1)  # path_1
BeautifulSoup is convenient, but very slow.
HTMLParser is a lot faster, but using it is painful.
re is the fastest option and, in my opinion, for stateless use cases it's worth it.
If the target text is stateful, i.e. it has lots of nesting and capturing the semantics is important, use an available parser instead of implementing a state machine (in other words, a parser) yourself. I would strongly suggest lxml for parsing HTML and XML. It is a little less convenient than bs4, but comparable to re in speed.
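For comparison, here is roughly what the standard-library HTMLParser route looks like for this question. It is only a sketch: the class name FirstImgSrc and its src attribute are hypothetical helpers, not part of the library, and it assumes Python 3, where the module lives at html.parser:

```python
from html.parser import HTMLParser


class FirstImgSrc(HTMLParser):
    """Hypothetical helper: remembers the src of the first <img> tag seen."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


var = '<img src="path_1"><p>Words</p><img src="path_2>'
parser = FirstImgSrc()
parser.feed(var)
print(parser.src)  # path_1
```

The subclass and callback are the "painful" part: even a one-attribute lookup needs its own class, whereas the regex and BeautifulSoup versions are one-liners.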