
Using python regular expression to find an image path

I have a variable like the one below:

var = '<img src="path_1"><p>Words</p><img src="path_2>'

It's a string, but it obviously contains HTML elements. How do I get only the first path (i.e. path_1) using a regex?

I am trying something like this:

match = re.match(r'src=\"[\w-]+\"', var)
print match.group(0)

I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

Any help is appreciated.

You should use an HTML parser like BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> var = '<img src="path_1"><p>Words</p><img src="path_2>'
>>> soup = BeautifulSoup(var, "html.parser")
>>> soup.img["src"]
'path_1'

As for the regex approach, you need to make the following changes for it to work:

  • switch to re.search(), because re.match() only tries to match at the beginning of the string
  • add a capturing group to capture the src value
  • there is no need to escape double quotes

Fixed version:

>>> re.search(r'src="([\w-]+)"', var).group(1)
'path_1'
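
To see why the original attempt returned None, here is a small sketch comparing the two calls on the same string. The names pattern, anchored, found and first_path are only illustrative and do not come from the question.

```python
import re

var = '<img src="path_1"><p>Words</p><img src="path_2>'
pattern = r'src="([\w-]+)"'

# re.match() only tries the pattern at position 0, where the string
# starts with '<img', so it returns None here.
anchored = re.match(pattern, var)

# re.search() scans forward and stops at the first position that matches.
found = re.search(pattern, var)

# Guard against None instead of calling .group() unconditionally.
first_path = found.group(1) if found else None
print(anchored, first_path)  # None path_1
```

Checking the result before calling .group() avoids the AttributeError from the question when nothing matches.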

As suggested in the comments, use search(), since match() only tries to match your regular expression at the beginning of the string. You can also use a named capturing group to make the code more readable:

import re

var = '<img src="path_1"><p>Words</p><img src="path_2>'
match = re.search(r'src="(?P<path1>[\w-]+)"', var)
if match:
    print(match.group('path1'))

Output:

path_1

Try:

path1 = re.search(r'<img\s+src="(.*?)"><p>', var).group(1)  # path_1

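
Note that the pattern above only matches when src is the first attribute and a <p> tag follows immediately. A slightly more tolerant sketch is shown below. It assumes double-quoted attribute values, and the names IMG_SRC and first_img_src are hypothetical helpers, not part of the original answer. It is still a heuristic and not a substitute for a parser.

```python
import re

var = '<img src="path_1"><p>Words</p><img src="path_2>'

# Allow other attributes before src and do not require a following <p>.
# Assumption: attribute values are wrapped in double quotes.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc="([^"]*)"', re.IGNORECASE)

def first_img_src(html):
    """Return the src of the first <img> tag, or None if there is none."""
    m = IMG_SRC.search(html)
    return m.group(1) if m else None

print(first_img_src(var))                            # path_1
print(first_img_src('<img alt="x" src="a/b.png">'))  # a/b.png
print(first_img_src('<p>no image</p>'))              # None
```
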
  1. BeautifulSoup is convenient, but very slow.

  2. HTMLParser is a lot faster, but painful to use.

  3. re is the fastest option and, in my opinion, worth it for stateless use cases.

If the target text is stateful, i.e. it has lots of nesting and capturing the semantics matters, use an available parser instead of implementing a state machine (in effect, a parser) yourself. I would strongly suggest lxml for parsing HTML and XML. It is a little less convenient than bs4 but comparable to re in speed.
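
For comparison, here is a minimal sketch of option 2. It assumes "HTMLParser" refers to the standard library's html.parser.HTMLParser class in Python 3. The subclass name FirstImgSrc is made up for illustration.

```python
from html.parser import HTMLParser

class FirstImgSrc(HTMLParser):
    """Remember the src attribute of the first <img> tag seen."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")

var = '<img src="path_1"><p>Words</p><img src="path_2>'
parser = FirstImgSrc()
parser.feed(var)
print(parser.src)  # path_1
```

This is more code than the one-line regex, which is presumably the "painful" part, but it does not depend on attribute order or on what follows the tag.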
