
Regex to find a string python

I have a string

<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />

What is the regex to find ABCDXYZ in Python?

Don't use regex to parse HTML. Use BeautifulSoup.

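from bs4 import BeautifulSoup as BS

text = '''<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'''
# Name the parser explicitly so bs4 does not have to guess one.
soup = BS(text, 'html.parser')
# find() returns the first <img> tag; attrs maps attribute names to values.
print(soup.find('img').attrs['alt'])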

If you're looking for the value of that alt attribute, you can do this:

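>>> import re
>>> mystring = '<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'
>>> r = r'alt="(.*?)"'

Then:

>>> m = re.search(r, mystring)
>>> m.group(1)
'ABCDXYZ'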

And you can use re.findall if you want to find more than one.
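As a rough sketch of the re.findall variant: the two-tag input below is made up for illustration (it is not from the question), and the pattern is the same non-greedy one used above.

```python
import re

# Hypothetical sample input with two img tags (not from the question).
html = (
    '<a href="/p/1/"><img src="/a.jpg" alt="ABCDXYZ" /></a>'
    '<a href="/p/2/"><img src="/b.jpg" alt="SECOND" /></a>'
)

# With one capturing group, findall returns a list of the captured strings,
# one per non-overlapping match, in document order.
alts = re.findall(r'alt="(.*?)"', html)
print(alts)  # ['ABCDXYZ', 'SECOND']
```

If nothing matches, findall returns an empty list rather than None, so no extra check is needed before looping over the result.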

However, this code will be easily fooled by something like this:

<span>Here's some text explaining how to do alt="foo" in an img tag.</span>

On the other hand, it'll also fail to pick up something like this:

<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />

How do you deal with that? The short answer is: You don't. XML and HTML are not regular languages.
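To make the two failure cases concrete, here is a small sketch that runs the regex against both snippets and then does the same job with a real tokenizer. It uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra installs; the class name AltCollector and the helper extract_alts are made up for this example.

```python
import re
from html.parser import HTMLParser

PATTERN = r'alt="(.*?)"'

prose = '''<span>Here's some text explaining how to do alt="foo" in an img tag.</span>'''
single = '''<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />'''

print(re.findall(PATTERN, prose))   # ['foo'], a false positive from plain text
print(re.findall(PATTERN, single))  # [], the single-quoted attribute is missed


class AltCollector(HTMLParser):
    """Hypothetical helper: collects the alt attribute of every img tag."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == 'img':
            for name, value in attrs:
                if name == 'alt':
                    self.alts.append(value)


def extract_alts(markup):
    parser = AltCollector()
    parser.feed(markup)
    parser.close()
    return parser.alts


print(extract_alts(prose))   # []
print(extract_alts(single))  # ['ABCDXYZ']
```

The parser gets both cases right because it only looks at attributes of actual img tags, and it does not care which quote style the markup uses.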

It's worth backing up here to point out that Python's re engine is not actually a true regular expression engine, and, on top of that, it's embedded in a Turing-complete programming language. So obviously it is possible to build an HTML parser around Python and re. This answer shows part of a parser written in Perl, where regexes do most of the heavy lifting. But that doesn't mean you should do it this way. You shouldn't be writing a parser in the first place, given that perfectly good ones already exist, and if you did, you shouldn't be forcing yourself to use regexes even when there's an easier way to do what you want. For quick-and-dirty playing around, regex is fine. For a production program, it's almost always the wrong answer.

One way to convince your boss to let you use a parser is by crafting a suite of tests that are all obviously valid, and that cannot possibly be handled by any regex-based solution short of a full parser. If you can come up with a test that can be parsed, but only using exponential backtracking, and therefore takes 12 hours with regex vs. 0.1 seconds with bs4, even better, but that's a bit trickier…
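For a feel of what exponential backtracking looks like, here is a toy sketch. It is not an HTML test case: the nested-quantifier pattern (a+)+$ and the helper name time_failed_match are just a well-known illustration, and the actual timings depend on your machine and Python version.

```python
import re
import time

# Nested quantifiers: there are exponentially many ways to split a run of
# 'a's between the inner and outer '+', and a failed match tries them all.
NESTED = re.compile(r'(a+)+$')


def time_failed_match(n):
    """Time a match attempt against n 'a's followed by a 'b' (which cannot match)."""
    subject = 'a' * n + 'b'
    start = time.perf_counter()
    matched = NESTED.match(subject) is not None
    return matched, time.perf_counter() - start


# Kept deliberately small; each extra character roughly doubles the work.
for n in (10, 14, 18):
    matched, seconds = time_failed_match(n)
    print(n, matched, f'{seconds:.4f}s')
```

Pushing n much past the mid-20s makes the failed match take a very long time, which is the kind of behavior the paragraph above is hinting at.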

Of course it's also worth looking for articles online (and SO questions like this and this and the 300 other dups) and picking the best ones to show your boss.

If you really can't convince your boss otherwise, then you're done at this point. Given what's been specified, this works. Given what may or may not actually be intended, nothing short of mind-reading will work. As you find more and more real-life cases that fail, you can hack it up by adding more and more complex alternations and/or context onto the regex itself, or possibly use a series of regexes and post-filters, until finally you get sick of it and find yourself a better job.

First, a disclaimer: You shouldn't be using regular expressions to parse HTML. You can use BeautifulSoup for this.

Next, if you are actually serious about using regular expressions, and the above is the exact case you want, then you could do something like:

<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/_.]+" alt="([a-zA-Z0-9/]+)" />

and you could access the text by calling group(1) on the match object.
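A minimal sketch of using such a strict pattern on the string from the question. One assumption to flag: the src value in the question contains underscores and a dot, so the src character class here is [a-zA-Z0-9/_.], which is wide enough to cover them.

```python
import re

text = '<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'

# The src class includes '_' and '.' because the question's path uses both.
pattern = re.compile(
    r'<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/_.]+" alt="([a-zA-Z0-9/]+)" />'
)

m = pattern.search(text)
# search() returns None when nothing matches, so guard before calling group().
alt = m.group(1) if m else None
print(alt)  # ABCDXYZ
```

A pattern this strict breaks as soon as the markup changes at all (attribute order, extra whitespace, single quotes), which is the trade-off for matching the exact case.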

