
Regex to find a string in Python

I have a string:

<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />

What is the regex to find ABCDXYZ in Python?

Don't use regex to parse HTML. Use BeautifulSoup.

from bs4 import BeautifulSoup as BS

text = '''<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'''
soup = BS(text, 'html.parser')  # name the parser explicitly to avoid a warning
print(soup.find('img')['alt'])  # ABCDXYZ

If you're looking for the value of that alt attribute, you can do this:

>>> import re
>>> r = r'alt="(.*?)"'

Then:

>>> m = re.search(r, mystring)
>>> m.group(1)
'ABCDXYZ'

And you can use re.findall if you want to find more than one.
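As a minimal sketch of the re.findall variant: the second img tag and the names html and ALT_PATTERN are made up for illustration and are not part of the question.

```python
import re

# Hypothetical input: the first tag is from the question, the second is invented.
html = (
    '<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'
    '<img src="/other.jpg" alt="SECOND" />'
)

ALT_PATTERN = re.compile(r'alt="(.*?)"')

# With one capture group, findall returns a list of that group's text per match.
alts = ALT_PATTERN.findall(html)
print(alts)  # ['ABCDXYZ', 'SECOND']
```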

However, this code will be easily fooled by something like this:

<span>Here's some text explaining how to do alt="foo" in an img tag.</span>

On the other hand, it'll also fail to pick up something like this:

<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />
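Both failure modes are easy to reproduce. The sketch below runs the same alt="(.*?)" pattern over the two snippets above; the variable names are only illustrative.

```python
import re

ALT_PATTERN = re.compile(r'alt="(.*?)"')

# Plain prose that merely mentions alt="foo"; there is no img tag here.
prose = '''<span>Here's some text explaining how to do alt="foo" in an img tag.</span>'''

# A real img tag, but with single-quoted attribute values.
single_quoted = '''<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />'''

false_positive = ALT_PATTERN.findall(prose)
missed = ALT_PATTERN.findall(single_quoted)

print(false_positive)  # ['foo']  (matched text that is not an attribute)
print(missed)          # []       (the real attribute was not found)
```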

How do you deal with that? The short answer is: You don't. XML and HTML are not regular languages.

It's worth backing up here to point out that Python's re engine is not actually a true regular expression engine, and on top of that, it's embedded in a Turing-complete programming language. So obviously it is possible to build an HTML parser around Python and re. This answer shows part of a parser written in Perl, where regexes do most of the heavy lifting. But that doesn't mean you should do it this way. You shouldn't be writing a parser in the first place, given that perfectly good ones already exist, and if you did, you shouldn't be forcing yourself to use regexes even when there's an easier way to do what you want. For quick-and-dirty playing around, regex is fine. For a production program, it's almost always the wrong answer.
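To illustrate "perfectly good ones already exist", here is a rough sketch that uses the standard library's html.parser module instead of BeautifulSoup (my substitution, so that nothing needs installing). AltCollector and img_alts are hypothetical names, not from the answer.

```python
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Collect the alt attribute of every img tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        # Self-closing tags such as <img ... /> also reach this method by default.
        if tag == "img":
            for name, value in attrs:
                if name == "alt":
                    self.alts.append(value)


def img_alts(html):
    """Return the alt values of all img tags found in html."""
    collector = AltCollector()
    collector.feed(html)
    collector.close()
    return collector.alts


double_quoted = '<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'
single_quoted = "<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />"
prose = '<span>Here\'s some text explaining how to do alt="foo" in an img tag.</span>'

print(img_alts(double_quoted))  # ['ABCDXYZ']
print(img_alts(single_quoted))  # ['ABCDXYZ']
print(img_alts(prose))          # []
```

The parser handles both quoting styles and ignores the prose, because it looks at actual tags rather than at raw text.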

One way to convince your boss to let you use a parser is by crafting a suite of tests that are all obviously valid, and that cannot possibly be handled by any regex-based solution short of a full parser. If you can come up with a test that can be parsed, but only using exponential backtracking, and therefore takes 12 hours with regex vs. 0.1 seconds with bs4, even better, but that's a bit trickier…

Of course it's also worth looking for articles online (and SO questions like this and this and the 300 other dups) and picking the best ones to show your boss.

If you really can't convince your boss otherwise, then you're done at this point. Given what's been specified, this works. Given what may or may not actually be intended, nothing short of mind-reading will work. As you find more and more real-life cases that fail, you can hack it up by adding more and more complex alternations and/or context onto the regex itself, or possibly use a series of regexes and post-filters, until finally you get sick of it and find yourself a better job.

First, a disclaimer: You shouldn't be using regular expressions to parse HTML. You can use BeautifulSoup for this.

Next, if you are actually serious about using regular expressions and the above is the exact case you want, then you could do something like:

<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/_.]+" alt="([a-zA-Z0-9/]+)" />

and you could access the text by calling group(1) (or groups()) on the match object.
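A short sketch of using that pattern from Python follows. The src character class here includes _ and ., which the sample URL needs in order to match. PATTERN and alt_text are illustrative names.

```python
import re

text = '<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'

# The one capture group wraps the alt value; everything else must match literally.
PATTERN = re.compile(
    r'<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/_.]+" alt="([a-zA-Z0-9/]+)" />'
)

match = PATTERN.search(text)
# search returns None when nothing matches, so guard before calling group.
alt_text = match.group(1) if match else None
print(alt_text)  # ABCDXYZ
```

Because the pattern hard-codes the attribute order, quoting, and spacing, any change in the markup makes search return None, which is why the guard is there.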
