
Regular expression with skipping a few words

I am struggling to get just the text between the quotes of the 'alt' attribute. I have been trying regular expressions like [!?border="0"] to skip over the 'border' attribute, but it still won't work.

I have tried \\s(border="0")\\s(alt=").*?" but the match also covers the 'border' attribute.

Here's the text that I'm trying to extract from using regex:

<img src="http://www.ebgames.com.au/0141/169/5.png" border="0" alt="Far Cry 3" title=" Far Cry 3 " class="photo"/>            </a>

I am just trying to extract the text between the quotes of the alt attribute. Extracting the title would probably be better, if possible. Please help, thank you.

Try this regex:

border=\"0\" alt=\"(.*?)\"

Demo: https://regex101.com/r/1kbiBv/1/
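As a minimal sketch of how this pattern might be used from Python (the variable names `html`, `pattern`, and `alt_text` are illustrative, and the sample string is assumed to have border="0" directly before alt):

```python
import re

# Sample markup; assumed to contain border="0" directly before alt.
html = ('<img src="http://www.ebgames.com.au/0141/169/5.png" border="0" '
        'alt="Far Cry 3" title=" Far Cry 3 " class="photo"/>')

# The capture group (.*?) holds only the text between the quotes.
pattern = re.compile(r'border="0" alt="(.*?)"')

match = pattern.search(html)
alt_text = match.group(1) if match else None
print(alt_text)
```

Inside a Python raw string the double quotes need no backslashes, which is why the pattern looks slightly shorter here.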

You could also use a positive look-ahead and a positive look-behind to match only what is between the quotes:

(?<=border=\"0\" alt=\").*?(?=\")

Demo: https://regex101.com/r/1kbiBv/2/
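A rough Python sketch of the look-around variant, assuming the same sample markup (the names `lookaround` and `alt_values` are made up for this example). Python's `re` module only accepts fixed-width look-behinds, which this literal prefix satisfies:

```python
import re

html = ('<img src="http://www.ebgames.com.au/0141/169/5.png" border="0" '
        'alt="Far Cry 3" title=" Far Cry 3 " class="photo"/>')

# The look-behind requires the literal prefix; the look-ahead requires a closing quote.
lookaround = re.compile(r'(?<=border="0" alt=").*?(?=")')

# No capture group is needed: the whole match is the text between the quotes.
alt_values = lookaround.findall(html)
print(alt_values)
```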

There is a better way to extract HTML elements and attributes, using BeautifulSoup:

from bs4 import BeautifulSoup
div_test='<img src="http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg" border="0" alt="The Durrells: Series 2" title=" The Durrells: Series 2 " class="photo"/> '
soup = BeautifulSoup(div_test, "lxml")
result = soup.find("img").get('alt')
result

Output:

'The Durrells: Series 2'
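If installing BeautifulSoup is not an option, the same idea (parse the markup instead of pattern-matching it) can be sketched with the standard library's `html.parser`. This is only an illustration: the class name `ImgAttrCollector` and the `images` list are invented for the example, and it also picks up the title attribute that the question asked about.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Collects the (alt, title) pair of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags such as
        # <img .../> are routed here as well by the default implementation.
        if tag == "img":
            attr_map = dict(attrs)
            self.images.append((attr_map.get("alt"), attr_map.get("title")))


div_test = ('<img src="http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg" '
            'border="0" alt="The Durrells: Series 2" '
            'title=" The Durrells: Series 2 " class="photo"/> ')

parser = ImgAttrCollector()
parser.feed(div_test)
print(parser.images)
```

The title value keeps its surrounding spaces, so call `.strip()` on it if they are unwanted.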

You can use a lambda to build the regex for whichever attribute you want to extract from your input.

You can try this code:

import re

a = '''<img src="http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg" border="0" alt="The Durrells: Series 2" title=" The Durrells: Series 2 " class="photo"/>            </a>
'''

find_tag = lambda x: r'{0}="(.*?)"'.format(x)
# Same as doing:
# regex = re.compile(find_tag('border="0" alt'))
regex = re.compile(find_tag("alt"))
text = re.findall(regex, a)
print(text)

Output:

['The Durrells: Series 2']

Also, this code will work with the other attributes as well, for example:

regex = re.compile(find_tag("src"))
# Same as doing:
# regex = re.compile(find_tag('<img src'))
text = re.findall(regex, a)
print(text)

Output:

['http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg']
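One possible hardening of the same idea, shown only as a sketch: a named function (here called `find_attr`, a hypothetical name) that escapes the attribute name with `re.escape` and uses a negative look-behind so that `alt` does not also match inside a longer name such as `data-alt`.

```python
import re


def find_attr(name, text):
    """Return every double-quoted value of the attribute `name` in `text`."""
    # re.escape guards against regex metacharacters in the name; the
    # (?<![\w-]) look-behind rejects matches inside longer attribute names.
    pattern = re.compile(r'(?<![\w-]){0}="(.*?)"'.format(re.escape(name)))
    return pattern.findall(text)


a = ('<img src="http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg" '
     'border="0" alt="The Durrells: Series 2" '
     'title=" The Durrells: Series 2 " class="photo"/>            </a>')

print(find_attr("alt", a))
print(find_attr("title", a))
```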

I think re.search with a simple regex works.

import re

s = '<img src="http://www.ebgames.com.au/0141/169/5.png" border="0" alt="Far Cry 3" title=" Far Cry 3 " class="photo"/>            </a>'
# Group 1 captures the alt value, group 2 the title value.
pat = r'alt="([^"]*)".* title="([^"]*)"'
a = re.search(pat, s)
print(a[1])  # content of the alt attribute: Far Cry 3
print(a[2])  # content of the title attribute: " Far Cry 3 " (surrounding spaces included)
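Two caveats with the snippet above: `re.search` returns `None` when nothing matches, so `a[1]` would raise a `TypeError`, and the single pattern assumes alt appears before title. A hedged sketch that searches for each attribute separately and strips the padding (the helper name `alt_and_title` is invented for this example):

```python
import re


def alt_and_title(html):
    """Return (alt, title), stripped of padding; None for a missing attribute."""
    alt = re.search(r'\balt="([^"]*)"', html)
    title = re.search(r'\btitle="([^"]*)"', html)
    return (
        alt.group(1).strip() if alt else None,
        title.group(1).strip() if title else None,
    )


# Attribute order no longer matters, and a missing attribute yields None.
print(alt_and_title('<img title=" Far Cry 3 " alt="Far Cry 3"/>'))
print(alt_and_title('<img src="5.png"/>'))
```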

This code finds what you need using the pattern 'alt=".*?"'. Note that each match includes the alt=" prefix and the closing quote, not only the text between the quotes.

import re

w = '<img src="http://rcdn-1.fishpond.com.au/0141/169/297/319967448/5.jpeg" border="0" alt="The Durrells: Series 2" title=" The Durrells: Series 2 " class="photo"/>   </a>'

pattern = 'alt=".*?"'
m = re.findall(pattern, w)
print(m)  # ['alt="The Durrells: Series 2"']

