Python regex match but not include characters beautiful soup
I am using Beautiful Soup and requests to pull information down from a webpage. I am trying to get a list of book titles that contains just the titles, without the text title= in front of each one.
Example: text = 'a bunch of junk title=book1 more junk text title=book2'
What I am getting is titleList = ['title=book1', 'title=book2']
What I want is titleList = ['book1', 'book2']
I have tried matching groups, and that does break title= and book1 apart, but I am not sure how to append just group(2) to the list.
titleList = []

def getTitle(productUrl):
    res = requests.get(productUrl, headers=headers)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'lxml')
    title = re.compile(r'title=[A-Za-z0-9]+')
    findTitle = title.findall(res.text.strip())
    titleList.append(findTitle)
Your regex has no capture groups. You should also note that findall returns a list, so you should use extend instead of append (unless you want titleList to be a list of lists).
title = re.compile(r'title=([A-Za-z0-9]+)')  # note the parentheses (capture group)
findTitle = title.findall(res.text.strip())
titleList.extend(findTitle) # using extend and not append
A stand-alone example:
import re
titleList = []
text = 'a bunch of junk title=book1 more junk text title=book2'
title = re.compile(r'title=([A-Za-z0-9]+)')
findTitle = title.findall(text.strip())
titleList.extend(findTitle)
print(titleList)
>> ['book1', 'book2']
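To make the append versus extend point concrete, here is a minimal sketch using the same sample text. The names pattern, appended and extended are illustrative only and are not part of the original code.

```python
import re

text = 'a bunch of junk title=book1 more junk text title=book2'
pattern = re.compile(r'title=([A-Za-z0-9]+)')

# append adds the whole list returned by findall as a single element
appended = []
appended.append(pattern.findall(text))

# extend adds each captured string as its own element
extended = []
extended.extend(pattern.findall(text))

print(appended)  # [['book1', 'book2']]
print(extended)  # ['book1', 'book2']
```

If getTitle is called once per product page, extend keeps titleList flat across calls, whereas append would add one nested list per page.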
Using re.findall with a capture group will do it:
>>> import re
>>> text = 'a bunch of junk title=book1 more junk text title=book2'
>>> re.findall(r'title=(\S+)', text)
['book1', 'book2']
>>>
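For completeness, two other routes that should give the same result on this sample text, sketched under the assumption that titles consist only of word characters (\w). The first keeps the match objects and pulls out the group explicitly, which is closest to the "append just the group" idea in the question; the group number is 1 here because the pattern has a single capture group. The second uses a lookbehind, a different technique from the capture group above, so that title= is required but never part of the match.

```python
import re

text = 'a bunch of junk title=book1 more junk text title=book2'

# finditer yields match objects; group(1) is only the captured part
titles_from_groups = [m.group(1) for m in re.finditer(r'title=(\w+)', text)]

# (?<=title=) is a lookbehind: 'title=' must precede the match
# but is not included in what findall returns
titles_from_lookbehind = re.findall(r'(?<=title=)\w+', text)

print(titles_from_groups)
print(titles_from_lookbehind)
```

Note that \S+ in the pattern above matches any run of non-whitespace, so on real HTML it may also pick up quotes or trailing punctuation; a narrower character class such as [A-Za-z0-9]+ or \w+ avoids that if titles are known to be alphanumeric.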