
Python regex: match but do not include characters (Beautiful Soup)

I am using Beautiful Soup and requests to pull down information from a webpage. I am trying to get a list of book titles that contains just the titles, without the text title= in front of each one.

Example: text = 'a bunch of junk title=book1 more junk text title=book2'

What I am getting is titleList = ['title=book1', 'title=book2']

What I want is titleList = ['book1', 'book2']

I have tried matching groups, and that does split title= and book1 apart, but I am not sure how to append just group(2) to the list.

titleList = []

def getTitle(productUrl):

  res = requests.get(productUrl, headers=headers)
  res.raise_for_status()

  soup = bs4.BeautifulSoup(res.text, 'lxml')
  title = re.compile(r'title=[A-Za-z0-9]+')
  findTitle = title.findall(res.text.strip())
  titleList.append(findTitle)

Your regex has no capture groups. When the pattern contains exactly one capture group, findall returns only the text matched by that group rather than the whole match. Note also that findall returns a list, so you should use extend instead of append (unless you want titleList to be a list of lists).

title = re.compile(r'title=([A-Za-z0-9]+)')   # note parenthesis
findTitle = title.findall(res.text.strip())
titleList.extend(findTitle)   # using extend and not append
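
To make the append/extend distinction concrete, here is a minimal sketch using the sample text from the question. The variable names `appended` and `extended` are illustrative only and do not appear in the original code.

```python
import re

text = 'a bunch of junk title=book1 more junk text title=book2'

# One capture group, so findall returns only the captured titles.
matches = re.findall(r'title=([A-Za-z0-9]+)', text)

appended = []
appended.append(matches)   # adds the whole list as a single element

extended = []
extended.extend(matches)   # adds each title as its own element

print(appended)  # [['book1', 'book2']]
print(extended)  # ['book1', 'book2']
```

This matters if getTitle is called once per URL: with append, titleList ends up holding one inner list per page, while extend keeps it flat.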

A stand-alone example:

import re

titleList = []
text = 'a bunch of junk title=book1 more junk text title=book2'

title = re.compile(r'title=([A-Za-z0-9]+)') 
findTitle = title.findall(text.strip())
titleList.extend(findTitle) 
print(titleList)
# Output: ['book1', 'book2']
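
The question mentions wanting to append just group(2). The asker's two-group pattern is not shown, so the one below is an assumption; with it, finditer gives access to each match object, and group(2) can be appended directly. As an alternative technique (not part of the answer above), a lookbehind assertion requires title= to precede the match without including it, so no capture group is needed.

```python
import re

text = 'a bunch of junk title=book1 more junk text title=book2'

# Assumed two-group pattern: group(1) is 'title=', group(2) is the title.
pattern = re.compile(r'(title=)([A-Za-z0-9]+)')

title_list = []
for match in pattern.finditer(text):
    title_list.append(match.group(2))   # keep only the title part

print(title_list)  # ['book1', 'book2']

# Alternative: a fixed-width lookbehind checks for 'title=' without consuming it,
# so findall returns the whole match, which is just the title.
lookbehind_titles = re.findall(r'(?<=title=)[A-Za-z0-9]+', text)
print(lookbehind_titles)  # ['book1', 'book2']
```

Note that findall with two capture groups would return a list of tuples such as ('title=', 'book1'), which is why the single-group pattern is the simpler fix.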

Using re.findall with a capture group will do it:

>>> import re
>>> text = 'a bunch of junk title=book1 more junk text title=book2'
>>> re.findall(r'title=(\S+)', text)
['book1', 'book2']
>>>
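
The two answers use different character classes, and they behave differently once the input contains punctuation. The sketch below uses a hypothetical input string (not from the question) to show the difference; which one is appropriate depends on what real titles on the page look like.

```python
import re

# Hypothetical input with punctuation next to and inside the titles.
text = 'see title=book1, then title=book-2.'

# [A-Za-z0-9]+ stops at the first character that is not a letter or digit.
alnum_titles = re.findall(r'title=([A-Za-z0-9]+)', text)
print(alnum_titles)  # ['book1', 'book']

# \S+ runs until the next whitespace, so trailing punctuation is captured too.
nonspace_titles = re.findall(r'title=(\S+)', text)
print(nonspace_titles)  # ['book1,', 'book-2.']
```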

