Python regex: match but do not include characters (Beautiful Soup)

Question

I am using Beautiful Soup and requests to pull information from a webpage. I am trying to get a list of book titles that contains just the titles, without the text title= in front of each one.

Example: text = 'a bunch of junk title=book1 more junk text title=book2'

What I am getting is titleList = ['title=book1', 'title=book2']

What I want is titleList = ['book1', 'book2']

I have tried capture groups, and that does split title= and book1 apart, but I am not sure how to append just group(2) to the list.

import re

import bs4
import requests

titleList = []

def getTitle(productUrl):
  # headers is assumed to be a dict of request headers defined elsewhere
  res = requests.get(productUrl, headers=headers)
  res.raise_for_status()

  soup = bs4.BeautifulSoup(res.text, 'lxml')
  title = re.compile(r'title=[A-Za-z0-9]+')
  findTitle = title.findall(res.text.strip())
  titleList.append(findTitle)

Answer 1

Your regex has no capture groups, so findall returns each whole match, including title=. Wrap the part you want to keep in parentheses. Also note that findall returns a list, so you should use extend instead of append (unless you want titleList to be a list of lists).

title = re.compile(r'title=([A-Za-z0-9]+)')   # note the parentheses
findTitle = title.findall(res.text.strip())
titleList.extend(findTitle)   # using extend and not append
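
To make the two points above concrete, here is a small sketch that compares findall with and without a capture group, and append with extend. It reuses the sample text from the question; the variable names whole, captured, nested and flat are only illustrative.

```python
import re

text = 'a bunch of junk title=book1 more junk text title=book2'

# Without a capture group, findall returns each whole match.
whole = re.findall(r'title=[A-Za-z0-9]+', text)

# With exactly one capture group, findall returns only the group's text.
captured = re.findall(r'title=([A-Za-z0-9]+)', text)

nested = []
nested.append(captured)   # adds the list itself as a single element

flat = []
flat.extend(captured)     # adds each title as its own element

print(whole)     # ['title=book1', 'title=book2']
print(captured)  # ['book1', 'book2']
print(nested)    # [['book1', 'book2']]
print(flat)      # ['book1', 'book2']
```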

A stand-alone example:

import re

titleList = []
text = 'a bunch of junk title=book1 more junk text title=book2'

title = re.compile(r'title=([A-Za-z0-9]+)') 
findTitle = title.findall(text.strip())
titleList.extend(findTitle) 
print(titleList)
>> ['book1', 'book2']
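
If you would rather keep two groups and pick out group(2) yourself, as the question describes, one possible sketch is to loop over finditer, which yields match objects. The two-group pattern below is an assumption about what the question's attempt looked like, since that version of the regex was not shown.

```python
import re

text = 'a bunch of junk title=book1 more junk text title=book2'

# Assumed two-group pattern: group(1) is the prefix, group(2) is the title.
pattern = re.compile(r'(title=)([A-Za-z0-9]+)')

title_list = []
for match in pattern.finditer(text):
    # Each match object exposes its groups; keep only the second one.
    title_list.append(match.group(2))

# With two groups, findall returns one tuple per match instead of strings.
pairs = pattern.findall(text)

print(title_list)  # ['book1', 'book2']
print(pairs)       # [('title=', 'book1'), ('title=', 'book2')]
```

Here append is the right call, because each match.group(2) is a single string rather than a list.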

Answer 2

Using re.findall with a capture group will do it:

>>> import re
>>> text = 'a bunch of junk title=book1 more junk text title=book2'
>>> re.findall(r'title=(\S+)', text)
['book1', 'book2']
>>>
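
Note that \S+ and [A-Za-z0-9]+ behave differently once a title contains punctuation. The sketch below uses a made-up sample string (not from the question) to show the difference, and also shows a lookbehind, (?<=title=), as another way to match the title without including the prefix.

```python
import re

# Hypothetical sample text with punctuation in and after the first title.
text = 'junk title=book-1, more junk title=book2'

# \S+ keeps going until the next whitespace, so it picks up '-1,' as well.
non_space = re.findall(r'title=(\S+)', text)

# [A-Za-z0-9]+ stops at the first character that is not a letter or digit.
alnum = re.findall(r'title=([A-Za-z0-9]+)', text)

# A lookbehind requires 'title=' before the match but leaves it out,
# so no capture group is needed.
lookbehind = re.findall(r'(?<=title=)\S+', text)

print(non_space)   # ['book-1,', 'book2']
print(alnum)       # ['book', 'book2']
print(lookbehind)  # ['book-1,', 'book2']
```

Which character class is right depends on what the real titles on the page look like.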


Answer 1: 4 votes, accepted, 2016-12-12 14:37:14

Answer 2: 1 vote, 2016-12-12 14:55:44
