Grabbing meta content with Beautiful Soup
Do I need to use regex here?
The content I want looks like:
<meta content="text I want to grab" name="description"/>
However, there are many tags that start with "meta content=", and I want the one that ends in name="description". I'm pretty new to regex, but I thought BS would be able to handle this.
Assuming you were able to read the HTML contents into a variable named html, you have to parse the HTML using Beautiful Soup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
Then, to search for <meta content="text I want to grab" name="description"/>, you have to find a tag with name 'meta' and attribute name='description':
def is_meta_description(tag):
    # .get() avoids a KeyError on <meta> tags that have no name attribute
    return tag.name == 'meta' and tag.get('name') == 'description'

meta_tag = soup.find(is_meta_description)
You are trying to fetch the content attribute of the tag, so:
content = meta_tag['content']
Since it is a simple search, there is also a simpler way to find the tag:
meta_tag = soup.find('meta', attrs={'name': 'description'})
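Putting the steps together, here is a minimal end-to-end sketch. The sample HTML and the helper name get_meta_description are made up for illustration and are not part of the original answer. It also guards against the case where no matching tag exists, since soup.find returns None then and indexing None would raise a TypeError.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this example.
html = """
<html><head>
<meta charset="utf-8"/>
<meta content="width=device-width" name="viewport"/>
<meta content="text I want to grab" name="description"/>
</head><body></body></html>
"""

def get_meta_description(html):
    """Return the content of the description meta tag, or None if absent.

    Hypothetical helper name; the lookup itself is the soup.find call
    from the answer above.
    """
    soup = BeautifulSoup(html, 'html.parser')
    meta_tag = soup.find('meta', attrs={'name': 'description'})
    if meta_tag is None:
        # No <meta name="description"> on this page.
        return None
    # .get() returns None instead of raising if 'content' is missing.
    return meta_tag.get('content')

print(get_meta_description(html))  # prints: text I want to grab
```

The other meta tags (charset, viewport) are skipped because the attrs filter only matches a name attribute equal to 'description', so no regex is needed.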