Python Beautiful Soup: extracting HTML metadata

Question

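I'm seeing some strange behavior that I don't quite understand. I'm hoping someone can explain what is going on.

Consider the following metadata: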

<meta property="og:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">

This line successfully finds all the "og" properties and returns a list.

opengraphs = doc.html.head.findAll(property=re.compile(r'^og'))

However, this line fails to do the same for the Twitter cards.

twitterCards = doc.html.head.findAll(name=re.compile(r'^twitter'))

Why does the first line successfully find all the "og" (Open Graph) tags, while the second fails to find the Twitter cards?

Answer 1

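This is because name is the name of the tag-name parameter of find_all(), which means that in this case BeautifulSoup looks for elements whose tag name starts with twitter. By contrast, property is not a reserved parameter, so it is treated as an attribute filter.

To specify that you actually mean the attribute, use: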
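To make the tag-name behavior concrete, here is a minimal sketch. The <twitter-widget> element is a made-up tag added purely for illustration, and html.parser is an assumed parser choice; neither comes from the question.

```python
from bs4 import BeautifulSoup
import re

# Hypothetical markup: <twitter-widget> is an invented element used only
# to show what a tag-name match looks like.
html = '''
<head>
<meta name="twitter:title" content="This is the Tesla Semi truck">
<twitter-widget></twitter-widget>
</head>
'''
soup = BeautifulSoup(html, 'html.parser')

# name= filters on the tag name, exactly like the first positional argument.
by_keyword = soup.find_all(name=re.compile(r'^twitter'))
by_position = soup.find_all(re.compile(r'^twitter'))

tag_names = [tag.name for tag in by_keyword]
same_result = by_keyword == by_position
print(tag_names)
print(same_result)
```

Only the invented element is matched; the meta tag is skipped because its tag name is meta, regardless of what its name attribute contains.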

doc.html.head.find_all(attrs={'name': re.compile(r'^twitter')})

Or, via a CSS selector:

doc.html.head.select("[name^=twitter]")

where ^= means "starts with".
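As a rough check that the two approaches agree, the following sketch runs both against a small document. The viewport tag and the choice of html.parser are assumptions added for illustration; the selector is written as meta[name^="twitter"] to restrict it to meta elements.

```python
from bs4 import BeautifulSoup
import re

# Sample document; the viewport tag is an added assumption that shows
# non-matching name attributes are ignored.
html = '''
<html><head>
<meta property="og:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">
<meta name="viewport" content="width=device-width">
</head></html>
'''
doc = BeautifulSoup(html, 'html.parser')

# Attribute filter passed through attrs, so "name" is not taken as the tag name.
via_attrs = doc.html.head.find_all(attrs={'name': re.compile(r'^twitter')})

# Equivalent CSS attribute selector, limited to <meta> elements.
via_css = doc.html.head.select('meta[name^="twitter"]')

attrs_contents = [tag['content'] for tag in via_attrs]
css_contents = [tag['content'] for tag in via_css]
print(attrs_contents)
print(css_contents)
```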

Answer 2

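The problem is name=, which has a special meaning. It is used to match the tag name, which in your code is meta.

You have to pass "meta" as the tag name and use a dictionary with a "name" key for the attribute.

An example with different items: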

from bs4 import BeautifulSoup
import re

data='''
<meta property="og:title" content="This is the Tesla Semi truck">
<meta property="twitter:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">
'''

head = BeautifulSoup(data, 'html.parser')  # name the parser explicitly to avoid a parser-guessing warning

print(head.findAll(property=re.compile(r'^og'))) # OK
print(head.findAll(property=re.compile(r'^tw'))) # OK

print(head.findAll(name=re.compile(r'^meta'))) # OK
print(head.findAll(name=re.compile(r'^tw')))   # empty

print(head.findAll('meta', {'name': re.compile(r'^tw')})) # OK
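Building on that example, a small helper could gather the matching tags into a dictionary. This is only a sketch: collect_meta is a hypothetical name, and checking both the property and name attributes is an assumption about how pages typically declare these tags.

```python
from bs4 import BeautifulSoup
import re

def collect_meta(html, prefix):
    """Hypothetical helper: map each matching meta key to its content value."""
    soup = BeautifulSoup(html, 'html.parser')
    pattern = re.compile('^' + re.escape(prefix))
    found = {}
    # Check both attributes, since pages use either one for these tags.
    for attr in ('property', 'name'):
        for tag in soup.find_all('meta', attrs={attr: pattern}):
            found[tag[attr]] = tag.get('content')
    return found

data = '''
<meta property="og:title" content="This is the Tesla Semi truck">
<meta property="twitter:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">
'''

og = collect_meta(data, 'og:')
twitter = collect_meta(data, 'twitter:')
print(og)
print(twitter)
```

Because the tag name 'meta' is passed positionally and the attribute filter goes through attrs, the name attribute is never mistaken for the tag name.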

Answer 1
3 2017-12-17 05:34:23

Answer 2
2 accepted 2017-12-17 05:35:28
