
BeautifulSoup: find multiple attribute types with the same value

Is there a way to use bs4 to search for multiple attribute types with the same value?

I am scraping meta tags from news articles in order to get information like the title, author, and date published. There is some variation in how this data is structured between sites, and I would like to use the most compact code possible to cover the known possibilities.

For example, the title could be in any of:

<meta content="Title of the article" property="og:title"/>
<meta content="Title of the article" property="title"/>
<meta name="Title of the article" property="og:title"/>
<meta name="Title of the article" property="title"/>

I can do something like this:

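import re

# Both branches look up the same <meta>; only the attribute holding the title differs.
try:
    title = soup.find('meta', {'property': re.compile('title')})['content']
except (TypeError, KeyError):
    title = soup.find('meta', {'property': re.compile('title')})['name']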

But it would be nice if I could do something like this:

## No result returned
soup.find('meta', {re.compile('property|name') : re.compile('title')})

## TypeError: unhashable type: 'list'
soup.find('meta', {['property','name'] : re.compile('title')})

Is there something along these lines that would work?

As far as I understand, you want to find more than one element with the same value in the HTML code, whichever attribute holds it.

content_metas = soup.find_all("meta", {"content": "Title of the article"})

name_metas = soup.find_all("meta", {"name": "Title of the article"})
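
If the value is known in advance, the two calls above could be folded into one by passing a function to find_all. The following is only a sketch: has_value is a hypothetical helper name, and the sample HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup (assumed for illustration, modeled on the question)
html = '''
<meta content="Title of the article" property="og:title"/>
<meta name="Title of the article" property="title"/>
<meta content="Someone" property="author"/>
'''
soup = BeautifulSoup(html, "html.parser")

def has_value(tag, value="Title of the article"):
    # True for <meta> tags where any attribute holds the given value
    return tag.name == "meta" and value in tag.attrs.values()

# find_all calls the function once per tag and keeps the tags it accepts
matches = soup.find_all(has_value)
print(len(matches))
```

This only helps when the value itself is known; for scraping an unknown title, the attribute names have to be checked instead.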

The main challenge is that attribute naming can vary, so each attribute should be checked against a list of valid names such as ['name','title','content','...'], and that check can be moved into a separate function.

To select only the <meta> whose property contains title, I would go with CSS selectors:

soup.select_one('meta[property*="title"]')

Pass the element to a function and iterate over its attributes, checking whether they match the possible names:

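def get_title(e):
    # select_one returns None when nothing matches, so guard against that
    if e is None:
        return None
    # Return the value of the first attribute whose name is a known candidate
    for a in e.attrs:
        if a in ['name','content']:
            return e.get(a)

title = get_title(soup.select_one('meta[property*="title"]'))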
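
As a variant of the selector above, the attribute check could also be pushed into the CSS selector itself by combining attribute-presence selectors in a comma-separated selector list. This is a sketch under the assumption that the title sits in either content or name; the sample HTML is invented for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup (assumed for illustration)
html = '<meta name="Title of the article" property="og:title"/>'
soup = BeautifulSoup(html, "html.parser")

# Match <meta> tags whose property contains "title" AND that carry
# either a content attribute or a name attribute
selector = 'meta[property*="title"][content], meta[property*="title"][name]'
tag = soup.select_one(selector)

# Prefer content, fall back to name; None if no tag matched
title = (tag.get("content") or tag.get("name")) if tag else None
print(title)
```

The trade-off is that every additional attribute name lengthens the selector, whereas the function approach only needs another entry in its list.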

The following example illustrates how the whole thing could also be implemented with a list comprehension. Since a news page will probably contain only one of the combinations, the result would be a list with exactly one element or none, depending on whether the attribute is present.

from bs4 import BeautifulSoup
html='''
<meta content="Title of the article" property="og:title"/>
<meta content="Title of the article" property="title"/>
<meta name="Title of the article" property="og:title"/>
<meta name="Title of the article" property="title"/>
<meta title="Title of the article" property="title"/>
'''

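# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')

titles = [t.get(a) for t in soup.select('meta[property*="title"]') for a in t.attrs if a in ['name','title','content']]
print(titles)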
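
Since the list is expected to hold at most one element on a real page, it may be convenient to reduce it to a single value. A minimal sketch, assuming the same selector and attribute names as above; first_title is a hypothetical helper, not part of bs4.

```python
from bs4 import BeautifulSoup

def first_title(soup, names=("content", "name", "title")):
    # Generator version of the list comprehension: yields candidate values lazily
    candidates = (
        t.get(a)
        for t in soup.select('meta[property*="title"]')
        for a in t.attrs
        if a in names
    )
    # Take the first candidate, or None when nothing matched
    return next(candidates, None)

soup = BeautifulSoup(
    '<meta content="Title of the article" property="og:title"/>', "html.parser"
)
print(first_title(soup))
print(first_title(BeautifulSoup("<p>no meta here</p>", "html.parser")))
```

Using next with a default avoids indexing into an empty list when a site has none of the known combinations.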
