BeautifulSoup Python: Extracting Tag Title for Specific Tags with Attribute

I'm working on a scraper using BeautifulSoup to pull concert information for certain artists on Songkick. The URL I'm working with is https://www.songkick.com/metro-areas/17835-us-los-angeles-la/february-2020?page=1. I've been able to extract all artist, venue, city, and state info; the only thing I'm having trouble with is extracting the dates of the concerts.

Looking at the HTML elements, I see that the show dates appear as title attributes on li elements, for example li title="Saturday 01 February 2020", which are children of ul class="event-listings". One approach I tried was extracting the time datetime values nested under those li elements, but my output included the entire HTML markup for each time element instead of just the datetime value. I'm looking to extract either the li titles or the time datetime values. These li elements don't have a class either.

Here is some of my code:

import requests
from bs4 import BeautifulSoup as bs4

pages=[]
artists=[]
venues=[]
dates=[]
cities=[]
states=[]

pages_to_scrape=1

for i in range(1, pages_to_scrape+1):
    url = 'https://www.songkick.com/metro-areas/17835-us-los-angeles-la/february-2020?page={}'.format(i)
    pages.append(url)
for item in pages:
    page = requests.get(item)
    soup = bs4(page.text, 'html.parser')
    for m in soup.findAll('li', title=True):
        date = m.find('time')
        print(date)

Output:

<time datetime="2020-02-01T20:00:00-0800"></time>
<time datetime="2020-02-01T20:00:00-0800"></time>
<time datetime="2020-02-01T19:00:00-0800"></time>
<time datetime="2020-02-01T19:00:00-0800"></time>
<time datetime="2020-02-01T21:00:00-0800"></time>
etc...

I'm looking for output like this:

2020-02-01
2020-02-01
2020-02-01
etc...

Or, if it's possible to grab the title values of the li elements somehow, output like this:

Saturday 01 February 2020
Saturday 01 February 2020
Saturday 01 February 2020
Saturday 01 February 2020
etc...

I'm curious whether I could split on the " in the time datetime markup, but since it's a tag rather than text I don't think that's possible. Also, I don't want to grab the first li, class="with-date", since that is just the date headline for the page; that's why I'm not simply grabbing every li.

Try m.find('time')['datetime'] instead of m.find('time').
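The suggestion above might look like the following minimal sketch. The HTML fragment is a hypothetical stand-in modeled on the structure described in the question, not markup copied from Songkick, and the variable names are illustrative only. Indexing a tag with ['datetime'] returns the attribute value as a plain string, so it can be split on "T" to keep just the date.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question's description (an assumption,
# not the real page): the first li is the date headline and has no title.
html = """
<ul class="event-listings">
  <li class="with-date"><time datetime="2020-02-01"></time></li>
  <li title="Saturday 01 February 2020"><time datetime="2020-02-01T20:00:00-0800"></time></li>
  <li title="Saturday 01 February 2020"><time datetime="2020-02-01T19:00:00-0800"></time></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

dates = []
# title=True matches only li elements that carry a title attribute,
# which skips the "with-date" headline item.
for li in soup.find_all("li", title=True):
    time_tag = li.find("time")
    # Guard against items without a <time datetime="..."> child.
    if time_tag is not None and time_tag.has_attr("datetime"):
        # The attribute value is a string; keep the part before the "T".
        dates.append(time_tag["datetime"].split("T")[0])

print(dates)
```

The guard before indexing is optional, but it avoids a TypeError or KeyError if some list item lacks a time tag or the attribute.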

Here's a way to achieve this:

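import requests
from bs4 import BeautifulSoup

# Fetch the listing page and parse it
page = requests.get("https://www.songkick.com/metro-areas/17835-us-los-angeles-la/february-2020?page=1")
soup = BeautifulSoup(page.content, "html.parser")

# Keep only <time> tags that have a datetime attribute, then take the date part before the "T"
tags = soup.find_all("time", datetime=True)
dates = [t["datetime"].split("T")[0] for t in tags]
print(dates)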

Notes:

  1. I'm quite sure that crawling Songkick in this way violates their terms and conditions.
  2. You might consider using their API, which works well: https://www.songkick.com/developer
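The question also asks about reading the li title values directly. A rough sketch of that alternative follows, again using a hypothetical HTML fragment rather than the real page. It reads each title attribute and, optionally, parses it with datetime.strptime. The format string "%A %d %B %Y" is an assumption based on the single example title in the question, and it relies on English day and month names in the current locale.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical markup (an assumption): titled li elements hold the show date.
html = """
<ul class="event-listings">
  <li class="with-date"><strong>Saturday 01 February 2020</strong></li>
  <li title="Saturday 01 February 2020"><time datetime="2020-02-01T20:00:00-0800"></time></li>
  <li title="Sunday 02 February 2020"><time datetime="2020-02-02T19:00:00-0800"></time></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# li["title"] returns the attribute value as a string.
titles = [li["title"] for li in soup.find_all("li", title=True)]

# Optionally normalize each title to an ISO date (YYYY-MM-DD).
iso_dates = [
    datetime.strptime(title, "%A %d %B %Y").date().isoformat()
    for title in titles
]

print(titles)
print(iso_dates)
```

Either list can be appended to the dates list from the question, depending on which format is wanted.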
