How does BeautifulSoup get the content inside a span?

I'm trying to parse fixture contents from a website. I managed to parse the Match column, but I'm having difficulty parsing the date and time columns.

My program:

import re
import pytz
import requests
import datetime
from bs4 import BeautifulSoup
from espncricinfo.exceptions import MatchNotFoundError, NoScorecardError
from espncricinfo.match import Match

bigbash_article_link = "http://www.espncricinfo.com/ci/content/series/1128817.html?template=fixtures"

r = requests.get(bigbash_article_link)
bigbash_article_html = r.text

soup = BeautifulSoup(bigbash_article_html, "html.parser")


bigbash1_items = soup.find_all("span",{"class": "fixture_date"})
bigbash_items = soup.find_all("span",{"class": "play_team"})
bigbash_article_dict = {}
date_dict = {}

for div in bigbash_items:
    a = div.find('a')['href']
    bigbash_article_dict[div.find('a').string] = a
print(bigbash_article_dict)
for div in bigbash1_items:
    a = div.find('span').string
    date_dict[div.find('span').string] = a
print(date_dict)

When I execute this, I get the print(bigbash_article_dict) output, but the date_dict part gives me an error. How can I parse the date and time content?

Judging by your code, you want to get the content inside the span tag, so you should use div.contents to get the contents of the span.

Your question title should also be "How does BeautifulSoup get the content inside a span?".

For example, for one element the loop below prints:

    div= <span class="fixture_date">
    Thu Feb 22
                            </span>
    div.contents[0].strip()= Thu Feb 22
    ------------

for div in bigbash1_items:
    print("div=", div)
    print("div.contents[0].strip()=", div.contents[0].strip(), "\r\n------------\r\n")
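The same idea can be tried offline. This is only a minimal sketch: the HTML string is a made-up sample modelled on the output shown above (the time value in particular is hypothetical), not markup fetched from the site. It compares contents[0].strip() with get_text(strip=True), another BeautifulSoup call that should give the same result when the span holds nothing but text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the output shown above.
html = """
<span class="fixture_date">
    Thu Feb 22
                        </span>
<span class="fixture_date">
    8:40 AM GMT
                        </span>
"""

soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all("span", {"class": "fixture_date"})

# contents[0] is the first child node; here it is the text node inside the span.
via_contents = [span.contents[0].strip() for span in spans]

# get_text(strip=True) collects all text inside the tag and trims the whitespace.
via_get_text = [span.get_text(strip=True) for span in spans]

print(via_contents)
print(via_get_text)
```

If a span ever contained nested tags, contents[0] would return only the first child, while get_text would join all the text, so the two could differ on other markup.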

Elements with the class fixture_date don't contain a <span>; they are the span. You can get the data from them directly.

So instead of this:

div.find('span').string

You can do this:

div.string

Judging from the structure of the website, this would return the date on odd iterations (1, 3, ...) and the time on even iterations (2, 4, ...).
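If the dates and times really do alternate like that, they can be paired up with slicing. The sketch below assumes that alternation holds and uses a made-up HTML sample (all four values are hypothetical) rather than the live page; the names texts and fixtures are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: date and time spans alternate, as described above.
html = """
<span class="fixture_date">Thu Feb 22</span>
<span class="fixture_date">8:40 AM GMT</span>
<span class="fixture_date">Fri Feb 23</span>
<span class="fixture_date">3:15 AM GMT</span>
"""

soup = BeautifulSoup(html, "html.parser")
texts = [span.string.strip() for span in soup.find_all("span", {"class": "fixture_date"})]

# Indexes 0, 2, ... are the 1st, 3rd, ... iterations: the dates.
# Indexes 1, 3, ... are the 2nd, 4th, ... iterations: the times.
fixtures = list(zip(texts[0::2], texts[1::2]))
print(fixtures)
```

Pairing by position is fragile: if the page ever omits a time for one fixture, every later pair would shift, so it is worth checking that the list has an even length before relying on it.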

Oh, and I'd advise you to make the variable name meaningful, so rename div to span, because in your code all the div variables actually contain <span> tags ;)
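To see why the original lookup failed, here is a small sketch with the variable renamed to span. It runs on a one-element, made-up sample in the shape of the fixture_date spans, so treat it as an illustration rather than the site's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical one-element sample in the shape of the fixture_date spans.
soup = BeautifulSoup('<span class="fixture_date">Thu Feb 22</span>', "html.parser")

for span in soup.find_all("span", {"class": "fixture_date"}):
    # find() searches only the descendants of the tag, and this span has no nested <span>.
    nested = span.find("span")
    print(nested)       # None, so nested.string would raise AttributeError
    print(span.string)  # the text held directly by the span
```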
