Beautiful Soup can't find the first tag (XML)

Question

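I'm using BeautifulSoup 4 (with the lxml parser) to parse an XML file from the MLB API. The API generates a scoreboard of the games for a given day, and I can't get Beautiful Soup to recognize one particular tag.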

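For example, I'm looking at today's games and trying to pull a team's score and name based on its away_file_code or home_file_code. For the Baltimore Orioles vs. Toronto Blue Jays game, the scoreboard XML looks like this: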

<games year="2017" month="04" day="16" modified_date="2017-04-17T01:42:57Z" next_day_date="2017-04-17">
<game id="2017/04/16/balmlb-tormlb-1" venue="Rogers Centre" game_pk="490271" time="1:07" time_date="2017/04/16 1:07" time_date_aw_lg="2017/04/16 1:07" time_date_hm_lg="2017/04/16 1:07" time_zone="ET" ampm="PM" first_pitch_et="" away_time="1:07" away_time_zone="ET" away_ampm="PM" home_time="1:07" home_time_zone="ET" home_ampm="PM" game_type="R" tiebreaker_sw="N" resume_date="" original_date="2017/04/16" time_zone_aw_lg="-4" time_zone_hm_lg="-4" time_aw_lg="1:07" aw_lg_ampm="PM" tz_aw_lg_gen="ET" time_hm_lg="1:07" hm_lg_ampm="PM" tz_hm_lg_gen="ET" venue_id="14" scheduled_innings="9" description="" away_name_abbrev="BAL" home_name_abbrev="TOR" away_code="bal" away_file_code="bal" away_team_id="110" away_team_city="Baltimore" away_team_name="Orioles" away_division="E" away_league_id="103" away_sport_code="mlb" home_code="tor" home_file_code="tor" home_team_id="141" home_team_city="Toronto" home_team_name="Blue Jays" home_division="E" home_league_id="103" home_sport_code="mlb" day="SUN" gameday_sw="P" double_header_sw="N" game_nbr="1" tbd_flag="N" away_games_back="-" home_games_back="6.5" away_games_back_wildcard="" home_games_back_wildcard="5.5" venue_w_chan_loc="CAXX0504" location="Toronto, Canada" gameday="2017_04_16_balmlb_tormlb_1" away_win="8" away_loss="3" home_win="2" home_loss="10" game_data_directory="/components/game/mlb/year_2017/month_04/day_16/gid_2017_04_16_balmlb_tormlb_1" league="AA">
<status status="Final" ind="F" reason="" inning="9" top_inning="N" b="0" s="0" o="3" inning_state="" note="" is_perfect_game="N" is_no_hitter="N"/>
<linescore>...</linescore>
<home_runs>...</home_runs>
<winning_pitcher id="605164" last="Bundy" first="Dylan" name_display_roster="Bundy" number="37" era="1.86" wins="2" losses="1"/>
<losing_pitcher id="457918" last="Happ" first="J.A." name_display_roster="Happ" number="33" era="4.50" wins="0" losses="3"/>
<save_pitcher id="" last="" first="" number="" name_display_roster="" era="0" wins="0" losses="0" saves="0" svo="0"/>
<links mlbtv="bam.media.launchPlayer({calendar_event_id:'14-490271-2017-04-16',media_type:'video'})" wrapup="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=wrap&c_id=mlb" home_audio="bam.media.launchPlayer({calendar_event_id:'14-490271-2017-04-16',media_type:'audio'})" away_audio="bam.media.launchPlayer({calendar_event_id:'14-490271-2017-04-16',media_type:'audio'})" home_preview="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=preview&c_id=mlb" away_preview="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=preview&c_id=mlb" preview="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=preview&c_id=mlb" tv_station="SNET-1"/>
<broadcast>...</broadcast>
<alerts text="Final score in Toronto: Baltimore 11, Toronto 4" brief_text="At TOR: Final - BAL 11, TOR 4" type="status"/>
<game_media>...</game_media>
<video_thumbnail>...</video_thumbnail>
<video_thumbnails>...</video_thumbnails>
</game>
<game>...</game> (etc...)

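Below is the code snippet I'm using to try to find the game (not games) tag and its attributes. The problem is that when I ask for game, it returns None. I can retrieve any other tag without trouble, though; status, for example, works fine.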

from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'xml')  # webpage is the XML file for today's games
tags = soup.findAll('game', {'home_file_code': 'tor'})  # supposed to find the game tags whose home_file_code matches the home team's abbreviation
for games in tags:
    print(games.find('status')['status'])  # works without an issue
    print(games.find('game')['home_file_code'])  # throws the error below, because games.find('game') is None

TypeError: 'NoneType' object is not subscriptable

Also, if I print the tag's children (print(list(games.children))), it returns everything except game.

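Am I missing something about why the first tag can't be found in the XML? I'm confused because this worked for me not long ago, and I'm not sure what I changed to cause the error.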

Answer 1

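It turns out I misunderstood the find function. find only searches a tag's descendants, so calling it on a game tag never returns that tag itself. To read an attribute of the tag you already have, index the tag with the attribute name. So, basically, I should have done the following: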

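from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'xml')  # webpage is the XML file for today's games
tags = soup.findAll('game', {'home_file_code': 'tor'})
for games in tags:
    print(games.find('status')['status'])  # status is a child of game, so find() reaches it
    print(games['home_file_code'])  # attribute of the matched game tag itself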

Now print(games['home_file_code']) finds home_file_code as expected, because the attribute already lives on the tag we looked up.
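
The difference between searching inside a tag and reading the tag's own attributes can be seen on a small input. The following is a minimal sketch, assuming bs4 and lxml are installed; the XML string is a trimmed, hypothetical copy of the scoreboard from the question with most attributes dropped, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of the scoreboard XML (most attributes dropped).
xml = """<games year="2017" month="04" day="16">
  <game id="2017/04/16/balmlb-tormlb-1" away_file_code="bal" home_file_code="tor">
    <status status="Final" inning="9"/>
  </game>
</games>"""

soup = BeautifulSoup(xml, "xml")  # the "xml" parser is provided by lxml

# find_all returns the matching <game> tags themselves.
game = soup.find_all("game", {"home_file_code": "tor"})[0]

# find() looks only at descendants, so a <game> never finds itself.
nested = game.find("game")

# Attributes of the matched tag are read by indexing the tag directly.
home = game["home_file_code"]
# .get() returns None instead of raising KeyError when the attribute is absent.
away = game.get("away_file_code")
# <status> is a child of <game>, so find() reaches it.
status = game.find("status")["status"]

print(nested, home, away, status)
```

Indexing raises KeyError for a missing attribute, so .get() may be the safer choice when an attribute is not guaranteed to be present in every game entry.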

I'm sure someone can give a more thorough answer, but this was the fundamental misunderstanding I had.

Answer 2

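I'm not the greatest programmer, but I'm pretty sure you're not finding the first tag because it isn't defined correctly. XML tags (if they contain anything) must have an opening and a closing part, like this: <games>year="2017" month="04" day="16"</games> rather than this: <games year="2017" month="04" day="16"> So first you need to fix your XML format, then go from there.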
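
Whether the attribute syntax is a problem can be checked directly. Attributes written inside an opening tag are ordinary XML, so a strict parser is expected to accept them. The sketch below uses Python's standard-library parser and assumes a trimmed, hypothetical copy of the question's sample (most attributes dropped, closing </games> added).

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical copy of the sample; attributes sit in the opening tags.
sample = """<games year="2017" month="04" day="16">
  <game away_file_code="bal" home_file_code="tor">
    <status status="Final"/>
  </game>
</games>"""

# fromstring raises ET.ParseError if the document is not well-formed.
root = ET.fromstring(sample)

root_attrs = root.attrib                 # attributes of the root <games> element
game = root.find("game")                 # direct child lookup
home = game.get("home_file_code")        # attribute of <game>

print(root.tag, root_attrs, home)
```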

2 answers

Answer 1: 0, 2017-04-17 02:59:18
Answer 2: 0, 2021-02-05 00:27:32
