
Beautiful Soup Can't Find the First Tag (XML)

I am using BeautifulSoup 4 (with the lxml parser) to parse an XML file from the MLB API. The API generates a scoreboard of the games for a particular day, and I'm having trouble getting Beautiful Soup to recognize a particular tag.

For instance, I am looking at today's games and trying to extract the scores and names for a certain team based on its away_file_code or home_file_code. For the Baltimore Orioles vs. Toronto Blue Jays game, the scoreboard XML looks like this:

<games year="2017" month="04" day="16" modified_date="2017-04-17T01:42:57Z" next_day_date="2017-04-17">
<game id="2017/04/16/balmlb-tormlb-1" venue="Rogers Centre" game_pk="490271" time="1:07" time_date="2017/04/16 1:07" time_date_aw_lg="2017/04/16 1:07" time_date_hm_lg="2017/04/16 1:07" time_zone="ET" ampm="PM" first_pitch_et="" away_time="1:07" away_time_zone="ET" away_ampm="PM" home_time="1:07" home_time_zone="ET" home_ampm="PM" game_type="R" tiebreaker_sw="N" resume_date="" original_date="2017/04/16" time_zone_aw_lg="-4" time_zone_hm_lg="-4" time_aw_lg="1:07" aw_lg_ampm="PM" tz_aw_lg_gen="ET" time_hm_lg="1:07" hm_lg_ampm="PM" tz_hm_lg_gen="ET" venue_id="14" scheduled_innings="9" description="" away_name_abbrev="BAL" home_name_abbrev="TOR" away_code="bal" away_file_code="bal" away_team_id="110" away_team_city="Baltimore" away_team_name="Orioles" away_division="E" away_league_id="103" away_sport_code="mlb" home_code="tor" home_file_code="tor" home_team_id="141" home_team_city="Toronto" home_team_name="Blue Jays" home_division="E" home_league_id="103" home_sport_code="mlb" day="SUN" gameday_sw="P" double_header_sw="N" game_nbr="1" tbd_flag="N" away_games_back="-" home_games_back="6.5" away_games_back_wildcard="" home_games_back_wildcard="5.5" venue_w_chan_loc="CAXX0504" location="Toronto, Canada" gameday="2017_04_16_balmlb_tormlb_1" away_win="8" away_loss="3" home_win="2" home_loss="10" game_data_directory="/components/game/mlb/year_2017/month_04/day_16/gid_2017_04_16_balmlb_tormlb_1" league="AA">
<status status="Final" ind="F" reason="" inning="9" top_inning="N" b="0" s="0" o="3" inning_state="" note="" is_perfect_game="N" is_no_hitter="N"/>
<linescore>...</linescore>
<home_runs>...</home_runs>
<winning_pitcher id="605164" last="Bundy" first="Dylan" name_display_roster="Bundy" number="37" era="1.86" wins="2" losses="1"/>
<losing_pitcher id="457918" last="Happ" first="J.A." name_display_roster="Happ" number="33" era="4.50" wins="0" losses="3"/>
<save_pitcher id="" last="" first="" number="" name_display_roster="" era="0" wins="0" losses="0" saves="0" svo="0"/>
<links mlbtv="bam.media.launchPlayer({calendar_event_id:'14-490271-2017-04-16',media_type:'video'})" wrapup="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=wrap&c_id=mlb" home_audio="bam.media.launchPlayer({calendar_event_id:'14-490271-2017-04-16',media_type:'audio'})" away_audio="bam.media.launchPlayer({calendar_event_id:'14-490271-2017-04-16',media_type:'audio'})" home_preview="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=preview&c_id=mlb" away_preview="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=preview&c_id=mlb" preview="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=preview&c_id=mlb" tv_station="SNET-1"/>
<broadcast>...</broadcast>
<alerts text="Final score in Toronto: Baltimore 11, Toronto 4" brief_text="At TOR: Final - BAL 11, TOR 4" type="status"/>
<game_media>...</game_media>
<video_thumbnail>...</video_thumbnail>
<video_thumbnails>...</video_thumbnails>
</game>
<game>...</game> (etc...)

Below is a snippet of the code I am using to try to find the game (not games) tag and its attributes. The issue is that when I request game, it returns None. However, I can retrieve any other tag without an issue; status, for example, works perfectly fine.

from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'xml')  # webpage is the XML file for today's games
tags = soup.findAll('game', {'home_file_code': 'tor'})  # supposed to find the game tags whose home_file_code matches the home team's abbreviation
for games in tags:
    print(games.find('status')['status'])  # works without an issue
    print(games.find('game')['home_file_code'])  # throws the error below, because games.find('game') is None

TypeError: 'NoneType' object is not subscriptable

Also, if I print the list of children (print(list(games.children))), it returns everything except game.

Is there something I'm missing about the XML that explains why it can't grab that first tag? I'm pretty confused because this was working for me not too long ago, and I'm not sure what I changed that's causing the error.

It appears I misunderstood the find function. find searches only among a tag's descendants, so calling games.find('game') on a game tag looks for a nested game tag and finds none. To read an attribute of the tag itself, index the tag with the attribute name. So, essentially, I should have been doing the following:

soup = BeautifulSoup(webpage, 'xml')  # webpage is the XML file for today's games
tags = soup.findAll('game', {'home_file_code': 'tor'})
for games in tags:
    print(games.find('status')['status'])
    print(games['home_file_code'])

Now print(games['home_file_code']) prints the home_file_code as expected, because the attribute already exists on the tag we looked up.
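To make the difference concrete, here is a minimal, self-contained sketch. The SAMPLE string is a hypothetical, heavily trimmed stand-in for the real scoreboard (the second game and its team codes are made up for illustration), and it assumes both bs4 and lxml are installed, since the 'xml' parser depends on lxml.

```python
from bs4 import BeautifulSoup  # the 'xml' parser additionally requires lxml

# Hypothetical, trimmed-down scoreboard; the real feed has many more attributes.
SAMPLE = """<games year="2017" month="04" day="16">
  <game home_file_code="tor" away_file_code="bal">
    <status status="Final" inning="9"/>
  </game>
  <game home_file_code="nyy" away_file_code="stl">
    <status status="Final" inning="9"/>
  </game>
</games>"""

soup = BeautifulSoup(SAMPLE, "xml")
matches = soup.find_all("game", {"home_file_code": "tor"})

for game in matches:
    # find() searches only inside the tag, so a tag never finds itself.
    print(game.find("game"))
    # Attributes of the matched tag are read by indexing the tag directly.
    print(game["home_file_code"])
    # Child tags are still reached with find(), then indexed the same way.
    print(game.find("status")["status"])
    # get() returns None instead of raising KeyError for a missing attribute.
    print(game.get("no_such_attribute"))
```

Indexing with square brackets raises KeyError when the attribute is absent, so get() may be the safer choice if some games in the feed could lack a given attribute.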

I'm sure someone can give a more thorough answer, but that was the fundamental misunderstanding I was having.
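As a rough cross-check that this is ordinary tree behavior rather than a Beautiful Soup quirk, the same lookup can be sketched with the standard library's xml.etree.ElementTree, which is not what the question uses. The SAMPLE string is again a hypothetical, trimmed stand-in for the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down scoreboard; not the full MLB feed.
SAMPLE = """<games year="2017" month="04" day="16">
  <game home_file_code="tor" away_file_code="bal">
    <status status="Final" inning="9"/>
  </game>
</games>"""

root = ET.fromstring(SAMPLE)  # root is the <games> element

# Select the direct <game> children whose home_file_code attribute is 'tor'.
tor_games = root.findall("game[@home_file_code='tor']")

for game in tor_games:
    # Searching a <game> element for a child <game> finds nothing.
    nested = game.find("game")
    # The element's own attributes are read with get() on the element itself.
    home_code = game.get("home_file_code")
    final_status = game.find("status").get("status")
    print(nested, home_code, final_status)
```

In both libraries the search methods look beneath the element they are called on, while attributes are read from the element itself.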

I'm not the greatest of programmers, but I'm pretty sure you're not finding the first tag because it is incorrectly defined. XML tags, if they contain anything, must have an opening and a closing part like this: <games>year="2017" month="04" day="16"</games> and not like this: <games year="2017" month="04" day="16">. So the first thing you need to do is fix your XML formatting, and then take it from there.
