Beautiful Soup Parsing Troubles

I can't parse this XML, which doesn't seem to have any references to a class.

A snippet of my code:

sock = urllib2.urlopen(l)
link = sock.read()

soup = BeautifulSoup(link,"xml")

FirstNameHome=soup.find('home_probable_pitcher','first_name')

I want to find the first name for both the home and away team:

(There are only two instances, so I'm not sure if I should be using findAll.)

Here is the source using soup.prettify:

 LookupError: unknown encoding: <?xml version="1.0" encoding="UTF-8"?><!--Copyright 2017 MLB Advanced Media, L.P.  Use of any content on this page acknowledges agreement to the terms posted here http://gdx.mlb.com/components/copyright.txt-->
<game id="2017/06/02/nyamlb-tormlb-1" venue="Rogers Centre" game_pk="490921"
      time="7:07"
      time_date="2017/06/02 7:07"
      time_date_aw_lg="2017/06/02 7:07"
      time_date_hm_lg="2017/06/02 7:07"
      time_zone="ET"
      ampm="PM"
      first_pitch_et=""
      away_time="7:07"
      away_time_zone="ET"
      away_ampm="PM"
      home_time="7:07"
      home_time_zone="ET"
      home_ampm="PM"
      game_type="R"
      tiebreaker_sw="N"
      original_date="2017/06/02"
      time_zone_aw_lg="-4"
      time_zone_hm_lg="-4"
      time_aw_lg="7:07"
      aw_lg_ampm="PM"
      tz_aw_lg_gen="ET"
      time_hm_lg="7:07"
      hm_lg_ampm="PM"
      tz_hm_lg_gen="ET"
      venue_id="14"
      scheduled_innings="9"
      away_name_abbrev="NYY"
      home_name_abbrev="TOR"
      away_code="nya"
      away_file_code="nyy"
      away_team_id="147"
      away_team_city="NY Yankees"
      away_team_name="Yankees"
      away_division="E"
      away_league_id="103"
      away_sport_code="mlb"
      home_code="tor"
      home_file_code="tor"
      home_team_id="141"
      home_team_city="Toronto"
      home_team_name="Blue Jays"
      home_division="E"
      home_league_id="103"
      home_sport_code="mlb"
      day="FRI"
      gameday_sw="P"
      double_header_sw="N"
      game_nbr="1"
      tbd_flag="N"
      venue_w_chan_loc="CAXX0504"
      location="Toronto, Canada"
      gameday_link="2017_06_02_nyamlb_tormlb_1"
      away_win="30"
      away_loss="20"
      home_win="26"
      home_loss="27"
      game_data_directory="/components/game/mlb/year_2017/month_06/day_02/gid_2017_06_02_nyamlb_tormlb_1"
      league="AA"
      inning_state=""
      note=""
      status="Preview"
      ind="S"
      tv_station="SNET-1, MLBN (out-of-market only)">
   <home_probable_pitcher id="434538" first_name="Francisco" first="Francisco" last_name="Liriano"
                          last="Liriano"
                          name_display_roster="Liriano"
                          number="45"
                          throwinghand="LHP"
                          wins="2"
                          losses="2"
                          era="6.35"
                          s_wins="2"
                          s_losses="2"
                          s_era="6.35"
                          stats_season="2017"
                          stats_type="R"/>
   <away_probable_pitcher id="501381" first_name="Michael" first="Michael" last_name="Pineda"
                          last="Pineda"
                          name_display_roster="Pineda"
                          number="35"
                          throwinghand="RHP"
                          wins="6"
                          losses="2"
                          era="3.32"
                          s_wins="6"
                          s_losses="2"
                          s_era="3.32"
                          stats_season="2017"
                          stats_type="R"/>
   <game_media>
      <media type="game" calendar_event_id="14-490921-2017-06-02"
             start="2017-06-02T19:07:00-0400"
             title="NYY @ TOR"
             has_mlbtv="true"
             free="NO"
             enhanced="N"
             media_state="media_off"
             thumbnail="http://mediadownloads.mlb.com/mlbam/preview/nyator_490921_th_7_preview.jpg"/>
   </game_media>
</game>

If we write

# for Python 3
# import urllib.request

import urllib2

from bs4 import BeautifulSoup

l = 'http://gd2.mlb.com/components/game/mlb/year_2017/month_06/day_03/gid_2017_06_03_arimlb_miamlb_1/linescore.xml'

sock = urllib2.urlopen(l)
# for Python 3
# sock = urllib.request.urlopen(l)
link = sock.read()

soup = BeautifulSoup(link, "xml")

FirstNameHome = soup.find('home_probable_pitcher').attrs['first_name']
print(FirstNameHome)

it gives

Edinson
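
Since the question asks for both the home and the away pitcher, here is a minimal sketch of reading first_name from both elements. It deliberately swaps in the standard library's xml.etree.ElementTree instead of Beautiful Soup so that it runs without lxml, and it parses a trimmed-down inline copy of the document rather than fetching the URL. SAMPLE_XML and pitcher_first_names are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for linescore.xml (an assumption for this sketch);
# only the attributes used below are kept. Bytes mimic what sock.read() returns.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<game id="2017/06/02/nyamlb-tormlb-1" venue="Rogers Centre">
   <home_probable_pitcher id="434538" first_name="Francisco" last_name="Liriano"/>
   <away_probable_pitcher id="501381" first_name="Michael" last_name="Pineda"/>
</game>"""


def pitcher_first_names(xml_bytes):
    """Return a dict with the 'home' and 'away' probable pitchers' first names."""
    root = ET.fromstring(xml_bytes)  # root is the <game> element
    names = {}
    for side in ("home", "away"):
        # Both pitcher elements are direct children of <game>
        node = root.find(side + "_probable_pitcher")
        # .get() returns None instead of raising when the attribute is missing
        names[side] = node.get("first_name") if node is not None else None
    return names


names = pitcher_first_names(SAMPLE_XML)
print(names["home"], names["away"])
```

With only two known elements, looking each one up by tag name is enough; a find-all style search is not needed.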

also

print(soup.prettify(encoding='utf-8'))

gives

<?xml version="1.0" encoding="utf-8"?>
<!--Copyright 2017 MLB Advanced Media, L.P.  Use of any content on this page acknowledges agreement to the terms posted here http://gdx.mlb.com/components/copyright.txt-->
<game ampm="PM" aw_lg_ampm="PM" away_ampm="PM" away_code="ari" away_division="W" away_file_code="ari" away_league_id="104" away_loss="22" away_name_abbrev="ARI" away_sport_code="mlb" away_team_city="Arizona" away_team_id="109" away_team_name="D-backs" away_time="1:10" away_time_zone="MST" away_win="34" day="SAT" double_header_sw="N" first_pitch_et="" game_data_directory="/components/game/mlb/year_2017/month_06/day_03/gid_2017_06_03_arimlb_miamlb_1" game_nbr="1" game_pk="490927" game_type="R" gameday_link="2017_06_03_arimlb_miamlb_1" gameday_sw="P" hm_lg_ampm="PM" home_ampm="PM" home_code="mia" home_division="E" home_file_code="mia" home_league_id="104" home_loss="31" home_name_abbrev="MIA" home_sport_code="mlb" home_team_city="Miami" home_team_id="146" home_team_name="Marlins" home_time="4:10" home_time_zone="ET" home_win="21" id="2017/06/03/arimlb-miamlb-1" ind="S" inning_state="" league="NN" location="Miami, FL" note="" original_date="2017/06/03" scheduled_innings="9" status="Preview" tbd_flag="N" tiebreaker_sw="N" time="4:10" time_aw_lg="4:10" time_date="2017/06/03 4:10" time_date_aw_lg="2017/06/03 4:10" time_date_hm_lg="2017/06/03 4:10" time_hm_lg="4:10" time_zone="ET" time_zone_aw_lg="-4" time_zone_hm_lg="-4" tv_station="FS-F, MLBN (out-of-market only)" tz_aw_lg_gen="ET" tz_hm_lg_gen="ET" venue="Marlins Park" venue_id="4169" venue_w_chan_loc="USFL0316">
 <home_probable_pitcher era="4.44" first="Edinson" first_name="Edinson" id="450172" last="Volquez" last_name="Volquez" losses="7" name_display_roster="Volquez" number="36" s_era="4.44" s_losses="7" s_wins="1" stats_season="2017" stats_type="R" throwinghand="RHP" wins="1"/>
 <away_probable_pitcher era="3.47" first="Randall" first_name="Randall" id="517414" last="Delgado" last_name="Delgado" losses="0" name_display_roster="Delgado" number="48" s_era="3.47" s_losses="0" s_wins="1" stats_season="2017" stats_type="R" throwinghand="RHP" wins="1"/>
 <game_media>
  <media calendar_event_id="14-490927-2017-06-03" enhanced="N" free="NO" has_mlbtv="true" media_state="media_off" start="2017-06-03T16:10:00-0400" thumbnail="http://mediadownloads.mlb.com/mlbam/preview/arimia_490927_th_7_preview.jpg" title="ARI @ MIA" type="game"/>
 </game_media>
</game>

EDIT

I can reproduce your error only when I pass the link object (or str(soup)) to the prettify method:

soup.prettify(link)

That is not what you need. The arguments prettify accepts are encoding (for example 'utf-8') and formatter (defaults to 'minimal'), not raw contents. The first positional argument is taken as the encoding name, which is why the whole document shows up in the "unknown encoding" message. So just write

pretty = soup.prettify()

and it will give

>>> type(pretty)
<type 'unicode'>

or specify the encoding

>>> pretty = soup.prettify(encoding='utf-8')

and it will give

>>> type(pretty)
<type 'str'>
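
The type names above come from Python 2. As a rough sketch of the same two calls on Python 3, where the results are str and bytes respectively: the inline snippet is a made-up stand-in for the real document, and html.parser is used only so the example runs without lxml installed (the code in this answer uses the "xml" parser).

```python
from bs4 import BeautifulSoup

# Tiny stand-in document (an assumption for this sketch, not the real feed)
snippet = '<game venue="Marlins Park"><home_probable_pitcher first_name="Edinson"/></game>'

# html.parser is chosen here only to avoid the lxml dependency of the "xml" parser
soup = BeautifulSoup(snippet, "html.parser")

as_text = soup.prettify()                   # no encoding given: returns text
as_bytes = soup.prettify(encoding="utf-8")  # encoding given: returns encoded bytes

print(type(as_text).__name__)   # str
print(type(as_bytes).__name__)  # bytes
```

Either way, the only things passed to prettify are an encoding name and, optionally, a formatter.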
