Newbie trying to scrape data and break it up

I am able to scrape some data from a website, but I am having trouble breaking it up to display it in a table.

The code I use is:

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")
tablesright = soup.find_all('td', 'right')
tablesleft = soup.find_all('td', 'left')
print(tablesright + tablesleft)

This gives me a result like this:

====================== RESTART: E:/2017/Python2/box2.py   ======================
[<td class="right " data-stat="game_start_time">8:01 pm</td>, <td class="right " data-stat="visitor_pts">99</td>, <td class="right " data-stat="home_pts">102</td>, <td class="right " data-stat="game_start_time">10:30 pm</td>, <td class="right " data-stat="visitor_pts">122</td>, <td class="right " data-stat="home_pts">121</td>, <td class="right " data-stat="game_start_time">7:30 pm</td>, <td class="right " data-stat="visitor_pts">108</td>, <td class="right " data-stat="home_pts">100</td>, <td class="right " data-stat="game_start_time">8:30 pm</td>, <td class="right " data-stat="visitor_pts">117</td>, <td class="right " data-stat="home_pts">111</td>, <td class="right " data-stat="game_start_time">7:00 pm</td>, <td class="right " data-stat="visitor_pts">90</td>, <td class="right " data-stat="home_pts">102</td>, <

And the left part:

<td class="left " csk="BOS.201710170CLE" data-stat="visitor_team_name"><a href="/teams/BOS/2018.html">Boston Celtics</a></td>, <td class="left " csk="CLE.201710170CLE" data-stat="home_team_name"><a href="/teams/CLE/2018.html">Cleveland Cavaliers</a></td>, <td class="left " data-stat="game_remarks"></td>, <td class="left " csk="HOU.201710170GSW" data-stat="visitor_team_name"><a href="/teams/HOU/2018.html">Houston Rockets</a></td>, <td class="left " csk="GSW.201710170GSW" data-stat="home_team_name"><a href="/teams/GSW/2018.html">Golden State Warriors</a></td>, <td class="left " data-stat="game_remarks"></td>, <td class="left " csk="MIL.201710180BOS" data-stat="visitor_team_name"><a href="/teams/MIL/2018.html">Milwaukee Bucks</a></td>, <td class="left " csk="BOS.201710180BOS" data-stat="home_team_name"><a href="/teams/BOS/2018.html">Boston Celtics</a></td>, <td class="left " data-stat="game_remarks"></td>, <td class="left " csk="ATL.201710180DAL" data-

OK, so now I cannot figure out how to break the result up so that it produces a nice table like this:

Game start time    Home team.     Score.   Away team.    Score
7pm.               Boston.        104.     Golden state.  103

I'm pulling my hair out trying to figure it out.

Thanks in advance.

You could try reading the page into a pandas DataFrame instead of using the HTML parser, and then decide how to manipulate that DataFrame into showing the result you need.

Example:

import pandas as pd


url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
dfs = pd.read_html(url, match="Start")
print(dfs[0])

There are examples of how to do that in the pandas documentation, as well as many answered questions on Stack Overflow. Source: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
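Once the table is in a DataFrame, getting the layout from the question is mostly a matter of selecting and renaming columns. Here is a minimal sketch, assuming the scraped frame has columns named `Start (ET)`, `Visitor/Neutral`, `PTS`, `Home/Neutral` and `PTS.1` (these names are an assumption; check `dfs[0].columns` for the real ones). The frame is built by hand from the first two games so the snippet runs without network access:

```python
import pandas as pd

# Hand-built stand-in for dfs[0]; the column names are assumed, not verified.
df = pd.DataFrame({
    "Start (ET)": ["8:01 pm", "10:30 pm"],
    "Visitor/Neutral": ["Boston Celtics", "Houston Rockets"],
    "PTS": [99, 122],
    "Home/Neutral": ["Cleveland Cavaliers", "Golden State Warriors"],
    "PTS.1": [102, 121],
})

# Map each source column to the header wanted in the final table.
# The dict order also sets the column order of the result.
wanted = {
    "Start (ET)": "Game start time",
    "Home/Neutral": "Home team",
    "PTS.1": "Home score",
    "Visitor/Neutral": "Away team",
    "PTS": "Away score",
}
table = df[list(wanted)].rename(columns=wanted)

print(table.to_string(index=False))
```

With the real data, replacing the hand-built `df` with `dfs[0]` should be enough, provided the column names match.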

I don't know whether you want a solution with pandas; this is one without it, using the more advanced attrs keyword and standard Python format to get a formatted table.

Note that the widths in format are chosen manually and do not adjust to the actual data.

import requests
from bs4 import BeautifulSoup


url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")
game_start_times = soup.find_all('td', attrs={"data-stat": "game_start_time", "class": "right"})
visitor_team_names = soup.find_all('td', attrs={"data-stat": "visitor_team_name", "class": "left"})
visitor_ptss = soup.find_all('td', attrs={"data-stat": "visitor_pts", "class": "right"})
home_team_names = soup.find_all('td', attrs={"data-stat": "home_team_name", "class": "left"})
home_pts = soup.find_all('td', attrs={"data-stat": "home_pts", "class": "right"})

for i in range(len(game_start_times)):
    print('{:10s} {:28s} {:5s} {:28s} {:5s}'.format(game_start_times[i].text.strip(),
                                  visitor_team_names[i].text.strip(),
                                  visitor_ptss[i].text.strip(),
                                  home_team_names[i].text.strip(),
                                  home_pts[i].text.strip()))

8:01 pm    Boston Celtics               99    Cleveland Cavaliers          102
10:30 pm   Houston Rockets              122   Golden State Warriors        121
7:30 pm    Milwaukee Bucks              108   Boston Celtics               100
8:30 pm    Atlanta Hawks                117   Dallas Mavericks             111
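The widths in the format call above are hard-coded. If that becomes a problem, they can be computed from the data before printing. Here is a minimal sketch, using a hand-copied sample of rows in place of the scraped values (the `rows` list and the `format_table` helper are illustrative names, not part of the answer above):

```python
# Sample rows copied from the output above; in practice these would be
# built from the .text of the scraped cells.
rows = [
    ("8:01 pm", "Boston Celtics", "99", "Cleveland Cavaliers", "102"),
    ("10:30 pm", "Houston Rockets", "122", "Golden State Warriors", "121"),
]

def format_table(rows):
    """Return one string per row, each column padded to its widest cell."""
    # zip(*rows) transposes the rows into columns.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    return [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]

lines = format_table(rows)
for line in lines:
    print(line)
```

The trade-off is that all rows have to be collected before the first one can be printed.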

This would work. Tune it to your needs and use pandas afterwards.

import requests
from bs4 import BeautifulSoup


url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")

rows = soup.select('#schedule > tbody > tr')

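for row in rows:
    rights = row.find_all("td", "right")
    lefts = row.find_all("td", "left")

    # Skip rows without the expected cells (e.g. repeated header rows).
    if len(rights) < 3 or len(lefts) < 2:
        continue

    # start time, visitor team, visitor points, home team, home points
    print(rights[0].text, lefts[0].text, rights[1].text, lefts[1].text, rights[2].text)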

For such a simple structure, I would just drop the parsing libraries and do it with re (regular expressions):

first, one findall to get all tr tags

then one findall to get all td/th tags inside each tr tag

then one sub to strip out any tags inside the fields (mainly a tags)

#!/usr/bin/python3

import re

import requests

url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)
content = r.text  # str rather than bytes, so the str patterns below apply

# One dict per table row: data-stat name -> cell text with inner tags stripped.
data = [
    {
        k: re.sub(r'<.+?>', '', v)
        for (k, v) in re.findall(r'<t[dh].+?data-stat="(.*?)".*?>(.*?)</t[dh]', tr)
    }
    for tr in re.findall(r'<tr.+?>(.+?)</tr', content)
]

for game in data:
    print(game['date_game'])
    for info in game:
        print("  %s = %s" % (info, game[info]))

This gives a nice dict structure (data) which can easily be used for displaying the results however you like:

$ ./scores_url.py 
Tue, Oct 17, 2017
  game_remarks = 
  box_score_text = Box Score
  home_team_name = Cleveland Cavaliers
  visitor_team_name = Boston Celtics
  game_start_time = 8:01 pm
  date_game = Tue, Oct 17, 2017
  overtimes = 
  visitor_pts = 99
  home_pts = 102
Tue, Oct 17, 2017
  game_remarks = 
  box_score_text = Box Score
  home_team_name = Golden State Warriors
  visitor_team_name = Houston Rockets
  game_start_time = 10:30 pm
  date_game = Tue, Oct 17, 2017
  overtimes = 
  visitor_pts = 122
  home_pts = 121
Wed, Oct 18, 2017
  game_remarks = 
  box_score_text = Box Score
  home_team_name = Boston Celtics
  visitor_team_name = Milwaukee Bucks
  game_start_time = 7:30 pm
  date_game = Wed, Oct 18, 2017
  overtimes = 
  visitor_pts = 108
  home_pts = 100
...

or in the style of your example:

cols = [
        ['game_start_time',15,"Game start time"],
        ['home_team_name',25,"Home team."],
        ['home_pts',7,"Score."],
        ['visitor_team_name',25,"Away team."],
        ['visitor_pts',7,"Score."]
       ]

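# Header row: right-align each title to its column width.
for col in cols:
    print(("%%%ds" % col[1]) % col[2], end=" ")
print()

# One line per game, using the same widths.
for game in data:
    for col in cols:
        print(("%%%ds" % col[1]) % game[col[0]], end=" ")
    print()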

which gives something like this:

Game start time                Home team.  Score.                Away team.  Score.
        8:01 pm       Cleveland Cavaliers     102            Boston Celtics      99
       10:30 pm     Golden State Warriors     121           Houston Rockets     122
        7:30 pm            Boston Celtics     100           Milwaukee Bucks     108
        8:30 pm          Dallas Mavericks     111             Atlanta Hawks     117
        7:00 pm           Detroit Pistons     102         Charlotte Hornets      90
        7:00 pm            Indiana Pacers     140             Brooklyn Nets     131
        8:00 pm         Memphis Grizzlies     103      New Orleans Pelicans      91
    ...
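
To see what each of the three regex steps produces without hitting the site, the same idea can be run on a small inline sample. This is only a sketch: `SAMPLE_HTML` is hand-written to resemble the markup shown in the question, `parse_games` is an illustrative name, and the patterns are slightly tightened variants of the ones above (they use `[^>]*` and `re.DOTALL`), not the exact originals.

```python
import re

# Hand-written sample modeled on the markup in the question (not scraped).
SAMPLE_HTML = """
<tr><th data-stat="date_game">Tue, Oct 17, 2017</th>
<td class="right " data-stat="game_start_time">8:01 pm</td>
<td class="left " data-stat="visitor_team_name"><a href="/teams/BOS/2018.html">Boston Celtics</a></td>
<td class="right " data-stat="visitor_pts">99</td></tr>
<tr><th data-stat="date_game">Tue, Oct 17, 2017</th>
<td class="right " data-stat="game_start_time">10:30 pm</td>
<td class="left " data-stat="visitor_team_name"><a href="/teams/HOU/2018.html">Houston Rockets</a></td>
<td class="right " data-stat="visitor_pts">122</td></tr>
"""

def parse_games(html):
    games = []
    # Step 1: one findall to get the inside of every <tr>.
    for tr in re.findall(r'<tr[^>]*>(.*?)</tr>', html, re.DOTALL):
        # Step 2: one findall to get (data-stat name, raw cell content) pairs.
        cells = re.findall(
            r'<t[dh][^>]*data-stat="([^"]*)"[^>]*>(.*?)</t[dh]>', tr, re.DOTALL
        )
        # Step 3: one sub per cell to strip nested tags such as <a ...>.
        games.append(
            {name: re.sub(r'<[^>]+>', '', raw).strip() for name, raw in cells}
        )
    return games

games = parse_games(SAMPLE_HTML)
for game in games:
    print(game)
```

Regex parsing like this depends on the page keeping its current markup, so it is worth re-checking the patterns if the site layout changes.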
