Newbie trying to scrape data and break it up
I am able to scrape some data from a website, but I am having trouble breaking it up to display it in a table.
The code I use is:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
tablesright = soup.find_all('td', 'right')
tablesleft = soup.find_all('td', 'left')
print(tablesright + tablesleft)
This gives me a result like this:
====================== RESTART: E:/2017/Python2/box2.py ======================
[<td class="right " data-stat="game_start_time">8:01 pm</td>, <td class="right " data-stat="visitor_pts">99</td>, <td class="right " data-stat="home_pts">102</td>, <td class="right " data-stat="game_start_time">10:30 pm</td>, <td class="right " data-stat="visitor_pts">122</td>, <td class="right " data-stat="home_pts">121</td>, <td class="right " data-stat="game_start_time">7:30 pm</td>, <td class="right " data-stat="visitor_pts">108</td>, <td class="right " data-stat="home_pts">100</td>, <td class="right " data-stat="game_start_time">8:30 pm</td>, <td class="right " data-stat="visitor_pts">117</td>, <td class="right " data-stat="home_pts">111</td>, <td class="right " data-stat="game_start_time">7:00 pm</td>, <td class="right " data-stat="visitor_pts">90</td>, <td class="right " data-stat="home_pts">102</td>, <
and the left part:
<td class="left " csk="BOS.201710170CLE" data-stat="visitor_team_name"><a href="/teams/BOS/2018.html">Boston Celtics</a></td>, <td class="left " csk="CLE.201710170CLE" data-stat="home_team_name"><a href="/teams/CLE/2018.html">Cleveland Cavaliers</a></td>, <td class="left " data-stat="game_remarks"></td>, <td class="left " csk="HOU.201710170GSW" data-stat="visitor_team_name"><a href="/teams/HOU/2018.html">Houston Rockets</a></td>, <td class="left " csk="GSW.201710170GSW" data-stat="home_team_name"><a href="/teams/GSW/2018.html">Golden State Warriors</a></td>, <td class="left " data-stat="game_remarks"></td>, <td class="left " csk="MIL.201710180BOS" data-stat="visitor_team_name"><a href="/teams/MIL/2018.html">Milwaukee Bucks</a></td>, <td class="left " csk="BOS.201710180BOS" data-stat="home_team_name"><a href="/teams/BOS/2018.html">Boston Celtics</a></td>, <td class="left " data-stat="game_remarks"></td>, <td class="left " csk="ATL.201710180DAL" data-
OK, so now I cannot figure out how to break the result up so that I get a nice table like this:
Game start time Home team. Score. Away team. Score
7pm. Boston. 104. Golden state. 103
I'm pulling my hair out trying to figure it out.
Thanks in advance.
You could try reading the page into a pandas DataFrame instead of using the HTML parser, and then decide how to manipulate that DataFrame to show the result you need.
Example:
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
dfs = pd.read_html(url, match="Start")
print(dfs[0])
There are examples of how to do that in the pandas documentation, as well as in many existing questions on Stack Overflow. Source: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
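Once the table is in a DataFrame, getting the layout from the question is mostly a matter of selecting and renaming columns. Here is a minimal sketch that uses a small hand-built DataFrame in place of dfs[0] so it runs offline. The column labels (Start (ET), Visitor/Neutral, PTS, Home/Neutral, PTS.1) are an assumption about what read_html returns for this page, so check dfs[0].columns before relying on them.

```python
import pandas as pd

# Stand-in for dfs[0]; the column labels are assumed, not read from the page.
games = pd.DataFrame({
    "Start (ET)": ["8:01 pm", "10:30 pm"],
    "Visitor/Neutral": ["Boston Celtics", "Houston Rockets"],
    "PTS": [99, 122],
    "Home/Neutral": ["Cleveland Cavaliers", "Golden State Warriors"],
    "PTS.1": [102, 121],
})

# Desired column order (dict order) and the new header for each column.
wanted = {
    "Start (ET)": "Game start time",
    "Home/Neutral": "Home team",
    "PTS.1": "Home score",
    "Visitor/Neutral": "Away team",
    "PTS": "Away score",
}
table = games[list(wanted)].rename(columns=wanted)

print(table.to_string(index=False))
```

With the real page, replacing games with dfs[0] should be the only change needed, provided the assumed labels match.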
I don't know whether you want a solution with pandas. This is one without it, using the more advanced attrs keyword and the standard Python format method to get a formatted table.
Note that the column widths in format are chosen manually and do not adjust to the actual data.
import requests
from bs4 import BeautifulSoup
url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
game_start_times = soup.find_all('td', attrs={"data-stat": "game_start_time", "class": "right"})
visitor_team_names = soup.find_all('td', attrs={"data-stat": "visitor_team_name", "class": "left"})
visitor_ptss = soup.find_all('td', attrs={"data-stat": "visitor_pts", "class": "right"})
home_team_names = soup.find_all('td', attrs={"data-stat": "home_team_name", "class": "left"})
home_pts = soup.find_all('td', attrs={"data-stat": "home_pts", "class": "right"})
for i in range(len(game_start_times)):
    print('{:10s} {:28s} {:5s} {:28s} {:5s}'.format(game_start_times[i].text.strip(),
                                                    visitor_team_names[i].text.strip(),
                                                    visitor_ptss[i].text.strip(),
                                                    home_team_names[i].text.strip(),
                                                    home_pts[i].text.strip()))
8:01 pm    Boston Celtics               99    Cleveland Cavaliers          102
10:30 pm   Houston Rockets              122   Golden State Warriors        121
7:30 pm    Milwaukee Bucks              108   Boston Celtics               100
8:30 pm    Atlanta Hawks                117   Dallas Mavericks             111
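If hard-coded widths are a concern, they can be computed from the data instead. The sketch below is one possible way to do that, using a few rows copied from the output above rather than live data. The helper name format_table is made up for illustration.

```python
# Sample rows copied from the output above:
# (start time, visitor team, visitor points, home team, home points).
rows = [
    ("8:01 pm", "Boston Celtics", "99", "Cleveland Cavaliers", "102"),
    ("10:30 pm", "Houston Rockets", "122", "Golden State Warriors", "121"),
    ("7:30 pm", "Milwaukee Bucks", "108", "Boston Celtics", "100"),
]


def format_table(rows):
    """Return one left-aligned line per row, sized to the widest cell per column."""
    # zip(*rows) turns the list of rows into a sequence of columns.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    # Build a template such as "{:8}  {:15}  {:3}  {:21}  {:3}".
    template = "  ".join("{:%d}" % width for width in widths)
    return [template.format(*row).rstrip() for row in rows]


lines = format_table(rows)
for line in lines:
    print(line)
```

In the scraper itself, rows would be built from the stripped text of the five find_all results before formatting.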
This would work. Tune it to your needs and use pandas afterwards.
import requests
from bs4 import BeautifulSoup
url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
rows = soup.select('#schedule > tbody > tr')
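for row in rows:
    rights = row.find_all("td", "right")
    lefts = row.find_all("td", "left")
    # Skip rows that lack the expected cells (for example, repeated header rows)
    if len(rights) < 3 or len(lefts) < 2:
        continue
    print(rights[0].text, lefts[0].text, rights[1].text, lefts[1].text, rights[2].text)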
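A variation on this row-by-row idea is to key each cell by its data-stat attribute rather than by position, which gives a list of dicts that is easy to pass on to pandas later. This is only a sketch on a hand-written HTML snippet. The snippet imitates the markup shown in the question, and the header row with class thead is an assumption about what the real table may contain.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page; id and data-stat names mirror those used above.
html = """
<table id="schedule"><tbody>
<tr><th data-stat="date_game">Tue, Oct 17, 2017</th>
<td class="right " data-stat="game_start_time">8:01 pm</td>
<td class="left " data-stat="visitor_team_name"><a href="/teams/BOS/2018.html">Boston Celtics</a></td>
<td class="right " data-stat="visitor_pts">99</td>
<td class="left " data-stat="home_team_name"><a href="/teams/CLE/2018.html">Cleveland Cavaliers</a></td>
<td class="right " data-stat="home_pts">102</td></tr>
<tr class="thead"><th>Date</th></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
games = []
for row in soup.select("#schedule > tbody > tr"):
    # Map each cell's data-stat attribute to its visible text.
    cells = {
        td["data-stat"]: td.get_text(strip=True)
        for td in row.find_all("td", attrs={"data-stat": True})
    }
    if "home_pts" in cells:  # skip rows without score cells, such as header rows
        games.append(cells)

print(games)
```

Looking cells up by name avoids depending on how many right-aligned or left-aligned cells a row happens to have.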
For such a simple structure, I would just drop the parsing libraries and do it with re (regular expressions):
first, one findall to get all tr tags
then, one findall to get all td/th tags inside each tr tag
then, one sub to filter out all tags inside the fields (mainly a tags)
#!/usr/bin/python
import re

import requests

url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)
content = r.text  # decoded text, so the str patterns below also work on Python 3

# One dict per <tr>: data-stat name -> cell contents with inner tags removed
data = [
    {
        k: re.sub(r'<.+?>', '', v)
        for (k, v) in re.findall(r'<t[dh].+?data\-stat="(.*?)".*?>(.*?)</t[dh]', tr)
    }
    for tr in re.findall(r'<tr.+?>(.+?)</tr', content)
]

for game in data:
    print("%s" % game['date_game'])
    for info in game:
        print("  %s = %s" % (info, game[info]))
This gives a nice dict structure (data) that can easily be displayed however you like:
$ ./scores_url.py
Tue, Oct 17, 2017
game_remarks =
box_score_text = Box Score
home_team_name = Cleveland Cavaliers
visitor_team_name = Boston Celtics
game_start_time = 8:01 pm
date_game = Tue, Oct 17, 2017
overtimes =
visitor_pts = 99
home_pts = 102
Tue, Oct 17, 2017
game_remarks =
box_score_text = Box Score
home_team_name = Golden State Warriors
visitor_team_name = Houston Rockets
game_start_time = 10:30 pm
date_game = Tue, Oct 17, 2017
overtimes =
visitor_pts = 122
home_pts = 121
Wed, Oct 18, 2017
game_remarks =
box_score_text = Box Score
home_team_name = Boston Celtics
visitor_team_name = Milwaukee Bucks
game_start_time = 7:30 pm
date_game = Wed, Oct 18, 2017
overtimes =
visitor_pts = 108
home_pts = 100
...
or in the style of your example:
cols = [
    ['game_start_time', 15, "Game start time"],
    ['home_team_name', 25, "Home team."],
    ['home_pts', 7, "Score."],
    ['visitor_team_name', 25, "Away team."],
    ['visitor_pts', 7, "Score."]
]
# Header line: each title right-aligned to its column width
print(" ".join(("%%%ds" % col[1]) % col[2] for col in cols))
# One line per game, using the same widths
for game in data:
    print(" ".join(("%%%ds" % col[1]) % game[col[0]] for col in cols))
which gives something like this:
Game start time                Home team.  Score.                Away team.  Score.
        8:01 pm       Cleveland Cavaliers     102            Boston Celtics      99
       10:30 pm     Golden State Warriors     121           Houston Rockets     122
        7:30 pm            Boston Celtics     100           Milwaukee Bucks     108
        8:30 pm          Dallas Mavericks     111             Atlanta Hawks     117
        7:00 pm           Detroit Pistons     102         Charlotte Hornets      90
        7:00 pm            Indiana Pacers     140             Brooklyn Nets     131
        8:00 pm         Memphis Grizzlies     103      New Orleans Pelicans      91
...
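The three regex steps described above can also be tried offline on a small inline sample. The sketch below uses two hand-written rows that imitate the page's markup (an assumption, trimmed for brevity) and slightly tightened patterns of my own choosing, not the exact ones from the script. Regex-based HTML parsing is fragile and may break if the markup changes.

```python
import re

# Two hand-written rows imitating the page's markup (assumed, trimmed for brevity).
html = (
    '<tr><th data-stat="date_game"><a href="/boxscores/">Tue, Oct 17, 2017</a></th>'
    '<td class="right " data-stat="game_start_time">8:01 pm</td>'
    '<td class="left " data-stat="visitor_team_name"><a href="/teams/BOS/2018.html">Boston Celtics</a></td>'
    '<td class="right " data-stat="visitor_pts">99</td>'
    '<td class="left " data-stat="home_team_name"><a href="/teams/CLE/2018.html">Cleveland Cavaliers</a></td>'
    '<td class="right " data-stat="home_pts">102</td></tr>'
    '<tr><th data-stat="date_game"><a href="/boxscores/">Tue, Oct 17, 2017</a></th>'
    '<td class="right " data-stat="game_start_time">10:30 pm</td>'
    '<td class="left " data-stat="visitor_team_name"><a href="/teams/HOU/2018.html">Houston Rockets</a></td>'
    '<td class="right " data-stat="visitor_pts">122</td>'
    '<td class="left " data-stat="home_team_name"><a href="/teams/GSW/2018.html">Golden State Warriors</a></td>'
    '<td class="right " data-stat="home_pts">121</td></tr>'
)

ROW_RE = re.compile(r'<tr[^>]*>(.*?)</tr>', re.S)  # step 1: each table row
CELL_RE = re.compile(                              # step 2: each td/th with a data-stat
    r'<t[dh][^>]*data-stat="([^"]*)"[^>]*>(.*?)</t[dh]>', re.S)
TAG_RE = re.compile(r'<[^>]+>')                    # step 3: any tag left inside a cell

data = [
    {name: TAG_RE.sub('', value) for name, value in CELL_RE.findall(row)}
    for row in ROW_RE.findall(html)
]

for game in data:
    print(game['date_game'], game['visitor_team_name'], game['visitor_pts'],
          game['home_team_name'], game['home_pts'])
```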