From Python to Excel - Building an Excel worksheet
With the help of some very kind people on here, I finally got a working script to scrape some data. I now want to transfer this data from Python to Excel in a specific format. I have tried multiple approaches, but did not manage to get the desired result.
My script is the following:
import requests
from bs4 import BeautifulSoup

def analyze(i):
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}/1"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    arena = soup.find("span", attrs=('name')).text
    title = soup.select_one("[class='team'] .name a").text
    point = soup.select(".result .points")[0].text
    image_titles = ', '.join([i['title'] for i in soup.select("[class$='dead'] > img")])
    title_ano = soup.select("[class='team'] .name a")[1].text
    point_ano = soup.select(".result .points")[1].text
    image_titles_ano = ', '.join([i['title'] for i in soup.select("[class='class'] > img")])
    print((title,point,image_titles),(title_ano,point_ano,image_titles_ano),arena)

for i in range(46270, 46394):
    analyze(i)
To summarize, I scrape a couple of things:
One line of output currently looks like this:
('Thunder', '0 pts', 'roublard, huppermage, ecaflip') ('Tweaps', '60 pts', 'steamer, feca, sacrieur') A10
My goal is to transfer this output to Excel, making it look like this:
To clarify, in terms of the variables I have, it would be this:
Currently I can manage to transfer my data to Excel, but I can't figure out how to format my data this way. Any help would be greatly appreciated :)
First of all, the code that you are using is not actually wholly correct. E.g.:
analyze(46275)
(('Grind', '10 pts', 'roublard, ecaflip'),
('SOLARY', '50 pts', 'enutrof, eniripsa, steamer, eliotrope'), 'A10')
Notice that the first player only has two image titles, and the second one has four. This is incorrect, and it happens because your code assumes that img tags inside an element whose class ends in "dead" belong to the first player, and that the ones inside an element whose class is exactly "class" belong to the second. This happens to be true for your first match (i.e. https://ktarena.com/fr/207-dofus-world-cup/match/46270), but very often it is not true at all. E.g. if I compare my result below with the same method applied to your analyze function, I end up with mismatches in 118 rows out of 248.
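To make the failure mode concrete, here is a minimal sketch on a made-up HTML fragment. The markup is an assumption that only loosely imitates the match page (the real page may well be structured differently); it is arranged so that two classes of the first team are marked "dead". Selecting globally then groups the images by dead/alive status, while searching inside each team element groups them by team:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
HTML = """
<div class="team">
  <div class="name"><a>Grind</a></div>
  <div class="class dead"><img title="roublard"></div>
  <div class="class dead"><img title="ecaflip"></div>
  <div class="class"><img title="enutrof"></div>
</div>
<div class="team">
  <div class="name"><a>SOLARY</a></div>
  <div class="class"><img title="eniripsa"></div>
  <div class="class"><img title="steamer"></div>
  <div class="class"><img title="eliotrope"></div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Global selectors: these split the images by "dead" vs. not dead, not by team.
global_first = [img["title"] for img in soup.select("[class$='dead'] > img")]
global_second = [img["title"] for img in soup.select("[class='class'] > img")]

# Scoped selectors: look for images only inside each team element.
per_team = {
    team.select_one(".name a").get_text(): [img["title"] for img in team.select("img")]
    for team in soup.select("div.team")
}

print(global_first)   # ['roublard', 'ecaflip']
print(global_second)  # ['enutrof', 'eniripsa', 'steamer', 'eliotrope']
print(per_team)       # three titles for each team
```

On this fragment the global selectors reproduce the 2/4 split from the output above, while the scoped version yields three titles per team.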
Here's a suggested rewrite:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def analyze_new(i):
    # You don't need `/1` at the end of the url
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    arena = soup.find('span', class_='name').get_text()
    # find all teams, and look for info inside each team
    teams = soup.find_all('div', class_='team')
    my_teams = [tuple()]*2
    for idx, team in enumerate(teams):
        my_teams[idx] = my_teams[idx] + \
            (team.select(".name a")[0].get_text(),)
        my_teams[idx] = my_teams[idx] + \
            (soup.select(".result .points")[idx].get_text(),)
        my_teams[idx] = my_teams[idx] + \
            (', '.join([img['title'] for img in team.find_all('img')[1:]]),)
    # notice, we need `return` instead of `print` to use the data
    # (unparenthesized unpacking in `return` needs Python 3.8+)
    return *my_teams, arena

print(analyze_new(46275))
(('Grind', '10 pts', 'roublard, ecaflip, enutrof'),
('SOLARY', '50 pts', 'eniripsa, steamer, eliotrope'), 'A10')
Before writing this data to Excel, I would create a pd.DataFrame, which can then be exported very easily:
# capture info per player in a single row
rows = []
for i in range(46270, 46394):
    one, two, arena = analyze_new(i)
    # adding `i` to rows, as "Match" seems like a useful `column` to have!
    # but if not, you can delete `i` here below (N.B. do NOT delete the COMMA!)
    # and cut 'Match' twice below
    rows.append(one+(arena,i))
    rows.append(two+(arena,i))

cols = ['Team','Points', 'Images', 'Arena','Match']

# create df
df = pd.DataFrame(data=rows,columns=cols)

# split up the images strings in `df.Images` and make new columns for them
# (split on ', ', the same separator used in `join`, to avoid leading spaces)
# finally, drop the `df.Images` column itself
df = pd.concat([df,
                df.Images.str.split(', ',expand=True)\
                .rename(columns={i:f'Image Title {i+1}'
                                 for i in range(3)})], axis=1)\
    .drop('Images', axis=1)

# Strip " pts" from the strings in `df.Points` and convert the type to an `int`
df['Points'] = df.Points.str.replace(' pts','').astype(int)

# Re-order the columns
df = df.loc[:, ['Match', 'Arena','Team', 'Image Title 1', 'Image Title 2',
                'Image Title 3', 'Points']]

print(df.head())
   Match  Arena         Team  Image Title 1  Image Title 2  Image Title 3  Points
0  46270    A10      Thunder       roublard     huppermage        ecaflip       0
1  46270    A10       Tweaps        steamer           feca       sacrieur      60
2  46271    A10   Shadow Zoo           feca       osamodas        ouginak       0
3  46271    A10  UndisClosed       eniripsa           sram        pandawa      60
4  46272    A10   Laugh Tale       osamodas        ecaflip            iop       0
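A detail of the split step that is easy to miss in the printed table: the titles were joined with ', ' (comma plus space), so the separator passed to `str.split` should match it. A small sketch on made-up strings in the same shape (the `images` series here is illustrative, not scraped data):

```python
import pandas as pd

# Made-up sample in the same "a, b, c" shape as the joined image titles.
images = pd.Series(["roublard, huppermage, ecaflip", "steamer, feca, sacrieur"])

loose = images.str.split(",", expand=True)   # separator is only the comma
tight = images.str.split(", ", expand=True)  # separator matches the join string

print(loose[1].tolist())  # [' huppermage', ' feca']  (leading space kept)
print(tight[1].tolist())  # ['huppermage', 'feca']
```

Stray leading spaces are invisible in a right-aligned printout, but they would end up in the Excel cells and can break later filtering or lookups.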
# Finally, write the `df` to an Excel file
df.to_excel('fname.xlsx')
Result:
If you dislike the default styles added to the header row and index column, you can write it out like so:
df.T.reset_index().T.to_excel('test.xlsx', index=False, header=False)
Result:
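In case the `df.T.reset_index().T` chain looks cryptic: it moves the column names into the first data row, so that nothing is left for `to_excel` to style as a header. A minimal sketch on a made-up three-column frame (`df_demo` is a stand-in, not the scraped data):

```python
import pandas as pd

# Stand-in frame with a few of the columns used above; the values are illustrative.
df_demo = pd.DataFrame({"Match": [46270, 46270],
                        "Team": ["Thunder", "Tweaps"],
                        "Points": [0, 60]})

# Transpose, turn the (former) column names into a regular column, transpose back.
flat = df_demo.T.reset_index().T

print(flat.iloc[0].tolist())  # ['Match', 'Team', 'Points']
print(flat.shape)             # (3, 3): one extra row holding the old header
```

Note that after this round trip every column holds generic objects rather than typed values. If only the index column bothers you, `df.to_excel('fname.xlsx', index=False)` drops it and keeps the header.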
Incidentally, I assume you have a particular reason for wanting the function to return the relevant data as *my_teams,arena. If not, it would be better to let the function itself do most of the heavy lifting. E.g. we could write something like this, and return a df directly.
def analyze_dict(i):
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    d = {'Match': [i]*2,
         'Arena': [soup.find('span',class_='name').get_text()]*2,
         'Team': [],
         'Image Title 1': [],
         'Image Title 2': [],
         'Image Title 3': [],
         'Points': [],
         }
    teams = soup.find_all('div',class_='team')
    for idx, team in enumerate(teams):
        d['Team'].append(team.select(".name a")[0].get_text())
        d['Points'].append(int(soup.select(".result .points")[idx].get_text().split(' ')[0]))
        # this relies on each team having exactly three images after the first one
        for img_idx, img in enumerate(team.find_all('img')[1:]):
            d[f'Image Title {img_idx+1}'].append(img['title'])
    return pd.DataFrame(d)

print(analyze_dict(46275))
   Match  Arena    Team  Image Title 1  Image Title 2  Image Title 3  Points
0  46275    A10   Grind       roublard        ecaflip        enutrof      10
1  46275    A10  SOLARY       eniripsa        steamer      eliotrope      50
Now, we only need to do the following outside the function:
dfs = []
for i in range(46270, 46394):
    dfs.append(analyze_dict(i))

df = pd.concat(dfs, axis=0, ignore_index=True)
print(df.head())
   Match  Arena         Team  Image Title 1  Image Title 2  Image Title 3  Points
0  46270    A10      Thunder       roublard     huppermage        ecaflip       0
1  46270    A10       Tweaps        steamer           feca       sacrieur      60
2  46271    A10   Shadow Zoo           feca       osamodas        ouginak       0
3  46271    A10  UndisClosed       eniripsa           sram        pandawa      60
4  46272    A10   Laugh Tale       osamodas        ecaflip            iop       0
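The `ignore_index=True` argument matters here because every per-match frame comes back with its own 0, 1 index. A quick sketch with two stand-in frames (names and values are made up) shows the difference:

```python
import pandas as pd

# Two stand-in frames shaped like the per-match output (values are made up).
a = pd.DataFrame({"Match": [1, 1], "Team": ["A", "B"]})
b = pd.DataFrame({"Match": [2, 2], "Team": ["C", "D"]})

kept = pd.concat([a, b], axis=0)                          # original labels repeat
renumbered = pd.concat([a, b], axis=0, ignore_index=True) # fresh 0..n-1 index

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Without it, the index column written to Excel would just alternate between 0 and 1.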
With hardly any changes to the code from your post, you can use the openpyxl library to write the output to an Excel file, as shown below:
import requests
from openpyxl import Workbook
from bs4 import BeautifulSoup

def analyze(i):
    # `ws` is the module-level worksheet created in the `__main__` block below
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}/1"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    arena = soup.find("span", attrs=('name')).text
    title = soup.select_one("[class='team'] .name a").text
    point = soup.select(".result .points")[0].text
    image_titles = [i['title'] for i in soup.select("[class='team']:nth-of-type(1) [class^='class'] > img")]
    # fall back to an empty cell when a team has fewer than three titles
    try:
        image_title_one = image_titles[0]
    except IndexError:
        image_title_one = ""
    try:
        image_title_two = image_titles[1]
    except IndexError:
        image_title_two = ""
    try:
        image_title_three = image_titles[2]
    except IndexError:
        image_title_three = ""
    ws.append([arena,title,image_title_one,image_title_two,image_title_three,point])

    title_ano = soup.select("[class='team'] .name a")[1].text
    point_ano = soup.select(".result .points")[1].text
    image_titles_ano = [i['title'] for i in soup.select("[class='team']:nth-of-type(2) [class^='class'] > img")]
    try:
        image_title_ano_one = image_titles_ano[0]
    except IndexError:
        image_title_ano_one = ""
    try:
        image_title_ano_two = image_titles_ano[1]
    except IndexError:
        image_title_ano_two = ""
    try:
        image_title_ano_three = image_titles_ano[2]
    except IndexError:
        image_title_ano_three = ""
    ws.append([arena,title_ano,image_title_ano_one,image_title_ano_two,image_title_ano_three,point_ano])
    print((title,point,image_titles),(title_ano,point_ano,image_titles_ano),arena)

if __name__ == '__main__':
    wb = Workbook()
    wb.remove(wb['Sheet'])
    ws = wb.create_sheet("result")
    ws.append(['Arena','Team','Image Title 1','Image Title 2','Image Title 3','Points'])
    for i in range(46270, 46290):
        analyze(i)
    wb.save("output.xlsx")
I've fixed the selectors to grab the right number of image titles.
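If the repeated try/except blocks feel heavy, the same "three cells, blank when missing" behaviour can be had from a small helper. This is only a sketch: `pad_titles` and `build_row` are hypothetical names that do not appear in the script above, and the sample values are taken from the output shown in the question.

```python
def pad_titles(titles, size=3, fill=""):
    """Return exactly `size` items: extra titles are cut, missing ones become `fill`."""
    return (list(titles) + [fill] * size)[:size]


def build_row(arena, team, titles, points):
    """Assemble one worksheet row in the Arena, Team, Image Title 1-3, Points order."""
    return [arena, team, *pad_titles(titles), points]


print(pad_titles(["roublard", "ecaflip"]))
# ['roublard', 'ecaflip', '']
print(build_row("A10", "Tweaps", ["steamer", "feca", "sacrieur"], "60 pts"))
# ['A10', 'Tweaps', 'steamer', 'feca', 'sacrieur', '60 pts']
```

Inside `analyze`, each `ws.append([...])` call could then take `build_row(arena, title, image_titles, point)` instead of six hand-built variables.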