From Python to Excel - Building an excel worksheet

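With the help of some very kind people, I finally have a working script that scrapes some data. I now want to transfer this data from Python to Excel in a specific format. I have tried several approaches but have not managed to get the result I want.

My script is the following: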

import requests
from bs4 import BeautifulSoup


def analyze(i):
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}/1"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    arena = soup.find("span", attrs=('name')).text
    title = soup.select_one("[class='team'] .name a").text
    point = soup.select(".result .points")[0].text
    image_titles = ', '.join([i['title'] for i in soup.select("[class$='dead'] > img")])

    title_ano = soup.select("[class='team'] .name a")[1].text
    point_ano = soup.select(".result .points")[1].text
    image_titles_ano = ', '.join([i['title'] for i in soup.select("[class='class'] > img")])

    print((title,point,image_titles),(title_ano,point_ano,image_titles_ano),arena)


for i in range(46270, 46394):  
    analyze(i)

To sum up, I scrape several things:

  • Team names (title & title_ano)
  • Image titles (image_titles & image_titles_ano)
  • Team points (point & point_ano)
  • A string of text (arena)

One line of the output currently looks like this:

('Thunder', '0 pts', 'roublard, huppermage, ecaflip') ('Tweaps', '60 pts', 'steamer, feca, sacrieur') A10

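My goal is to transfer this output to Excel so that it looks like this:

[image: What I want]

To clarify, in terms of the variables I have, it would look like this:

[image: What I want, in terms of my variables]

At the moment I can manage to transfer my data to Excel, but I don't know how to format it this way. Any help would be greatly appreciated :)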

First of all, the code you are using is not entirely correct. For example:

analyze(46275)
(('Grind', '10 pts', 'roublard, ecaflip'), 
('SOLARY', '50 pts', 'enutrof, eniripsa, steamer, eliotrope'), 'A10')

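Note that the first team has only two image titles while the second has four. This is incorrect: your code assumes that img tags inside an element whose class ends with "dead" belong to the first team, and that img tags inside an element whose class is exactly "class" belong to the second. That happens to hold for your first match (i.e. https://ktarena.com/fr/207-dofus-world-cup/match/46270), but in general it is simply not true. For example, if I compare the results below with the same approach applied to your analyze function, I end up with mismatches in 118 out of 248 rows.

Here is a suggested rewrite: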
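To make the selector problem concrete, here is a minimal sketch. It runs on invented, well-formed markup that only imitates what the selectors suggest about the page (the real layout is an assumption), and it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup purely so that it runs without extra dependencies. The first two lookups filter the whole page by class, as the original selectors do; the third collects the images team by team.

```python
import xml.etree.ElementTree as ET

# Invented markup: two teams, where some characters of the first are "dead".
SNIPPET = """
<div>
  <div class="team">
    <div class="class dead"><img title="roublard"/></div>
    <div class="class dead"><img title="ecaflip"/></div>
    <div class="class"><img title="enutrof"/></div>
  </div>
  <div class="team">
    <div class="class"><img title="eniripsa"/></div>
    <div class="class"><img title="steamer"/></div>
    <div class="class"><img title="eliotrope"/></div>
  </div>
</div>
"""

root = ET.fromstring(SNIPPET)

# Page-wide lookup by state: this groups images by dead/alive, not by team.
by_state_dead = [img.get("title")
                 for div in root.iter("div")
                 if div.get("class", "").endswith("dead")
                 for img in div.findall("img")]
by_state_alive = [img.get("title")
                  for div in root.iter("div")
                  if div.get("class") == "class"
                  for img in div.findall("img")]

# Lookup scoped to each team element: one list of titles per team.
by_team = [[img.get("title") for img in team.iter("img")]
           for team in root.findall("div[@class='team']")]

print(by_state_dead)   # ['roublard', 'ecaflip']
print(by_state_alive)  # ['enutrof', 'eniripsa', 'steamer', 'eliotrope']
print(by_team)         # [['roublard', 'ecaflip', 'enutrof'], ['eniripsa', 'steamer', 'eliotrope']]
```

Under this assumed markup, the page-wide lookups reproduce the two-versus-four split shown above, while the per-team lookup yields three titles for each side.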

import requests
from bs4 import BeautifulSoup
import pandas as pd

def analyze_new(i):
    # You don't need `/1` at the end of the url
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    
    arena = soup.find('span',class_='name').get_text()
    
    # find all teams, and look for info inside each team
    teams = soup.findAll('div',class_='team')
    my_teams = [tuple()]*2
    for idx, team in enumerate(teams):
        my_teams[idx] = my_teams[idx] + \
            (team.select(".name a")[0].get_text(),)
        my_teams[idx] = my_teams[idx] + \
            (soup.select(".result .points")[idx].get_text(),)
        my_teams[idx] = my_teams[idx] + \
            (', '.join([img['title'] for img in team.findAll('img')[1:]]),)

    # notice, we need `return` instead of `print` to use the data     
    # parentheses keep the unpacking valid on Python versions before 3.8
    return (*my_teams, arena)

print(analyze_new(46275))
(('Grind', '10 pts', 'roublard, ecaflip, enutrof'), 
('SOLARY', '50 pts', 'eniripsa, steamer, eliotrope'), 'A10')

Before writing this data to Excel, I would create a pd.DataFrame, which can then be exported very easily:

# capture info per player in a single row
rows = []

for i in range(46270, 46394):
    one, two, arena = analyze_new(i)
    # adding `i` to rows, as "Match" seems like a useful `column` to have!
    # but if not, you can delete `i` here below (N.B. do NOT delete the COMMA!)
    # and cut 'Match' twice below
    rows.append(one+(arena,i))
    rows.append(two+(arena,i))

cols = ['Team','Points', 'Images', 'Arena','Match']

# create df
df = pd.DataFrame(data=rows,columns=cols)

# split up the images strings in `df.Images` and make new columns for them
# finally, drop the `df.Images` column itself
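# N.B. split on ', ' (comma + space), the same separator used in the join,
# so the new columns do not start with a stray space
df = pd.concat([df,
                df.Images.str.split(', ',expand=True)\
                    .rename(columns={i:f'Image Title {i+1}' 
                                     for i in range(3)})], axis=1)\
    .drop('Images', axis=1)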

# Strip " pts" from the strings in `df.Points` and convert the type to an `int`
df['Points'] = df.Points.str.replace(' pts','').astype(int)

# Re-order the columns
df = df.loc[:, ['Match', 'Arena','Team', 'Image Title 1', 'Image Title 2', 
                'Image Title 3', 'Points']]

print(df.head())

   Match Arena         Team Image Title 1 Image Title 2 Image Title 3  Points
0  46270   A10      Thunder      roublard    huppermage       ecaflip       0
1  46270   A10       Tweaps       steamer          feca      sacrieur      60
2  46271   A10   Shadow Zoo          feca      osamodas       ouginak       0
3  46271   A10  UndisClosed      eniripsa          sram       pandawa      60
4  46272   A10   Laugh Tale      osamodas       ecaflip           iop       0

# Finally, write the `df` to an Excel file
df.to_excel('fname.xlsx')

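Result:

[image: df_to_excel]

If you don't like the default styles applied to the header row and the index column, you can write the file without them like this: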

df.T.reset_index().T.to_excel('test.xlsx', index=False, header=False)

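Result:

[image: df_to_excel_ex_header_index]

Incidentally, I assume you have a particular reason for wanting the function to return the relevant data as *my_teams, arena. If not, it would be better to let the function itself do most of the heavy lifting. For example, we could write it like this and return a df directly: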

def analyze_dict(i):
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    
    d = {'Match': [i]*2,
               'Arena': [soup.find('span',class_='name').get_text()]*2,
               'Team': [],
               'Image Title 1': [],
               'Image Title 2': [],
               'Image Title 3': [],
               'Points': [],
               }
    
    teams = soup.findAll('div',class_='team')
    for idx, team in enumerate(teams):
        d['Team'].append(team.select(".name a")[0].get_text())
        d['Points'].append(int(soup.select(".result .points")[idx].get_text().split(' ')[0]))
        for img_idx, img in enumerate(team.findAll('img')[1:]):
            d[f'Image Title {img_idx+1}'].append(img['title'])
        
    return pd.DataFrame(d)

print(analyze_dict(46275))

   Match Arena    Team Image Title 1 Image Title 2 Image Title 3  Points
0  46275   A10   Grind      roublard       ecaflip       enutrof      10
1  46275   A10  SOLARY      eniripsa       steamer     eliotrope      50
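One caveat with the dictionary approach: pd.DataFrame needs every list in the dictionary to have the same length, so analyze_dict would raise a ValueError if a team ever had fewer than three images. Whether that occurs in this data set is not something the output above shows. A minimal sketch of one way to guard against it, using a hypothetical helper named pad_titles and made-up team data:

```python
def pad_titles(titles, width=3, fill=None):
    """Return exactly `width` items: drop extras, pad the rest with `fill`."""
    titles = list(titles)[:width]
    return titles + [fill] * (width - len(titles))


# Made-up input: the first team has only two image titles.
teams = {
    "Team A": ["roublard", "ecaflip"],
    "Team B": ["eniripsa", "steamer", "eliotrope"],
}

d = {"Team": [], "Image Title 1": [], "Image Title 2": [], "Image Title 3": []}
for name, titles in teams.items():
    d["Team"].append(name)
    # Every team now contributes one value to each of the three columns.
    for idx, title in enumerate(pad_titles(titles), start=1):
        d[f"Image Title {idx}"].append(title)

print(d["Image Title 3"])  # [None, 'eliotrope']
```

Because every column list ends up with one entry per team, a dictionary built this way can be handed to pd.DataFrame without a length mismatch.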

Now, all we need to do outside the function is the following:

dfs = []

for i in range(46270, 46394):
    dfs.append(analyze_dict(i))

df = pd.concat(dfs, axis=0, ignore_index=True)

print(df.head())


   Match Arena         Team Image Title 1 Image Title 2 Image Title 3  Points
0  46270   A10      Thunder      roublard    huppermage       ecaflip       0
1  46270   A10       Tweaps       steamer          feca      sacrieur      60
2  46271   A10   Shadow Zoo          feca      osamodas       ouginak       0
3  46271   A10  UndisClosed      eniripsa          sram       pandawa      60
4  46272   A10   Laugh Tale      osamodas       ecaflip           iop       0

With hardly any change to the code in your post, you can write the output to an Excel file using the openpyxl library, as follows:

import requests
from openpyxl import Workbook
from bs4 import BeautifulSoup


def analyze(i):
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}/1"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    arena = soup.find("span", attrs=('name')).text
    title = soup.select_one("[class='team'] .name a").text
    point = soup.select(".result .points")[0].text
    image_titles = [i['title'] for i in soup.select("[class='team']:nth-of-type(1) [class^='class'] > img")]
    try:
        image_title_one = image_titles[0]
    except IndexError: image_title_one = ""
    try:
        image_title_two = image_titles[1]
    except IndexError: image_title_two = ""
    try:
        image_title_three = image_titles[2]
    except IndexError: image_title_three = ""
    
    ws.append([arena,title,image_title_one,image_title_two,image_title_three,point])
    
    title_ano = soup.select("[class='team'] .name a")[1].text
    point_ano = soup.select(".result .points")[1].text
    image_titles_ano = [i['title'] for i in soup.select("[class='team']:nth-of-type(2) [class^='class'] > img")]
    try:
        image_title_ano_one = image_titles_ano[0]
    except IndexError: image_title_ano_one = ""
    try:
        image_title_ano_two = image_titles_ano[1]
    except IndexError: image_title_ano_two = ""
    try:
        image_title_ano_three = image_titles_ano[2]
    except IndexError: image_title_ano_three = ""

    ws.append([arena,title_ano,image_title_ano_one,image_title_ano_two,image_title_ano_three,point_ano])
    print((title,point,image_titles),(title_ano,point_ano,image_titles_ano),arena)


if __name__ == '__main__':
    wb = Workbook()
    wb.remove(wb['Sheet'])
    ws = wb.create_sheet("result")
    ws.append(['Arena','Team','Image Title 1','Image Title 2','Image Title 3','Points'])
    for i in range(46270, 46290):  
        analyze(i)
    wb.save("output.xlsx")

I have fixed the selectors so that they fetch the correct number of image titles.
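Two details of the script above may be worth adjusting: the points are written as text such as '60 pts', and analyze depends on a global ws. A minimal sketch of a row builder that returns a plain list with the points as an integer; the names parse_points and team_row are hypothetical, and the returned list is meant to be passed to ws.append(...):

```python
def parse_points(text):
    """Turn a string such as '60 pts' into the integer 60."""
    return int(text.split()[0])


def team_row(arena, team, titles, points, width=3):
    """Build one worksheet row: arena, team, `width` image titles, points."""
    # Pad with empty strings, then cut to `width`; this replaces the
    # repeated try/except IndexError blocks.
    padded = (list(titles) + [""] * width)[:width]
    return [arena, team, *padded, parse_points(points)]


row = team_row("A10", "Tweaps", ["steamer", "feca", "sacrieur"], "60 pts")
print(row)  # ['A10', 'Tweaps', 'steamer', 'feca', 'sacrieur', 60]
```

Returning rows instead of appending inside the function keeps the scraping logic usable on its own, and storing the points as numbers lets Excel sort and sum them.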
