
From Python to Excel - Building an Excel worksheet

With the help of some very kind people on here, I finally got a working script to scrape some data. I now want to transfer this data from Python to Excel in a specific format. I have tried multiple approaches but have not managed to get the desired result.

My script is the following:

import requests
from bs4 import BeautifulSoup


def analyze(i):
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}/1"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    arena = soup.find("span", attrs=('name')).text
    title = soup.select_one("[class='team'] .name a").text
    point = soup.select(".result .points")[0].text
    image_titles = ', '.join([i['title'] for i in soup.select("[class$='dead'] > img")])

    title_ano = soup.select("[class='team'] .name a")[1].text
    point_ano = soup.select(".result .points")[1].text
    image_titles_ano = ', '.join([i['title'] for i in soup.select("[class='class'] > img")])

    print((title,point,image_titles),(title_ano,point_ano,image_titles_ano),arena)


for i in range(46270, 46394):  
    analyze(i)

To summarize, I scrape a few things:

  • Team names (title & title_ano)
  • Image titles (image_titles & image_titles_ano)
  • Team points (point & point_ano)
  • A string of text (arena)

One line of output currently looks like this:

('Thunder', '0 pts', 'roublard, huppermage, ecaflip') ('Tweaps', '60 pts', 'steamer, feca, sacrieur') A10

My goal is to transfer this output to Excel, making it look like this:

[image: what I want]

To clarify, in terms of the variables I have, it would be this:

[image: what I want, in terms of my variables]

Currently I can transfer my data to Excel, but I can't figure out how to format it this way. Any help would be greatly appreciated :)

First of all, the code that you are using is not entirely correct. E.g.:

analyze(46275)
(('Grind', '10 pts', 'roublard, ecaflip'), 
('SOLARY', '50 pts', 'enutrof, eniripsa, steamer, eliotrope'), 'A10')

Notice that the first player only has two image titles, and the second one has four. This is incorrect, and it happens because your code assumes that img tags with a class ending in "dead" belong to the first player, and that those with a class named "class" belong to the second. This happens to be true for your first match (i.e. https://ktarena.com/fr/207-dofus-world-cup/match/46270), but very often it is not true at all. E.g. if I compare my result below with the same method applied to your analyze function, I end up with mismatches in 118 rows out of 248.
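To make the failure mode concrete, here is a minimal sketch on a made-up HTML fragment. The markup is an assumption loosely modelled on the selectors in the question, not a copy of the real page. It shows how document-wide class selectors attribute images by status rather than by team, while selecting inside each team container keeps them together:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two teams, where "dead" marks a status, not a team.
HTML = """
<div class="team">
  <div class="name"><a>Grind</a></div>
  <div class="class dead"><img title="roublard"></div>
  <div class="class dead"><img title="ecaflip"></div>
  <div class="class"><img title="enutrof"></div>
</div>
<div class="team">
  <div class="name"><a>SOLARY</a></div>
  <div class="class"><img title="eniripsa"></div>
  <div class="class"><img title="steamer"></div>
  <div class="class"><img title="eliotrope"></div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Document-wide selectors, as in the question: images are grouped by status.
global_first = [img["title"] for img in soup.select("[class$='dead'] > img")]
global_second = [img["title"] for img in soup.select("[class='class'] > img")]

# Scoped selectors: search inside each team container separately.
per_team = [
    [img["title"] for img in team.select("[class^='class'] > img")]
    for team in soup.select("div.team")
]

print(global_first)   # ['roublard', 'ecaflip']
print(global_second)  # ['enutrof', 'eniripsa', 'steamer', 'eliotrope']
print(per_team)       # [['roublard', 'ecaflip', 'enutrof'], ['eniripsa', 'steamer', 'eliotrope']]
```

With this fragment, the document-wide selectors reproduce the two/four split shown above, whereas the scoped version yields three titles per team.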

Here's a suggested rewrite:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def analyze_new(i):
    # You don't need `/1` at the end of the url
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    
    arena = soup.find('span',class_='name').get_text()
    
    # find all teams, and look for info inside each team
    teams = soup.findAll('div',class_='team')
    my_teams = [tuple()]*2
    for idx, team in enumerate(teams):
        my_teams[idx] = my_teams[idx] + \
            (team.select(".name a")[0].get_text(),)
        my_teams[idx] = my_teams[idx] + \
            (soup.select(".result .points")[idx].get_text(),)
        my_teams[idx] = my_teams[idx] + \
            (', '.join([img['title'] for img in team.findAll('img')[1:]]),)

    # notice, we need `return` instead of `print` to use the data     
    return *my_teams,arena

print(analyze_new(46275))
(('Grind', '10 pts', 'roublard, ecaflip, enutrof'), 
('SOLARY', '50 pts', 'eniripsa, steamer, eliotrope'), 'A10')

Before writing this data to Excel, I would create a pd.DataFrame, which can then be exported very easily:

# capture info per player in a single row
rows = []

for i in range(46270, 46394):
    one, two, arena = analyze_new(i)
    # adding `i` to rows, as "Match" seems like a useful `column` to have!
    # but if not, you can delete `i` here below (N.B. do NOT delete the COMMA!)
    # and cut 'Match' twice below
    rows.append(one+(arena,i))
    rows.append(two+(arena,i))

cols = ['Team','Points', 'Images', 'Arena','Match']

# create df
df = pd.DataFrame(data=rows,columns=cols)

# split up the images strings in `df.Images` and make new columns for them
# finally, drop the `df.Images` column itself
df = pd.concat([df,
                # split on ', ' (the separator used by `join`) to avoid leading spaces
                df.Images.str.split(', ',expand=True)\
                    .rename(columns={i:f'Image Title {i+1}' 
                                     for i in range(3)})], axis=1)\
    .drop('Images', axis=1)

# Strip " pts" from the strings in `df.Points` and convert the type to an `int`
df['Points'] = df.Points.str.replace(' pts','').astype(int)

# Re-order the columns
df = df.loc[:, ['Match', 'Arena','Team', 'Image Title 1', 'Image Title 2', 
                'Image Title 3', 'Points']]

print(df.head())

   Match Arena         Team Image Title 1 Image Title 2 Image Title 3  Points
0  46270   A10      Thunder      roublard    huppermage       ecaflip       0
1  46270   A10       Tweaps       steamer          feca      sacrieur      60
2  46271   A10   Shadow Zoo          feca      osamodas       ouginak       0
3  46271   A10  UndisClosed      eniripsa          sram       pandawa      60
4  46272   A10   Laugh Tale      osamodas       ecaflip           iop       0

# Finally, write the `df` to an Excel file
df.to_excel('fname.xlsx')

Result:

[image: df_to_excel]

If you dislike the default styles added to the header row and index column, you can write the file out like so:

df.T.reset_index().T.to_excel('test.xlsx', index=False, header=False)

Result:

[image: df_to_excel_ex_header_index]
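The transpose one-liner above may look cryptic, so here is a minimal sketch of what it does, using made-up two-column data (the name `flat` is only for illustration). The column labels are moved into the first data row, so that writing with `index=False, header=False` emits them as ordinary, unstyled cells:

```python
import pandas as pd

# Small stand-in frame; the real one has more columns.
df = pd.DataFrame({"Match": [46270, 46270], "Team": ["Thunder", "Tweaps"]})

# T           -> column labels become the index
# reset_index -> the index (the old labels) becomes a regular column
# T           -> transpose back: the old labels are now the first row
flat = df.T.reset_index().T

print(flat.iloc[0].tolist())  # ['Match', 'Team']
print(flat.shape)             # (3, 2)
```

A side effect worth knowing: after the round trip every column has dtype `object`, because labels and values now share the same columns.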

Incidentally, I assume you have a particular reason for wanting the function to return the relevant data as *my_teams,arena. If not, it would be better to let the function itself do most of the heavy lifting. E.g. we could write something like this and return a df directly:

def analyze_dict(i):
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    
    d = {'Match': [i]*2,
               'Arena': [soup.find('span',class_='name').get_text()]*2,
               'Team': [],
               'Image Title 1': [],
               'Image Title 2': [],
               'Image Title 3': [],
               'Points': [],
               }
    
    teams = soup.findAll('div',class_='team')
    for idx, team in enumerate(teams):
        d['Team'].append(team.select(".name a")[0].get_text())
        d['Points'].append(int(soup.select(".result .points")[idx].get_text().split(' ')[0]))
        for img_idx, img in enumerate(team.findAll('img')[1:]):
            d[f'Image Title {img_idx+1}'].append(img['title'])
        
    return pd.DataFrame(d)

print(analyze_dict(46275))

   Match Arena    Team Image Title 1 Image Title 2 Image Title 3  Points
0  46275   A10   Grind      roublard       ecaflip       enutrof      10
1  46275   A10  SOLARY      eniripsa       steamer     eliotrope      50

Now, we only need to do the following outside the function:

dfs = []

for i in range(46270, 46394):
    dfs.append(analyze_dict(i))

df = pd.concat(dfs, axis=0, ignore_index=True)

print(df.head())


   Match Arena         Team Image Title 1 Image Title 2 Image Title 3  Points
0  46270   A10      Thunder      roublard    huppermage       ecaflip       0
1  46270   A10       Tweaps       steamer          feca      sacrieur      60
2  46271   A10   Shadow Zoo          feca      osamodas       ouginak       0
3  46271   A10  UndisClosed      eniripsa          sram       pandawa      60
4  46272   A10   Laugh Tale      osamodas       ecaflip           iop       0

With hardly any changes from your post, you can use the openpyxl library to write the output to an Excel file, as shown below:

import requests
from openpyxl import Workbook
from bs4 import BeautifulSoup


def analyze(i):
    url = f"https://ktarena.com/fr/207-dofus-world-cup/match/{i}/1"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    arena = soup.find("span", attrs=('name')).text
    title = soup.select_one("[class='team'] .name a").text
    point = soup.select(".result .points")[0].text
    image_titles = [i['title'] for i in soup.select("[class='team']:nth-of-type(1) [class^='class'] > img")]
    try:
        image_title_one = image_titles[0]
    except IndexError: image_title_one = ""
    try:
        image_title_two = image_titles[1]
    except IndexError: image_title_two = ""
    try:
        image_title_three = image_titles[2]
    except IndexError: image_title_three = ""
    
    ws.append([arena,title,image_title_one,image_title_two,image_title_three,point])
    
    title_ano = soup.select("[class='team'] .name a")[1].text
    point_ano = soup.select(".result .points")[1].text
    image_titles_ano = [i['title'] for i in soup.select("[class='team']:nth-of-type(2) [class^='class'] > img")]
    try:
        image_title_ano_one = image_titles_ano[0]
    except IndexError: image_title_ano_one = ""
    try:
        image_title_ano_two = image_titles_ano[1]
    except IndexError: image_title_ano_two = ""
    try:
        image_title_ano_three = image_titles_ano[2]
    except IndexError: image_title_ano_three = ""

    ws.append([arena,title_ano,image_title_ano_one,image_title_ano_two,image_title_ano_three,point_ano])
    print((title,point,image_titles),(title_ano,point_ano,image_titles_ano),arena)


if __name__ == '__main__':
    wb = Workbook()
    wb.remove(wb['Sheet'])
    ws = wb.create_sheet("result")
    ws.append(['Arena','Team','Image Title 1','Image Title 2','Image Title 3','Points'])
    for i in range(46270, 46290):  
        analyze(i)
    wb.save("output.xlsx")

I've fixed the selectors to grab the right number of image titles.
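As a side note, the six try/except blocks all do the same job: make sure each row has exactly three image cells. A minimal sketch of a helper that could replace them is shown below; `pad_titles` is a hypothetical name, not part of the script above.

```python
def pad_titles(titles, width=3, fill=""):
    """Return exactly `width` items: drop extras, pad short lists with `fill`."""
    titles = list(titles)[:width]
    return titles + [fill] * (width - len(titles))


# Example: a team with only two image titles still yields three cells.
row = ["A10", "Grind", *pad_titles(["roublard", "ecaflip"]), "10 pts"]
print(row)  # ['A10', 'Grind', 'roublard', 'ecaflip', '', '10 pts']
```

The resulting list can be passed to `ws.append(...)` directly, which keeps the column layout fixed even when a match page lists fewer than three images for a team.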
