
HTML Scraping with Beautiful Soup - Unwanted line breaks

I've been trying to write a script to get data from an HTML page and save it to a .csv file. However, I've run into three minor problems.

First of all, when saving to .csv I get some unwanted line breaks which mess up the output file.

Secondly, players' names (the data concerns NBA players) appear twice.

from bs4 import BeautifulSoup
import requests
import time


teams = ['atlanta-hawks', 'boston-celtics', 'brooklyn-nets']

seasons = []
a=2018
while (a>2016):
    seasons.append(str(a))
    a-=1
print(seasons)  
for season in seasons:

    for team in teams:
        my_url = ' https://www.spotrac.com/nba/'+team+'/cap/'+ season +'/'

        headers = {"User-Agent" : "Mozilla/5.0"}

        response = requests.get(my_url)
        response.content

        soup = BeautifulSoup(response.content, 'html.parser')

        stat_table = soup.find_all('table', class_ = 'datatable')


        my_table = stat_table[0]

        plik = team + season + '.csv'   
        with open (plik, 'w') as r:
            for row in my_table.find_all('tr'):
                for cell in row.find_all('th'):
                    r.write(cell.text)
                    r.write(";")

            for row in my_table.find_all('tr'):
                for cell in row.find_all('td'): 
                    r.write(cell.text)
                    r.write(";")

Also, some of the numbers that are separated by "." are being automatically converted to dates.

Any ideas how I could solve those problems?

Screenshot of output file

I made a few changes to your script. To build the URLs, I'm using string interpolation (instead of concatenation). To get rid of the extra whitespace, I'm using the strip() method that is defined on strings. When it comes to the duplication of names, I selected the <a> tag, then called .text on the BeautifulSoup selector.

# pip install beautifulsoup4
# pip install requests

from bs4 import BeautifulSoup
import requests
import time

teams = ['atlanta-hawks', 'boston-celtics', 'brooklyn-nets']
seasons = [2018, 2017]

for season in seasons:
    for team in teams:
        my_url = f'https://www.spotrac.com/nba/{team}/cap/{season}/'
        headers = {"User-Agent": "Mozilla/5.0"}

        response = requests.get(my_url, headers=headers)

        soup = BeautifulSoup(response.content, 'html.parser')
        stat_table = soup.find_all('table', class_ = 'datatable')
        my_table = stat_table[0]

        csv_file = f'{team}-{season}.csv'
        with open(csv_file, 'w') as r:
            for row in my_table.find_all('tr'):
                for cell in row.find_all('th'):
                    r.write(cell.text.strip())
                    r.write(";")

                for i, cell in enumerate(row.find_all('td')):
                    if i == 0:
                        r.write(cell.a.text.strip())
                    else:
                        r.write(cell.text.strip())
                    r.write(";")
                r.write("\n")
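As a rough illustration of the name-duplication fix: the cell markup below is hypothetical (the real spotrac.com cells may differ), but it shows why .text on the whole cell repeats the name while .text on the nested <a> tag does not.

from bs4 import BeautifulSoup

# Hypothetical cell markup, only to demonstrate .text vs .a.text
cell_html = '<td><a href="/player/1">John Doe</a><span class="hidden">John Doe</span></td>'
cell = BeautifulSoup(cell_html, 'html.parser').td

print(cell.text)    # John DoeJohn Doe  (all descendant text, name twice)
print(cell.a.text)  # John Doe          (only the link text)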

When it comes to Excel converting numbers like 1.31 to dates, that's Excel trying to be smart, and failing. I think when you go to import a CSV, you can choose what column types to use for the data. Check out this guide.
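If you would rather sidestep Excel's type guessing entirely, a minimal sketch (assuming pandas is installed; the file name matches the f-string pattern above) is to read the semicolon-separated file back with every column forced to text:

import pandas as pd

# dtype=str keeps values such as 1.31 as plain strings instead of
# letting them be re-interpreted as dates or floats.
df = pd.read_csv('atlanta-hawks-2018.csv', sep=';', dtype=str)
print(df.head())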

Richard provided a complete answer that works for Python 3.6+. It executes file.write() for every cell, though, which is not necessary, so here's an alternative with str.format() that works for Python versions before 3.6 and writes once per row:

from bs4 import BeautifulSoup
import requests
import time

teams = ['atlanta-hawks', 'boston-celtics', 'brooklyn-nets']
seasons = [2018, 2017]

for season in seasons:
    for team in teams:
        my_url = 'https://www.spotrac.com/nba/{}/cap/{}/'.format(team, season)
        headers = {"User-Agent": "Mozilla/5.0"}

        response = requests.get(my_url, headers=headers)

        soup = BeautifulSoup(response.content, 'html.parser')
        stat_table = soup.find_all('table', class_ = 'datatable')
        my_table = stat_table[0]

        csv_file = '{}-{}.csv'.format(team, season)
        with open(csv_file, 'w') as r:
            for row in my_table.find_all('tr'):
                row_string = ''

                for cell in row.find_all('th'):
                    row_string='{}{};'.format(row_string, cell.text.strip())

                for i, cell in enumerate(row.find_all('td')):
                    cell_string = cell.a.text.strip() if i == 0 else cell.text.strip()
                    row_string='{}{};'.format(row_string, cell_string)

                r.write("{}\n".format(row_string))
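If you don't want to manage the delimiters and trailing semicolons by hand, the standard-library csv module is another option; this is only a sketch of the inner loop, reusing my_table and csv_file from the code above, and it also writes one row per call:

import csv

with open(csv_file, 'w', newline='') as f:
    writer = csv.writer(f, delimiter=';')
    for row in my_table.find_all('tr'):
        cells = [c.text.strip() for c in row.find_all('th')]
        cells += [(c.a.text if i == 0 else c.text).strip()
                  for i, c in enumerate(row.find_all('td'))]
        writer.writerow(cells)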
