
BeautifulSoup - Scraping HTML table on multiple pages

I'm a newbie to Python and BeautifulSoup. I would like to scrape multiple pages into a CSV, but when I try to scrape those 3 links, only the last one is stored in the CSV.

How can I fix my issue?

## importing bs4, requests, fake_useragent and csv modules
from bs4 import BeautifulSoup
import requests
from fake_useragent import UserAgent
import csv

## create an array with URLs
urls = [
'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=750300360&editable_length=10',
'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=030780118&editable_length=10',
'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=620103432&editable_length=10'
]

## initializing the UserAgent object
user_agent = UserAgent()

## starting the loop
for url in urls:
    ## getting the reponse from the page using get method of requests module
    page = requests.get(url, headers={"user-agent": user_agent.chrome})

    ## storing the content of the page in a variable
    html = page.content

    ## creating BeautifulSoup object
    soup = BeautifulSoup(html, "html.parser")
    table = soup.findAll("table", {"class":"table"})[0]
    rows = table.findAll("tr")

with open("test.csv", "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        csv_row = []
        for cell in row.findAll(["td", "th"]):
            csv_row.append(cell.get_text())
        writer.writerow(csv_row)

Thanks a lot!

To simplify reading the rows, you could also give pandas a shot:

import csv
import requests
from bs4 import BeautifulSoup
import pandas as pd


urls = [
'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=750300360&editable_length=10',
'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=030780118&editable_length=10',
'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=620103432&editable_length=10'
]

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

all_data = []
for url in urls:
    page = requests.get(url, headers=headers)

    soup = BeautifulSoup(page.content, "html.parser")
    table = soup.findAll("table", {"class":"table"})[0]
    
    df_table = pd.read_html(str(table))[0]
    
    #add a column with additional info
    df_table['hit'] = soup.find("span", {"class":"c"}).text.strip() 
    
    #store the table in a list of tables
    all_data.append(df_table)

#concat the tables and export them to csv
pd.concat(all_data).to_csv('test.csv',index=False)
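
As a side note: newer pandas releases (2.1 and later) emit a FutureWarning when you pass a literal HTML string to pd.read_html. If you run into that warning, a minimal adjustment is to wrap the string in a StringIO first; a sketch of the changed line:

from io import StringIO

# pandas 2.1+ prefers a file-like object over a raw HTML string
df_table = pd.read_html(StringIO(str(table)))[0]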

In your code, you don't store the rows variable anywhere, so you write only the values from your last URL to the CSV file. This example will write values from all three URLs:

import csv
import requests
from bs4 import BeautifulSoup


urls = [
'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=750300360&editable_length=10',
'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=030780118&editable_length=10',
'https://www.scansante.fr/applications/casemix_ghm_cmd/submit?snatnav=&typrgp=etab&annee=2019&type=ghm&base=0&typreg=noreg2016&noreg=99&finess=620103432&editable_length=10'
]

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

all_data = []
for url in urls:
    page = requests.get(url, headers=headers)

    soup = BeautifulSoup(page.content, "html.parser")
    table = soup.findAll("table", {"class":"table"})[0]

    # here I store all rows to list `all_data`
    for row in table.findAll('tr'):
        tds = [cell.get_text(strip=True, separator=' ') for cell in row.findAll(["td", "th"])]
        all_data.append(tds)
        print(*tds)

# write list `all_data` to CSV
with open("test.csv", "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in all_data:
        writer.writerow(row)
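
Since all_data already holds one list per row, the final loop is equivalent to a single writerows call if you prefer it shorter (same output, just a stylistic variant):

# write list `all_data` to CSV in one call
with open("test.csv", "wt+", newline="") as f:
    csv.writer(f).writerows(all_data)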

Writes test.csv from all three URLs (screenshot from LibreOffice omitted).
