How do I extract table contents from an HTML page using BeautifulSoup in Python?

Question

I am trying to scrape the URL below, and so far I have been able to extract the ul element with the following code.

from bs4 import BeautifulSoup
import csv
import requests

page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
print(page_content.prettify())
print(page_content.ul)  # the first <ul> element on the page

However, my goal is to extract the information contained in the table into a CSV file. How can I do that, starting from my current code?

Answer 1

You can use the Python pandas library to export the data to a CSV file. This is the easiest way.

import pandas as pd
tables=pd.read_html("https://repo.vse.gmu.edu/ait/AIT580/580books.html")
tables[0].to_csv("output.csv",index=False)

To install pandas (pd.read_html additionally needs an HTML parser such as lxml), just run

pip install pandas

Answer 2

Using a list comprehension makes this more concise:

import csv
import requests
from bs4 import BeautifulSoup

page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'

page_response = requests.get(page_link)
page_content = BeautifulSoup(page_response.content, "html.parser")

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for items in page_content.find('table').find_all('tr'):
        data = [item.get_text(strip=True) for item in items.find_all(['th','td'])]
        print(data)
        writer.writerow(data)
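
To try the row-extraction step without fetching the page, here is a minimal sketch that runs the same list comprehension on an inline HTML string. SAMPLE_HTML and table_to_rows are hypothetical names made up for illustration, and the sample markup is an assumption; the real page's table may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only so the example runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Author</th></tr>
  <tr><td> Book A </td><td>Alice</td></tr>
  <tr><td>Book B</td><td>Bob</td></tr>
</table>
"""


def table_to_rows(html):
    """Return the first table in `html` as a list of rows (lists of cell text)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        # No table in the document: return an empty list instead of raising.
        return []
    return [
        # Collect header (<th>) and data (<td>) cells, stripping whitespace.
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


rows = table_to_rows(SAMPLE_HTML)
print(rows)
```

Returning a plain list of rows keeps parsing separate from writing, so the same rows can be passed to csv.writer or inspected first.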

Answer 3

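Although I think KunduK's answer offers an elegant solution using pandas, I would like to give you another approach, since you explicitly asked how to continue from your current code (using the csv module and BeautifulSoup).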

from bs4 import BeautifulSoup
import csv
import requests

new_file = '/path/to/new/file.csv'
page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
table = page_content.find('table')

# Open the file once; newline='' keeps the csv module from adding blank lines on Windows.
with open(new_file, 'w', newline='') as f:
    writer = csv.writer(f)
    for tr in table.find_all('tr'):
        row = []
        # The header row may use <th> instead of <td>, so collect both.
        for cell in tr.find_all(['th', 'td']):
            row.append(cell.get_text(strip=True))
        writer.writerow(row)  # the first row written is the header

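As you can see, we first get the whole table and then iterate over the tr elements and, within each of them, over the cells. In the first iteration over the tr elements, the row serves as the header of the CSV file. After that, every remaining row is written to the file as data.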
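
The header-then-rows idea can also be checked without touching the network or the file system. The sketch below is only an illustration: rows_to_csv_text and sample_rows are hypothetical names and data, and an in-memory io.StringIO buffer stands in for the output file.

```python
import csv
import io


def rows_to_csv_text(rows):
    """Write rows to an in-memory buffer, treating the first row as the header."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    if rows:
        writer.writerow(rows[0])    # first row: header
        writer.writerows(rows[1:])  # remaining rows: data
    return buffer.getvalue()


# Hypothetical data; a cell containing a comma shows the quoting csv.writer applies.
sample_rows = [["Title", "Author"], ["Book, Vol. 1", "Alice"]]
csv_text = rows_to_csv_text(sample_rows)
print(csv_text)
```

Letting csv.writer handle quoting is the reason to prefer it over joining cells with commas by hand, since table cells often contain commas themselves.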

Answer 1: 2 votes, 2019-07-22 14:40:03

Answer 2: 1 vote, 2019-07-22 20:50:01

Answer 3: 0 votes, accepted, 2019-07-22 14:55:41
