
How to extract Table contents from an HTML page using BeautifulSoup in Python?

I am trying to scrape the following URL and so far have been able to use the following code to extract the ul elements.

from bs4 import BeautifulSoup
import csv
import requests
page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
print(page_content.prettify())
print(page_content.ul)

However, my goal is to extract the information contained in the table into a csv file. How can I go about doing this, starting from my current code?

You can use the python pandas library to import the table data and write it to csv, which is the easiest way to do this.

import pandas as pd
tables=pd.read_html("https://repo.vse.gmu.edu/ait/AIT580/580books.html")
tables[0].to_csv("output.csv",index=False)

To install pandas, just use

pip install pandas
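One thing worth knowing is that read_html returns a list of every table it finds on the page, so it pays to inspect the result before writing it out. A minimal sketch (using a hypothetical inline HTML string as a stand-in for the real page, so it runs offline; against the real site you would pass page_link instead):

```python
import io

import pandas as pd

# hypothetical inline table standing in for the fetched page;
# for the real site you would call pd.read_html(page_link) instead
html = ("<table><tr><th>Title</th><th>Author</th></tr>"
        "<tr><td>Book A</td><td>Smith</td></tr></table>")
tables = pd.read_html(io.StringIO(html))

print(len(tables))              # how many tables were matched on the page
print(list(tables[0].columns))  # the <th> row becomes the column index
```

If the page held several tables, you would pick the right one by its index (or use read_html's match parameter) before calling to_csv.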

A slightly cleaner approach using a list comprehension:

import csv
import requests
from bs4 import BeautifulSoup

page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'

page_response = requests.get(page_link)
page_content = BeautifulSoup(page_response.content, "html.parser")

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for items in page_content.find('table').find_all('tr'):
        data = [item.get_text(strip=True) for item in items.find_all(['th','td'])]
        print(data)
        writer.writerow(data)

Although I think KunduK's answer provides an elegant solution using pandas, I would like to give you another approach, since you explicitly asked how to go on from your current code (which uses the csv module and BeautifulSoup).

from bs4 import BeautifulSoup
import csv
import requests

new_file = '/path/to/new/file.csv'
page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
table = page_content.find('table')

for i, tr in enumerate(table.find_all('tr')):
    # collect both <th> (header) and <td> (data) cells
    row = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    if i == 0:  # first row: write the header
        with open(new_file, 'w', newline='') as f:
            writer = csv.DictWriter(f, row)
            writer.writeheader()
    else:  # remaining rows: append the data
        with open(new_file, 'a', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(row)

As you can see, we first fetch the whole table and then iterate over its tr elements, and within each row over its cells. In the first round of the iteration (i == 0), we use the row as the header of our csv file. Subsequently, we write all remaining rows to the csv file.
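The same row-walking logic can be exercised offline; here is a self-contained sketch, with a hypothetical inline table standing in for the downloaded page and an in-memory buffer standing in for the file on disk:

```python
import csv
import io

from bs4 import BeautifulSoup

# hypothetical stand-in for the fetched page
html = """
<table>
  <tr><th>Title</th><th>Author</th></tr>
  <tr><td>Book A</td><td>Smith</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find('table')

buffer = io.StringIO()  # stands in for open(new_file, 'w', newline='')
writer = csv.writer(buffer)
for tr in table.find_all('tr'):
    # the <th> cells make up the header row, the <td> cells the data rows
    writer.writerow(cell.get_text(strip=True) for cell in tr.find_all(['th', 'td']))

print(buffer.getvalue())
```

Writing through a single csv.writer like this also avoids reopening the output file once per row.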
