Attempting to export parsed data to CSV file with Python and I can't figure out how to export more than one row
I'm new to Beautiful Soup / Python / web scraping. I've been able to scrape data from a website, but I can only export the first row to a CSV file (I want to export all of the scraped data to the file).

I'm stuck on how to make this code export all of the scraped data as multiple separate rows:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

r = requests.get("https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses")
data = r.content  # Content of response
soup = BeautifulSoup(data, "html.parser")

for span in soup.find_all("span", {"class": "article"}):
    for link in span.select("a"):
        name_and_date = link.text.split('(')
        name = name_and_date[0].strip()
        date = name_and_date[1].replace(')', '').strip()

        base_url = "https://www.infoplease.com"
        links = link['href']
        links = urljoin(base_url, links)

        pres_data = {'Name': [name],
                     'Date': [date],
                     'Link': [links]
                     }

df = pd.DataFrame(pres_data, columns=['Name', 'Date', 'Link'])
df.to_csv(r'C:\Users\ThinkPad\Documents\data_file.csv', index=False, header=True)
print(df)
Any ideas here? I believe I need to loop through the parsed data, grab each set, and push it in. Am I going about this the right way?

Thanks for any insight.
The way it's currently set up, it looks like you aren't adding each link as a new entry; you're only adding the last link. If you initialize a list and append a dictionary to it, like you have it set up, for each iteration of the "links" for loop, you will add each row and not just the last one.
import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

r = requests.get("https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses")
data = r.content  # Content of response
soup = BeautifulSoup(data, "html.parser")

pres_data = []
for span in soup.find_all("span", {"class": "article"}):
    for link in span.select("a"):
        name_and_date = link.text.split('(')
        name = name_and_date[0].strip()
        date = name_and_date[1].replace(')', '').strip()

        base_url = "https://www.infoplease.com"
        links = link['href']
        links = urljoin(base_url, links)

        this_data = {'Name': name,
                     'Date': date,
                     'Link': links
                     }
        pres_data.append(this_data)

df = pd.DataFrame(pres_data, columns=['Name', 'Date', 'Link'])
df.to_csv(r'C:\Users\ThinkPad\Documents\data_file.csv', index=False, header=True)
print(df)
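To see the fix in isolation, here is a minimal standalone sketch (with made-up names and dates in place of the scraped values) showing that appending one dict per iteration and building the DataFrame once at the end yields one row per entry:

```python
import pandas as pd

# Simulated parse results; in the real script these come from the soup loop
rows = []
for name, date in [("George Washington", "January 8, 1790"),
                   ("John Adams", "November 22, 1797")]:
    rows.append({'Name': name, 'Date': date})

# Building the DataFrame once, after the loop, keeps every appended row
df = pd.DataFrame(rows, columns=['Name', 'Date'])
print(len(df))  # one row per appended dict
```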
You don't need to use pandas here, since you aren't applying any kind of data manipulation. For smaller tasks like this, I usually try to limit myself to the built-in libraries.
import requests
from bs4 import BeautifulSoup
import csv


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    target = [([x.a['href']] + x.a.text[:-1].split(' ('))
              for x in soup.select('span.article')]
    with open('data.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Url', 'Name', 'Date'])
        writer.writerows(target)


main('https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses')
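The csv.writer pattern can also be exercised without hitting the network; this sketch uses a couple of made-up rows and an in-memory buffer in place of data.csv:

```python
import csv
import io

# Made-up rows standing in for the scraped (url, name, date) triples
rows = [['https://example.com/a', 'George Washington', 'January 8, 1790'],
        ['https://example.com/b', 'John Adams', 'November 22, 1797']]

buf = io.StringIO()  # stands in for open('data.csv', 'w', newline='')
writer = csv.writer(buf)
writer.writerow(['Url', 'Name', 'Date'])  # header row
writer.writerows(rows)                    # one CSV row per list entry

print(buf.getvalue())
```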
Sample output: