
Attempting to export parsed data to a CSV file with Python, and I can't figure out how to export more than one row

I'm fairly new to Beautiful Soup/Python/web scraping. I have been able to scrape data from a site, but I am only able to export the very first row to a CSV file (I want to export all scraped data into the file).

I am stumped on how to make this code export ALL scraped data into multiple individual rows:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

r = requests.get("https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses")
data = r.content  # Content of response
soup = BeautifulSoup(data, "html.parser")

for span in soup.find_all("span", {"class": "article"}):
    for link in span.select("a"):
        name_and_date = link.text.split('(')
        name = name_and_date[0].strip()
        date = name_and_date[1].replace(')', '').strip()

        base_url = "https://www.infoplease.com"
        links = urljoin(base_url, link['href'])

    pres_data = {'Name': [name],
                 'Date': [date],
                 'Link': [links]
                 }

    df = pd.DataFrame(pres_data, columns=['Name', 'Date', 'Link'])

    df.to_csv(r'C:\Users\ThinkPad\Documents\data_file.csv', index=False, header=True)

    print(df)

Any ideas here? I believe I need to loop through the parsed data and push each set of values in as I go. Am I going about this the right way?

Thanks for any insight

The way it is currently set up, you are not adding each link as a new entry: the dictionary is built after the inner loop finishes, so it only ever holds the last link. If you initialize a list and append a dictionary like the one you have on each iteration of the inner for loop, you will collect every row and not just the last one.

import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

r = requests.get("https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses")
data = r.content  # Content of response
soup = BeautifulSoup(data, "html.parser")

pres_data = []  # Accumulate one dict per link here
for span in soup.find_all("span", {"class": "article"}):
    for link in span.select("a"):
        # Link text looks like "Name (date)"; split it apart
        name_and_date = link.text.split('(')
        name = name_and_date[0].strip()
        date = name_and_date[1].replace(')', '').strip()

        # Resolve the relative href against the site root
        base_url = "https://www.infoplease.com"
        links = urljoin(base_url, link['href'])

        this_data = {'Name': name,
                     'Date': date,
                     'Link': links
                     }
        pres_data.append(this_data)  # Append inside the loop: one row per link

df = pd.DataFrame(pres_data, columns=['Name', 'Date', 'Link'])

df.to_csv(r'C:\Users\ThinkPad\Documents\data_file.csv', index=False, header=True)

print(df)
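
One caveat worth flagging (an assumption on my part; every link on the current page does appear to carry a "(date)" suffix): if a link's text ever lacks the parenthesized part, name_and_date[1] raises an IndexError. A minimal defensive sketch, using a hypothetical helper:

def split_name_date(text):
    """Split link text like 'George Washington (1790)' into (name, date).

    split_name_date is a hypothetical helper, not part of the code above.
    It falls back to an empty date when no '(...)' part is present.
    """
    parts = text.split('(')
    name = parts[0].strip()
    date = parts[1].replace(')', '').strip() if len(parts) > 1 else ''
    return name, date


print(split_name_date('George Washington (1790)'))  # ('George Washington', '1790')
print(split_name_date('No date here'))               # ('No date here', '')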

You don't need pandas here, since you aren't applying any data operations to the frame.

For a short task like this, try to stick to the built-in libraries.

import requests
from bs4 import BeautifulSoup
import csv


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # Each row is [href, name, date]: drop the trailing ')' from the
    # link text, then split on ' (' to separate the name from the date
    target = [([x.a['href']] + x.a.text[:-1].split(' ('))
              for x in soup.select('span.article')]
    with open('data.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Url', 'Name', 'Date'])
        writer.writerows(target)


main('https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses')
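
If the one-line comprehension reads as dense, here is an equivalent explicit loop: a sketch of the same logic, assuming (as the original does) that every link text ends with a " (date)" suffix:

import requests
from bs4 import BeautifulSoup

url = 'https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

target = []
for x in soup.select('span.article'):
    link_url = x.a['href']                  # relative href from the <a> tag
    name, date = x.a.text[:-1].split(' (')  # drop trailing ')', split on ' ('
    target.append([link_url, name, date])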

Sample of output:

[screenshot of the resulting data.csv]
