
Scraping different variables from multiple URLs into one single CSV file using Python

I'm trying to scrape data from multiple URLs into one single CSV file, and it's driving me crazy ;)

I know this is probably a common problem and that I'm not the first one trying to do this, but somehow I can't manage to apply other people's solutions to my code, because they don't "soup.find" multiple variables one after another the way I do. My approach is probably too basic.

So I started by grabbing several fields (let's go with name, job and worksfor) from a single URL using BeautifulSoup and exporting them to a CSV file, and it works fine:

import csv

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.someurl.com/person.asp?personId=123456789"

page = urlopen(url)
soup = BeautifulSoup(page, "lxml")

name = soup.find("h1", {"class": "name"}).get_text()
job = soup.find("span", {"itemprop": "jobTitle"}).get_text()
worksfor = soup.find("a", {"itemprop": "worksFor"}).get_text()

# newline='' keeps csv.writer from inserting blank rows on Windows
with open('output.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow([name, job, worksfor])
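
With the semicolon delimiter, this produces a single row in output.csv along these lines (values hypothetical):

    Jane Doe;Software Engineer;Acme Corp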

Then I looked up how to open multiple URLs saved in a file (urls.csv) and scrape (here: print) the name from each one. With three URLs in the file, this prints three names:

with open('urls.csv') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urlopen(url)
        soup = BeautifulSoup(site, "lxml")
        # .get_text() returns the tag's text directly; no inner loop needed
        name = soup.find("h1", {"class": "name"}).get_text()
        print(name)
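
For reference, urls.csv here is just a plain text file with one URL per line, for example (IDs hypothetical):

    https://www.someurl.com/person.asp?personId=123456789
    https://www.someurl.com/person.asp?personId=987654321
    https://www.someurl.com/person.asp?personId=555555555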

This also works fine, but I'm having a hard time combining the two approaches into code that delivers a CSV file with one row (name; job; worksfor) for each URL in my urls.csv.

Thank you so much for any suggestions!


@SuperStew: Right, so one of those approaches, which at least didn't produce any errors, was the following:

with open('urls.csv') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urlopen(url)
        soup = BeautifulSoup(site, "lxml")
        name = soup.find("h1", {"class": "name"}).get_text()
        job = soup.find("span", {"itemprop": "jobTitle"}).get_text()
        worksfor = soup.find("a", {"itemprop": "worksFor"}).get_text()
        with open('output.csv', 'w') as csvfile:
            spamwriter = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
            spamwriter.writerow([name, job, worksfor])

This always ends up with the CSV containing only the values from the very last URL in my list, probably because each pass overwrites the previous ones.

Right, so this looks fine except for the last part, where you write the results to the CSV. You basically rewrite the file for each URL, which means only the last one remains when your code is done. To avoid this, open your CSV file in append mode rather than write mode. Just a small change:

with open('output.csv', 'a') as csvfile:
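
For completeness, here's a minimal sketch of the combined loop, assuming the same selectors and a urls.csv with one URL per line. An alternative to append mode is to open output.csv once in write mode before the loop, so the file is created fresh on each run but never truncated between URLs:

    import csv

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # Open the output file once, outside the loop; newline='' avoids
    # blank rows on Windows
    with open('urls.csv') as inf, open('output.csv', 'w', newline='') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
        for line in inf:
            url = line.strip()
            if not url:
                continue  # skip blank lines in urls.csv
            soup = BeautifulSoup(urlopen(url), "lxml")
            name = soup.find("h1", {"class": "name"}).get_text()
            job = soup.find("span", {"itemprop": "jobTitle"}).get_text()
            worksfor = soup.find("a", {"itemprop": "worksFor"}).get_text()
            spamwriter.writerow([name, job, worksfor])  # one row per URL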
