Python BeautifulSoup accounting for missing data on website when writing to csv

I am practicing my web scraping skills on the following website: http://web.californiacraftbeer.com/Brewery-Member

The code I have so far is below. I'm able to grab the fields I want and write the information to CSV, but the information in each row does not match the actual company details. For example, Company A ends up with the contact name for Company D and the phone number for Company E in the same row.

Since some data does not exist for certain companies, how can I account for this when writing rows that should be separated per company to CSV? What is the best way to make sure that I am grabbing the correct information for the correct companies when writing to CSV?

"""
Grabs brewery name, contact person, phone number, website address, and email address 
for each brewery listed.
"""    

import requests, csv
from bs4 import BeautifulSoup    

url = "http://web.californiacraftbeer.com/Brewery-Member"
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
company_name = soup.find_all(itemprop="name")
contact_name = soup.find_all("div", {"class": "ListingResults_Level3_MAINCONTACT"})
phone_number = soup.find_all("div", {"class": "ListingResults_Level3_PHONE1"})
website = soup.find_all("span", {"class": "ListingResults_Level3_VISITSITE"})    

def scraper():
    """Grabs information and writes to CSV"""
    print("Running...")
    results = []
    count = 0
    for company, name, number, site in zip(company_name, contact_name, phone_number, website):
        print("Grabbing {0} ({1})...".format(company.text, count))
        count += 1
        newrow = []
        try:
            newrow.append(company.text)
            newrow.append(name.text)
            newrow.append(number.text)
            newrow.append(site.find('a')['href'])
        except Exception as e: 
            error_msg = "Error on {0}-{1}".format(number.text,e) 
            newrow.append(error_msg)
        results.append(newrow)
    print("Done")
    outFile = open("brewery.csv","w")
    out = csv.writer(outFile, delimiter=',',quoting=csv.QUOTE_ALL, lineterminator='\n')
    out.writerows(results)
    outFile.close()

def main():
    """Runs web scraper"""
    scraper()    

if __name__ == '__main__':
    main()

Any help is very much appreciated!

You need to use a zip to iterate over all these arrays simultaneously:

for company, name, number, site in zip(company_name, contact_name, phone_number, website):
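Note that `zip` pairs items purely by position and stops at the shortest list, so a single missing field shifts every later row onto the wrong company. A quick self-contained demonstration with hypothetical sample data:

```python
# zip pairs by position, so one missing value misaligns every later row.
names = ["Company A", "Company B", "Company C"]
phones = ["111-1111", "333-3333"]  # suppose Company B has no phone listed

for n, p in zip(names, phones):
    print(n, p)
# Company A is paired correctly, but Company C's phone ends up attached to
# Company B, and Company C is dropped entirely because zip stops at the
# shortest list.
```

This is why pairing four separately scraped lists only works when every company has every field.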

Thanks for the help.

I realized that since each company's details are contained in a div with the class "ListingResults_All_CONTAINER ListingResults_Level3_CONTAINER", I could write a nested for-loop that iterates through each of these divs and grabs the information I want from within that div.
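A minimal sketch of that per-container approach, using the class names from the question. Since any field may be missing, each lookup is guarded against `None` and an empty string is written instead, keeping every row aligned with its company (the helper name `parse_companies` and the sample output columns are my own choices, not from the site):

```python
import csv

from bs4 import BeautifulSoup


def parse_companies(html):
    """Return one [name, contact, phone, website] row per company container,
    with blank strings for any missing fields."""
    # html.parser is used here so the sketch needs no extra parser install;
    # the original code's "lxml" works the same way.
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Each company's details live inside one container div, so iterating
    # per container keeps every field tied to the correct company.
    for card in soup.find_all("div", class_="ListingResults_Level3_CONTAINER"):
        name = card.find(itemprop="name")
        contact = card.find("div", class_="ListingResults_Level3_MAINCONTACT")
        phone = card.find("div", class_="ListingResults_Level3_PHONE1")
        site = card.find("span", class_="ListingResults_Level3_VISITSITE")
        link = site.find("a") if site else None
        rows.append([
            name.get_text(strip=True) if name else "",
            contact.get_text(strip=True) if contact else "",
            phone.get_text(strip=True) if phone else "",
            link["href"] if link and link.has_attr("href") else "",
        ])
    return rows


if __name__ == "__main__":
    import requests

    res = requests.get("http://web.californiacraftbeer.com/Brewery-Member")
    with open("brewery.csv", "w", newline="") as f:
        csv.writer(f, quoting=csv.QUOTE_ALL).writerows(parse_companies(res.content))
```

Because a container that lacks a phone number still produces a row with an empty phone column, the rows can no longer shift relative to each other the way the zipped lists did.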
