
How to extract data from all urls, not just the first

This script is generating a csv with the data from only one of the urls fed into it. There are meant to be 98 sets of results; however, the for loop isn't getting past the first url.

I've been working on this for 12+ hours today. What am I missing in order to get the correct results?

import requests
import re
from bs4 import BeautifulSoup
import csv

#Read csv
csvfile = open("gyms4.csv")
csvfilelist = csvfile.read()

#Get page data for each url
def get_page_data():
    for url in csvfilelist.splitlines():
        r = requests.get(url.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        return soup

pages = get_page_data()
print pages

with open("gyms4.csv") as url_file:
    for page in get_page_data(url_file):
        name = page.find("span",{"class":"wlt_shortcode_TITLE"}).text
        address = page.find("span",{"class":"wlt_shortcode_map_location"}).text
        phoneNum = page.find("span",{"class":"wlt_shortcode_phoneNum"}).text
        email = page.find("span",{"class":"wlt_shortcode_EMAIL"}).text

th = pages.find('b',text="Category")
td = th.findNext()
for link in td.findAll('a',href=True):
    match = re.search(r'http://(\w+).(\w+).(\w+)', link.text)
    if match:
        web_address = link.text

gyms = [name,address,phoneNum,email,web_address]
gyms.append(gyms)

#Saving specific listing data to csv
with open ("xgyms.csv", "w") as file:
    writer = csv.writer(file)
    for row in gyms:
        writer.writerow([row])

You have three for-loops in your code and do not specify which one causes the problem. I assume it is the one in the get_page_data() function.

You leave the loop on the very first iteration because of the return statement. That is why you never get to the second URL.

There are at least two possible solutions:

  1. Append the parsed page for every url to a list and return that list (see the sketch below).
  2. Move your processing code into the loop and append the parsed data to gyms inside the loop.
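
For option 1 that could look roughly like this (a minimal sketch; it assumes gyms4.csv lists one URL per line and it passes the file in as an argument rather than using your global variable):

import requests
from bs4 import BeautifulSoup

def get_page_data(urls):
    #Collect the parsed page for every url and return them all together
    soups = []
    for url in urls:
        r = requests.get(url.strip())
        soups.append(BeautifulSoup(r.text, 'html.parser'))
    return soups    # return happens only after ALL urls have been fetched

with open("gyms4.csv") as url_file:
    pages = get_page_data(url_file)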

As Alex.S said, get_page_data() returns on the first iteration, hence subsequent URLs are never accessed. Furthermore, the code that extracts data from the page needs to be executed for each page downloaded, so it needs to be in a loop too. You could turn get_page_data() into a generator and then iterate over the pages like this:

def get_page_data(urls):
    for url in urls:
        r = requests.get(url.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        yield soup    # N.B. use yield instead of return

with open("gyms4.csv") as url_file:
    for page in get_page_data(url_file):
        name = page.find("span",{"class":"wlt_shortcode_TITLE"}).text
        address = page.find("span",{"class":"wlt_shortcode_map_location"}).text
        phoneNum = page.find("span",{"class":"wlt_shortcode_phoneNum"}).text
        email = page.find("span",{"class":"wlt_shortcode_EMAIL"}).text
        # etc. etc.

You can write the data to the CSV file as each page is downloaded and processed, or you can accumulate the data into a list and write it all in one go with csv.writer.writerows().
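
For example, the accumulate-then-write variant could look like this (a rough sketch reusing the get_page_data() generator above; the exact columns and file handling are only illustrative):

import csv

rows = []
with open("gyms4.csv") as url_file:
    for page in get_page_data(url_file):
        name = page.find("span",{"class":"wlt_shortcode_TITLE"}).text
        address = page.find("span",{"class":"wlt_shortcode_map_location"}).text
        phoneNum = page.find("span",{"class":"wlt_shortcode_phoneNum"}).text
        email = page.find("span",{"class":"wlt_shortcode_EMAIL"}).text
        rows.append([name, address, phoneNum, email])

with open("xgyms.csv", "w") as out_file:
    writer = csv.writer(out_file)
    writer.writerows(rows)    # write all accumulated rows in one call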

Also, you should pass the URL list to get_page_data() rather than accessing it from a global variable.
