
How do I use a for loop to get multiple links from an HTML page?

This is what I have at the moment:

import bs4
import requests

def getXkcdComic(comicUrl):
    for i in range(0,20):
        res = requests.get(comicUrl + str(1882 - i))
        res.raise_for_status()

        soup = bs4.BeautifulSoup(res.text, 'html.parser')
        img = soup.select_one("div#comic > img")
        return str(img['src'])


link = getXkcdComic('https://xkcd.com/')

print(link)

It parses the HTML and gets one link, the first one. Since I know the URL ends in 1882 and the next one I want is 1881, I wrote this for loop to get the rest. It only prints one result, as if there were no loop at all. Strangely, if I reduce the indentation of the return statement, it returns a different URL.

I don't quite understand how for loops work yet. Also, this is my first post ever here, so forgive my English and ignorance.

The first time you hit a return statement, the function is going to return, regardless of whether you're in a loop. So your for loop is going to get to the end of the first iteration, see the return, and that's it. The other 19 iterations won't run.

The reason you get a different URL if you dedent the return is that your for loop can now run to completion. But since you didn't save the results of the previous iterations, it will return only the last one.
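To see both behaviours without making any network requests, here is a minimal sketch. The function names first_only and last_only and the doubling step are made up for illustration; they stand in for the request-and-parse work in the question.

```python
def first_only(numbers):
    for n in numbers:
        return n * 2      # return inside the loop: exits on the first iteration


def last_only(numbers):
    for n in numbers:
        doubled = n * 2   # overwritten on every iteration
    return doubled        # return after the loop: only the last value survives


print(first_only([1, 2, 3]))  # 2
print(last_only([1, 2, 3]))   # 6
```

Neither version gives back all three results, which is the same situation as the two indentation levels of return in the question.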

What it looks like you might want is to build a list of results, and return that.

def getXkcdComic(comicUrl):
    images = []                           # Create an empty list for results
    for i in range(0,20):
        res = requests.get(comicUrl + str(1882 - i))
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, 'html.parser')
        img = soup.select_one("div#comic > img")
        images.append(str(img['src']))    # Save the result by adding it to the list
    return images                         # Return the list

Just remember then that link in your outer scope will actually be a list of links, and handle it accordingly.
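As a rough sketch of what handling it accordingly could look like, assuming the function now returns a list of strings (the URLs below are placeholders, not real comic links):

```python
# Stand-in for the list returned by getXkcdComic(); placeholder values only.
links = [
    "//example.com/comics/a.png",
    "//example.com/comics/b.png",
    "//example.com/comics/c.png",
]

# Print one link per line instead of the list's bracketed repr.
for url in links:
    print(url)

# Or build a single newline-separated string in one step.
report = "\n".join(links)
```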

Your function returns control to the caller as soon as it encounters the return statement, which here happens in the first iteration of the for loop.

You can yield instead of return in your function to produce image links successively from the function and keep the for loop running:

import bs4
import requests

def getXkcdComic(comicUrl):
    for i in range(0,20):
        ...
        yield img['src']  # <- here

# make a list of links yielded by function
links = list(getXkcdComic('https://xkcd.com/')) 
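A generator function does not run its body when called; it runs only as values are requested. The sketch below shows this without any network access. The countdown function is a hypothetical stand-in for the request-and-parse loop, not part of the original code.

```python
def countdown(start, count):
    """Yield `count` numbers, counting down from `start`."""
    for i in range(count):
        yield start - i   # pause here and hand one value to the caller


gen = countdown(1882, 3)  # nothing in the body has run yet
first = next(gen)         # runs up to the first yield
rest = list(gen)          # consumes the remaining values

print(first)  # 1882
print(rest)   # [1881, 1880]
```

Wrapping the call in list(), as above, collects everything at once; iterating with a for loop instead would let you process each link as soon as it is fetched.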

References:

  1. Understanding Generators in Python

  2. Python yield expression

When 'return' executes during the first iteration of the loop, the entire 'getXkcdComic' function exits and returns.

Something like this may work and print like the original code intended:

import bs4
import requests

def getXkcdComic(comicUrl, number):
    res = requests.get(comicUrl + str(number))
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    return str(soup.select_one("div#comic > img")['src'])

link = 'https://xkcd.com/'
for i in range(20):
    print(getXkcdComic(link, 1882-i))

How do you expect to get multiple outputs (URLs here) from a single method call? The for loop lets you iterate over a range and produce multiple results, but it's of no use while the method returns once per call. You can do one of the following:

  • Instead of writing a loop inside the method, call the method in a loop. That way your output will be printed for each call.
  • Write the entire thing in the method so that you have multiple print statements.

Do the following:

def getXkcdComic(comicUrl):
    for i in range(0,20):
        res = requests.get(comicUrl + str(1882 - i))
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, 'html.parser')
        img = soup.select_one("div#comic > img")
        print(str(img['src']))

getXkcdComic('https://xkcd.com/')

This happens because you return inside the loop. Try this instead:

def getXkcdComic(comicUrl):
    links = list()                        # collected image links
    for i in range(0,20):
        res = requests.get(comicUrl + str(1882 - i))
        res.raise_for_status()

        soup = bs4.BeautifulSoup(res.text, 'html.parser')
        img = soup.select_one("div#comic > img")
        links.append(str(img['src']))
    return links

You can also change this:

for i in range(0,20):
    res = requests.get(comicUrl + str(1882 - i))

to this, which counts down from 1882 to 1863 directly:

for i in range(1882, 1862, -1):
    res = requests.get(comicUrl + str(i))

The other answers are good and general, but for this specific case there's an even better way. xkcd provides a JSON API, so you can use a list comprehension:

def getXkcdComic(comicUrl):
    return [requests.get(comicUrl + str(1882 - i) + '/info.0.json').json()['img']
            for i in range(0,20)]

This is also faster and more friendly to the xkcd servers.
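If you go the JSON route, one possible refinement is to reuse a single requests.Session so the connection is kept open between requests, and to add a timeout and status check. This is only a sketch: get_comic_images, its parameters, and the 10-second timeout are my own choices, not anything xkcd prescribes.

```python
def get_comic_images(session, newest, count, base_url="https://xkcd.com/"):
    """Return the 'img' field for `count` comics, counting down from `newest`.

    `session` is expected to behave like a requests.Session.
    """
    images = []
    for number in range(newest, newest - count, -1):
        url = base_url + str(number) + "/info.0.json"
        res = session.get(url, timeout=10)  # timeout value is an arbitrary choice
        res.raise_for_status()              # fail loudly on HTTP errors
        images.append(res.json()["img"])
    return images


# Usage (performs real HTTP requests):
#   import requests
#   with requests.Session() as session:
#       for link in get_comic_images(session, 1882, 20):
#           print(link)
```

Passing the session in as an argument also makes the function easy to try out with a fake session object, without touching the network.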
