
How do I use a for loop to get multiple links from an HTML page?

This is what I have at the moment:

import bs4
import requests

def getXkcdComic(comicUrl):
    for i in range(0,20):
        res = requests.get(comicUrl + str(1882 - i))
        res.raise_for_status()

        soup = bs4.BeautifulSoup(res.text, 'html.parser')
        img = soup.select_one("div#comic > img")
        return str(img['src'])


link = getXkcdComic('https://xkcd.com/')

print(link)

It parses the HTML and gets one link, the first one. Since I know the URL ends in 1882 and the next one I want is 1881, I wrote this for loop to get the rest. It only prints one result, as if no loop had been written. Strangely, if I reduce the indentation of the return statement, it returns a different URL.

I don't quite get how for loops work yet. Also, this is my first post ever here, so forgive my English and ignorance.

The first time you hit a return statement, the function is going to return, regardless of whether you're in a loop. So your for loop is going to get to the end of the first iteration, see the return, and that's it. The other 19 iterations won't run.

The reason you get a different URL if you dedent the return is that your for loop can now run to completion. But since you didn't save the results of any previous iterations, it will return only the last one.
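The two behaviours described above can be reproduced without any network access. The following is a minimal sketch; the function names and the short list of comic numbers are made up for illustration and are not part of the original code.

```python
def return_inside_loop(numbers):
    for n in numbers:
        return n          # exits the whole function during the first iteration
    return None           # only reached if numbers is empty


def return_after_loop(numbers):
    last = None
    for n in numbers:
        last = n          # each iteration overwrites the previous value
    return last           # only the value from the final iteration survives


comic_numbers = [1882, 1881, 1880]
print(return_inside_loop(comic_numbers))  # 1882
print(return_after_loop(comic_numbers))   # 1880
```

The first function mirrors the original indentation, the second mirrors the dedented return.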

It looks like what you might want is to build a list of results, and return that.

def getXkcdComic(comicUrl):
    images = []                           # Create an empty list for results
    for i in range(0,20):
        res = requests.get(comicUrl + str(1882 - i))
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, 'html.parser')
        img = soup.select_one("div#comic > img")
        images.append(str(img['src']))    # Save the result by adding it to the list
    return images                         # Return the list

Just remember that link in your outer scope will then actually be a list of links, and handle it accordingly.
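One possible way to handle the list is to loop over it and print each entry on its own line. This is only a sketch: print_links is a hypothetical helper, and the two strings are placeholders standing in for whatever getXkcdComic actually returns.

```python
def print_links(links):
    """Print each link on its own line and return how many were printed."""
    for position, link in enumerate(links, start=1):
        print(f"{position}: {link}")
    return len(links)


# Placeholder values, not real image URLs
example_links = ["//imgs.example/first.png", "//imgs.example/second.png"]
printed = print_links(example_links)
```

Calling print(example_links) directly would also work, but it prints the whole list on one line, brackets and quotes included.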

Your function returns control to the caller once it encounters the return statement, here in the first iteration of the for loop.

You can yield instead of return in your function to produce image links successively from the function and keep the for loop running:

import bs4
import requests

def getXkcdComic(comicUrl):
    for i in range(0,20):
        ...
        yield img['src']  # <- here

# make a list of links yielded by function
links = list(getXkcdComic('https://xkcd.com/')) 
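To see how yield keeps the loop alive between values, here is a small sketch that does not touch the network. The count_down generator is a hypothetical stand-in for the scraping function: it yields numbers instead of image links.

```python
def count_down(start, count):
    """Yield `count` numbers, starting at `start` and decreasing by one."""
    for i in range(count):
        print(f"producing {start - i}")
        yield start - i   # pauses here until the caller asks for the next value


numbers = count_down(1882, 3)   # nothing runs yet; this only creates the generator
first = next(numbers)           # runs the loop body once
rest = list(numbers)            # runs the remaining iterations
print(first, rest)              # 1882 [1881, 1880]
```

Because the generator produces values on demand, the caller can stop early without running the remaining iterations.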

References:

  1. Understanding Generators in Python

  2. Python yield expression

When you reach 'return' during the first loop iteration, the entire 'getXkcdComic' function exits and returns.

Something like this may work and print what the original code intended:

import bs4
import requests

def getXkcdComic(comicUrl, number):
    res = requests.get(comicUrl + str(number))
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    return str(soup.select_one("div#comic > img")['src'])

link = 'https://xkcd.com/'
for i in range(20):
    print(getXkcdComic(link, 1882-i))

How do you expect to get multiple outputs (URLs here) with a single method call? The for loop helps you iterate over a range and get multiple results, but it is of no use as long as a single call returns a single value. You can do one of the following:

  • Instead of writing a loop inside the method, call the method in a loop. That way your output will be printed for each call.
  • Write the entire thing in the method so that the print statement runs multiple times.

Do the following:

def getXkcdComic(comicUrl):
    for i in range(0,20):
        res = requests.get(comicUrl + str(1882 - i))
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, 'html.parser')
        img = soup.select_one("div#comic > img")
        print(str(img['src']))  # print each link as soon as it is found

getXkcdComic('https://xkcd.com/')

This happens because you return inside the loop. Try this:

def getXkcdComic(comicUrl):
    links = list()                        # separate name, so the response does not overwrite the list
    for i in range(0,20):
        res = requests.get(comicUrl + str(1882 - i))
        res.raise_for_status()

        soup = bs4.BeautifulSoup(res.text, 'html.parser')
        img = soup.select_one("div#comic > img")
        links.append(str(img['src']))
    return links

And you can change this:

    for i in range(0,20):
        res = requests.get(comicUrl + str(1882 - i))

to this, which visits the same 20 comics (1882 down to 1863) in the same order:

    for i in range(1882, 1862, -1):
        res = requests.get(comicUrl + str(i))

The other answers are good and general, but for this specific case there's an even better way. xkcd provides a JSON API, so you can use a list comprehension:

def getXkcdComic(comicUrl):
    return [requests.get(comicUrl + str(1882 - i) + '/info.0.json').json()['img']
            for i in range(0,20)]

This is also faster and more friendly to the xkcd servers.
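The same idea can be split into two steps, which makes the URL building easy to check before any request is sent. This is a rough sketch, not the answer's own code: build_info_urls and fetch_image_links are hypothetical names, and the standard library's urllib is swapped in for requests so the example has no third-party dependency. The info.0.json path and the 'img' key are taken from the snippet above.

```python
import json
from urllib.request import urlopen


def build_info_urls(base_url, newest, count):
    """Return the JSON endpoint URLs for `count` comics, newest first."""
    return [f"{base_url}{newest - i}/info.0.json" for i in range(count)]


def fetch_image_links(info_urls):
    """Download each JSON document and collect its 'img' field."""
    links = []
    for url in info_urls:
        with urlopen(url) as response:    # performs a real HTTP request
            links.append(json.load(response)["img"])
    return links


urls = build_info_urls("https://xkcd.com/", 1882, 3)
print(urls[0])  # https://xkcd.com/1882/info.0.json
```

Calling fetch_image_links(urls) would then perform the downloads; it is left out of the sketch so that running it does not hit the network.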
