
Use BeautifulSoup to loop through and retrieve specific URLs

I want to use BeautifulSoup to repeatedly retrieve the URL at a specific position. Imagine there are 4 different URL lists, each containing 100 different URL links.

I need to get and print the 3rd URL on every list, where the previously retrieved URL (e.g. the 3rd URL on the first list) leads to the 2nd list (from which I again need to get and print the 3rd URL, and so on until the 4th retrieval).

Yet my loop only produces the first result (the 3rd URL on list 1), and I don't know how to feed the new URL back into the while loop and continue the process.

Here is my code:

import urllib.request
import json
import ssl
from bs4 import BeautifulSoup


num=int(input('enter count times: ' ))
position=int(input('enter position: ' ))

url='https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
print (url)

count=0
order=0
while count<num:
    context = ssl._create_unverified_context()
    htm=urllib.request.urlopen(url, context=context).read()
    soup=BeautifulSoup(htm)
    for i in soup.find_all('a'):
        order+=1
        if order ==position:
            x=i.get('href')
            print (x)
    count+=1
    url=x        
print ('done')

This is a good problem on which to use recursion. Try calling a recursive function to do it:

import requests
from bs4 import BeautifulSoup

def retrieve_urls_recur(url, position, index, deepness):
    # Stop once we have followed the link `deepness` times
    if index >= deepness:
        return True
    else:
        plain_text = requests.get(url).text
        soup = BeautifulSoup(plain_text, 'html.parser')
        links = soup.find_all('a')
        desired_link = links[position].get('href')
        print(desired_link)
        # Follow the retrieved link and repeat at the same position
        return retrieve_urls_recur(desired_link, position, index + 1, deepness)

and then call it with the desired parameters, in your case:

retrieve_urls_recur(url, 2, 0, 4)

Here 2 is the index of the URL in the list of links, 0 is the starting counter, and 4 is how deep you want to go recursively.

ps: I am using requests instead of urllib, and I didn't test this, although I recently used a very similar function with success.
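
For completeness, here is a minimal, untested sketch of how the recursive helper defined above could be wired to the inputs from the question (the starting URL and the prompts are taken from the question's code; treating the entered position as 1-based is an assumption based on how the original loop counts links):

# Driver using the recursive helper defined above
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
num = int(input('enter count times: '))        # how many links to follow in a row
position = int(input('enter position: '))      # 1-based position entered by the user

# find_all() is 0-indexed, so subtract 1 from the user's position
retrieve_urls_recur(url, position - 1, 0, num)
print('done')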

Just get the link from find_all() by index:

while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()

    soup = BeautifulSoup(htm, 'html.parser')
    # Take the link at `position` directly and make it the next URL to open
    url = soup.find_all('a')[position].get('href')

    count += 1
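
Put together with the question's setup, a minimal runnable sketch of this approach might look like the following (the starting URL and input prompts come from the question; printing each retrieved URL and treating the entered position as 1-based are assumptions about the intended behaviour):

import ssl
import urllib.request
from bs4 import BeautifulSoup

num = int(input('enter count times: '))
position = int(input('enter position: '))

url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
print(url)

count = 0
while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()

    soup = BeautifulSoup(htm, 'html.parser')
    # find_all() is 0-indexed, so position - 1 picks the Nth link the user asked for
    url = soup.find_all('a')[position - 1].get('href')
    print(url)

    count += 1

print('done')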
