Retrieve links from web page using BeautifulSoup

I am trying to pull the link at a given position on a web page, open that link, and repeat the process a specified number of times. The problem is that I keep getting the same URL returned, so it seems my code is just pulling the tag and printing it without opening it, and doing that X times before exiting.

I have written and rewritten this code a number of times, but I just can't figure it out. Please tell me what I am doing wrong.

I tried putting the anchor tags in a list, opening the URL at the requested position in the list, and then clearing the list before starting the loop over again.

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#url = input('Enter - ')
url = "http://py4e-data.dr-chuck.net/known_by_Fikret.html"
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

count = 0 
url_loop = int(input("Enter how many times to loop through: ")) 
url_pos= int(input("Enter position of URL: "))
url_pos = url_pos - 1

print(url_pos)



# Retrieve all of the anchor tags
tags = soup('a')
while True:
    if url_loop == count:
        break
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    url = tags[url_pos].get('href', None)

    print("Acquiring URL: ", url)

    count = count + 1  

print("final URL:", url)

The tags are extracted only once, from the initial document, before the loop starts:

# Retrieve all of the anchor tags
tags = soup('a')

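Inside the loop, soup is rebuilt from each newly fetched page, but tags still holds the anchors of the first page, so tags[url_pos] returns the same URL on every pass. If you re-extract the tags after fetching each document, they will reflect the most recently fetched document.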
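As a minimal sketch of that fix, the loop can be wrapped in a function that re-runs soup('a') on every pass. The names follow_links and fetch_html and the fetch parameter are hypothetical, introduced here only for illustration; the sketch also leaves out the SSL context from the question, which can be passed to urlopen again if needed, and it takes the position as a 1-based number rather than subtracting 1 beforehand.

```python
import urllib.request
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def follow_links(start_url, position, repeats, fetch=fetch_html):
    """Follow the link at 1-based `position` on each page, `repeats` times.

    `fetch` is an assumed hook so the page source can be swapped out,
    for example with an in-memory dictionary while experimenting.
    """
    url = start_url
    for _ in range(repeats):
        soup = BeautifulSoup(fetch(url), 'html.parser')
        # Re-extract the anchors from the page that was just fetched,
        # instead of reusing the list built from the first page.
        tags = soup('a')
        url = tags[position - 1].get('href', None)
        print("Acquiring URL:", url)
    return url
```

With the values from the question, the call would look like follow_links("http://py4e-data.dr-chuck.net/known_by_Fikret.html", url_pos, url_loop), where url_pos is the position as typed by the user, before the subtraction of 1.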

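Notice: The technical posts on this site follow the CC BY-SA 4.0 license. If you republish this content, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.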

 