Retrieve links from a web page using Python and BeautifulSoup, then select the 3rd link and repeat 4 times

Question

Here is the code.

import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve all of the anchor tags

tags = soup('a')
for tag in tags:
    print tag.get('href', None)

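There are 18 links. Now I need to take the link at position 3, that is, the third link in the output, feed it back in as the URL, run the code again, and do this 4 times in total. Then, from the final output, I need to print the name in the link at position 3.

https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html

The page above returns 18 links. We need to select the 3rd link, supply it as the new value of 'url', and repeat that loop 4 times. Just as 'Fikret' is the name in the first link, the name in the last link retrieved is our output. I hope this helps. Thanks for looking into it.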

Answer 1

I was able to complete your homework as follows (please take the time to learn from it):

import urllib
from bs4 import BeautifulSoup

# This function will get the Nth link object from the given url.
# To be safe you should make sure the nth link exists (I did not)
def getNthLink(url, n):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    return tags[n-1]

url = "https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html"

# This iterates 4 times, each time grabbing the 3rd link object
# For convenience it prints the url each time.
for i in xrange(4):
    tag = getNthLink(url,3)
    url = tag.get('href')
    print url

# Finally after 4 times we grab the content from the last tag
print tag.contents[0]
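
The comment in the answer above notes that it does not check whether the n-th link exists. Below is a minimal Python 3 sketch of one way to add that check and to try the follow-the-link loop without network access. It is an illustration under assumptions, not the answer's method: it swaps BeautifulSoup for the standard library's html.parser.HTMLParser so that it runs with no third-party packages, and the names AnchorCollector, nth_link, follow, and the PAGES dictionary are all hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._inside = False   # True while an <a> tag is open
        self._href = None      # href of the <a> tag currently open
        self._text = []        # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.links.append((self._href, "".join(self._text).strip()))
            self._inside = False


def nth_link(html, n):
    """Return the (href, text) pair of the n-th link, counting from 1."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    total = len(collector.links)
    if not 1 <= n <= total:
        raise IndexError("page has %d links, asked for link %d" % (total, n))
    return collector.links[n - 1]


def follow(fetch, url, count, position):
    """Follow the link at `position` `count` times; return the last pair."""
    link = None
    for _ in range(count):
        link = nth_link(fetch(url), position)
        url = link[0]
    return link


# Hypothetical in-memory pages standing in for real HTTP responses.
PAGES = {
    "a.html": '<a href="x.html">X</a><a href="b.html">Bea</a>',
    "b.html": '<a href="y.html">Y</a><a href="c.html">Cy</a>',
    "c.html": '<a href="z.html">Z</a>',
}

print(follow(PAGES.__getitem__, "a.html", 2, 2))   # ('c.html', 'Cy')
```

For real pages, fetch could be any function that takes a URL and returns the page as text, for example one that wraps urllib.request.urlopen(url).read().decode(). Passing the fetch function in as a parameter is only a convenience for testing the loop offline.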

Answer 2

A simple way to do this:

import urllib.request
from bs4 import BeautifulSoup

url = input('Enter url - ')
count = int(input('Enter count - '))
position = int(input('Enter position - '))

for i in range(count):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    tag = tags[position - 1]  # positions are 1-based, list indexes are 0-based
    url = tag.get('href', None)
    print('Retrieving: ' + url)

# Text of the last link that was followed
print(tag.contents[0])

Answer 3

I'm using Python 3.5, so here is an adjusted version (and without a def function):

import urllib.request
import bs4
import re

times = 7       # number of times to follow a link
line_numb = 18  # 1-based link position; converted to a 0-based index below
url = "http://python-data.dr-chuck.net/known_by_Terri.html"

for _ in range(times):
    html = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(html, "lxml")  # the "lxml" parser requires the lxml package
    tags = soup.find_all('a')
    url = tags[line_numb-1].get('href')

print(url)
# The dot is escaped so that it matches a literal ".html"
name = re.findall(r'known_by_(.*)\.html', url)
print(name)
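
This answer reads the name from the final URL rather than from the link text. Since re.findall returns a list, the last print shows a one-element list rather than a bare name. The short sketch below, which reuses the starting URL from the answer purely as sample input, shows the list result next to an re.search variant that yields a plain string; the variable names are illustrative.

```python
import re

# Sample input: the starting URL used in the answer above.
url = "http://python-data.dr-chuck.net/known_by_Terri.html"

# Escaping the dot makes the pattern match a literal ".html" suffix.
pattern = r"known_by_(.*)\.html"

# findall returns a list of every captured group it finds.
print(re.findall(pattern, url))    # ['Terri']

# search returns a match object (or None), from which one string can be taken.
match = re.search(pattern, url)
name = match.group(1) if match else None
print(name)                        # Terri
```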

Answer 4

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()       # To ignore SSL certificate Error
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE


url = input('Enter - ')

count=int(input('Enter count: '))

pos=int(input('Enter position: '))


for _ in range(count):                          # run the loop the requested number of times
    html = urlopen(url, context=ctx).read()     # open the URL and read the page
    soup = BeautifulSoup(html, "html.parser")
    tags = soup('a')                            # list of all anchor tags
    url = tags[pos-1].get('href', None)         # href at the requested position; list indexes start at zero, hence pos-1

    print(url)

4 answers

Answer 1
1 Accepted 2015-11-25 04:51:43

Answer 2
1 2020-11-11 11:12:47

Answer 3
0 2016-10-30 16:33:01

Answer 4
0 2018-01-07 08:23:43
