Retrieve links from a web page using Python and BeautifulSoup, then select the 3rd link and follow it 4 times
Here is the code:
import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
This prints 18 links. Now I need to take the link at position 3 (the third link in the output), feed that link back in as the new URL, run it again, and repeat this 4 times. Then, from whatever URL ends up at position 3 on the last pass, print out the name.
[https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html][1]
Running the code on the above HTML returns 18 links. We need to select the 3rd link, provide it as the new input to 'url', and run the loop above 4 times. Whatever the last link turns out to be, extract the name from it, just as 'Fikret' is the name in the first link; the name in the last link is our output. Hope this helps. Thank you for looking into it.
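The link-picking step can be tried out without BeautifulSoup or network access. This is only a sketch using the standard library's `html.parser`; the `LinkCollector` class, `nth_link` helper, and the tiny `page` string are all hypothetical stand-ins for the real assignment pages:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from anchor tags, in document order."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

def nth_link(html, n):
    """Return the nth (1-based) href in the page, mirroring tags[n-1]."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links[n - 1]

# Tiny made-up page standing in for one assignment page.
page = ('<a href="known_by_A.html">A</a>'
        '<a href="known_by_B.html">B</a>'
        '<a href="known_by_C.html">C</a>')
print(nth_link(page, 3))  # known_by_C.html
```

The same 1-based-to-0-based conversion (`n - 1`) appears in every answer below.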
I was able to accomplish your homework in the following way (please take the time to learn this):
import urllib
from bs4 import BeautifulSoup

# This function gets the nth link object from the given url.
# To be safe you should check that the nth link exists (I did not).
def getNthLink(url, n):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    return tags[n - 1]

url = "https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html"

# This iterates 4 times, each time grabbing the 3rd link object.
# For convenience it prints the url each time.
for i in xrange(4):
    tag = getNthLink(url, 3)
    url = tag.get('href')
    print url

# Finally, after 4 iterations, grab the text content of the last tag.
print tag.contents[0]
An easy way to do this:
import urllib.request
from bs4 import BeautifulSoup

url = input('Enter url - ')
count = int(input('Enter count - '))
position = int(input('Enter position - '))

for i in range(count):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    tag = tags[position - 1]     # get the tag at the requested position
    url = tag.get('href', None)  # its href becomes the url for the next pass
    print('Retrieving: ' + url)
    print(tag.contents[0])
I'm using Python 3.5, so here's the adapted version (without def functions):
import urllib.request
import bs4
import re

times = 7       # number of times to follow the url
line_numb = 18  # 1-based position; converted to a Python index below

url = "http://python-data.dr-chuck.net/known_by_Terri.html"
for _ in range(times):
    html = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(html, "lxml")
    tags = soup.find_all('a')
    url = tags[line_numb - 1].get('href')
    print(url)

name = re.findall('known_by_(.*).html', url)
print(name)
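The `re.findall` pattern used above can be sanity-checked offline against a URL of the same shape (the Terri URL is the one from the snippet; no network call is needed for this part):

```python
import re

url = "http://python-data.dr-chuck.net/known_by_Terri.html"
# '(.*)' captures everything between 'known_by_' and '.html'
name = re.findall('known_by_(.*).html', url)
print(name)  # ['Terri']
```

Note that `re.findall` returns a list, so the bare name is `name[0]`.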
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
count = int(input('Enter count: '))
pos = int(input('Enter position: '))

i = 1
while True:
    if i > count:  # run the loop the specified number of times
        break
    i = i + 1
    html = urlopen(url, context=ctx).read()  # open the url
    soup = BeautifulSoup(html, "html.parser")
    tags = soup('a')                      # list of anchor tags
    url = tags[pos - 1].get('href', None) # tag at the given position; indexing starts at zero, hence pos-1
    print(url)
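None of the versions above can be exercised without network access, but the repeat-count and position logic they all share can be checked with a tiny in-memory stand-in for the chain of pages. Everything here is hypothetical: the page names and the `follow` helper are made up purely to illustrate the loop:

```python
# A made-up "site": each page maps to its list of outgoing links.
pages = {
    "start.html":        ["known_by_A.html", "known_by_B.html", "known_by_C.html"],
    "known_by_C.html":   ["known_by_D.html", "known_by_E.html", "known_by_F.html"],
    "known_by_F.html":   ["known_by_G.html", "known_by_H.html", "known_by_I.html"],
    "known_by_I.html":   ["known_by_J.html", "known_by_K.html", "known_by_L.html"],
}

def follow(url, count, position):
    """Repeat 'take the link at position' count times, as the loops above do."""
    for _ in range(count):
        url = pages[url][position - 1]
    return url

print(follow("start.html", 4, 3))  # known_by_L.html
```

With count=4 and position=3, the walk is start → C → F → I → L, which matches what the answers do against the real assignment pages.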
Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please cite this site or the original source. For any questions, contact: yoyou2525@163.com.