How to call a specific anchor tag and pass it back to the URL in a Python web scraper?
I'm working on a problem for an online class, where I'm supposed to use BeautifulSoup to build a simple web scraper.
Here is my progress so far:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
count = int(4)
position = int(3)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('a', None)
for tag in tags:
    print(tag.get('href', None))
My question is this: how do I extract a particular anchor tag from the list of tags in tags? Also, how can I make the for loop iterate only four times?
Assignment details:
Update:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
position = int(3)
count = int(4)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
for i in range(count):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    print(tags[position])
So I can get the tag at a given position this way, but I need to know how to follow the link at that position on each pass through the loop. As it is now, my program just prints the same link (the one at index 3) four times.
Got it!
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
position = int(17)
count = int(7)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
for i in range(count):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    url = soup('a')[position].get('href', None)
    print(url)
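The loop above works because url is reassigned on every pass, so each request fetches the page found on the previous pass. To check that logic without touching the network, here is a minimal sketch that runs against a made-up in-memory site. The PAGES dict, HrefCollector, and follow_links are hypothetical names for illustration, and the standard library's HTMLParser stands in for BeautifulSoup only so the snippet has no third-party dependency.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site": each page links to the three others.
PAGES = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a> <a href="d.html">D</a>',
    "b.html": '<a href="a.html">A</a> <a href="c.html">C</a> <a href="d.html">D</a>',
    "c.html": '<a href="a.html">A</a> <a href="b.html">B</a> <a href="d.html">D</a>',
    "d.html": '<a href="a.html">A</a> <a href="b.html">B</a> <a href="c.html">C</a>',
}


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def follow_links(start_url, position, count, fetch=PAGES.__getitem__):
    """Follow the link at index `position` on each page, `count` times."""
    url = start_url
    visited = []
    for _ in range(count):
        parser = HrefCollector()
        parser.feed(fetch(url))       # fetch() replaces urlopen(...).read()
        url = parser.hrefs[position]  # rebinding url is what advances the crawl
        visited.append(url)
    return visited


print(follow_links("a.html", position=1, count=3))
```

Swapping fetch for a function that downloads a real page would give roughly the behaviour of the loop above. Note that position is a zero-based index, so position=1 picks the second link on each page.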
As you already know, tags = soup('a') produces quite a long list of links.
You haven't said how you want to pick out one of the links. I'll assume that you're selecting by name. Here's how to search for Montgomery:
>>> soup.find_all(string='Montgomery')
['Montgomery']
Once you've got that, you can get the link ('a') element that contains 'Montgomery' in this way:
>>> soup.find_all(string='Montgomery')[0].findParent()
<a href="http://py4e-data.dr-chuck.net/known_by_Montgomery.html">Montgomery</a>
Then you can read the link element's href attribute, which is the actual URL for Montgomery:
>>> soup.find_all(string='Montgomery')[0].findParent().attrs['href']
'http://py4e-data.dr-chuck.net/known_by_Montgomery.html'
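The three steps above can probably be folded into a single lookup by asking BeautifulSoup for an 'a' tag whose text matches the name. This is only a sketch: it assumes a reasonably recent bs4 (one that accepts the string argument), href_for is a hypothetical helper name, and the HTML is made up apart from the Montgomery URL shown above.

```python
from bs4 import BeautifulSoup

# Made-up HTML in roughly the shape of the course page; only the Montgomery
# URL comes from the session above, the other entry is an assumption.
html = """
<ul>
<li><a href="http://example.com/known_by_Alice.html">Alice</a></li>
<li><a href="http://py4e-data.dr-chuck.net/known_by_Montgomery.html">Montgomery</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def href_for(soup, name):
    """Return the href of the first <a> whose text is exactly `name`, or None."""
    link = soup.find("a", string=name)
    return link.get("href") if link is not None else None


print(href_for(soup, "Montgomery"))
```

Returning None for a missing name avoids the IndexError that indexing an empty find_all result with [0] would raise.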
One way of going through a loop at most four times:
count = 0
for tag in tags:
    print(tag)  # placeholder: do whatever you need with the tag here
    count += 1
    if count >= 4:
        break
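As an alternative to the counter-and-break loop, slicing or itertools.islice caps the number of iterations without a manual counter. In this sketch a plain list of strings stands in for the result of soup('a'); that list and the variable names are assumptions for illustration.

```python
from itertools import islice

# Stand-in for the list-like result of soup('a'); plain strings keep the
# example free of third-party dependencies.
tags = ["link0", "link1", "link2", "link3", "link4", "link5"]

# Slicing works on lists and list-like sequences.
sliced = [tag for tag in tags[:4]]

# islice works on any iterable, including iterators that cannot be sliced.
lazy = list(islice(iter(tags), 4))

print(sliced)
print(lazy)
```

Both forms stop early without error when there are fewer than four items.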