
How to call a specific anchor tag and pass it back to the url in a Python webscraper?

I'm working on a problem for an online class, where I'm supposed to use BeautifulSoup to build a simple web scraper.

Here is my progress so far:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

count = 4
position = 3

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'

html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

My question is this: how do I extract a particular anchor tag from the list in tags? Also, how can I make the for loop iterate only four times?

Assignment details: (screenshot in the original post)

Update:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

position = 3
count = 4

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')

for i in range(count):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    print(tags[position])

So I can call a tag at a position this way, but I need the loop to follow that link on each pass. As it is now, my program just prints the third link four times.

Got it!

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

position = 17
count = 7

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')

for i in range(count):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    url = soup('a')[position].get('href', None)
    print(url)
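The key to that fix is reassigning url inside the loop, so each pass fetches the page the previous pass linked to. The indexing step itself can be checked offline; here is a minimal sketch (the HTML snippet and names are made up for illustration, not part of the assignment):

```python
from bs4 import BeautifulSoup

# A tiny stand-in page; in the real program this HTML comes from urlopen().
html = """
<html><body>
<a href="known_by_Anne.html">Anne</a>
<a href="known_by_Bob.html">Bob</a>
<a href="known_by_Cora.html">Cora</a>
</body></html>
"""

position = 1  # zero-based: the second link on the page
soup = BeautifulSoup(html, "html.parser")

# soup('a') is shorthand for soup.find_all('a'); indexing picks one tag,
# and .get('href') reads its attribute (None if the attribute is missing).
url = soup('a')[position].get('href', None)
print(url)  # known_by_Bob.html
```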

As you already know, tags = soup('a') produces quite a long list of links.

You haven't said how you want to search for one of the links. I'll assume that you're selecting by name. Here's how to search for Montgomery:

>>> soup.find_all(string='Montgomery')
['Montgomery']

Once you've got that, you can get the link ('a') element that contains 'Montgomery' this way:

>>> soup.find_all(string='Montgomery')[0].findParent()
<a href="http://py4e-data.dr-chuck.net/known_by_Montgomery.html">Montgomery</a>

Then you can get the href attribute of the link element, which is the actual url for Montgomery:

>>> soup.find_all(string='Montgomery')[0].findParent().attrs['href']
'http://py4e-data.dr-chuck.net/known_by_Montgomery.html'
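Those three steps can be wrapped into one helper. A small sketch, using a made-up sample page (the function name href_for_name is mine, not from the library):

```python
from bs4 import BeautifulSoup

def href_for_name(soup, name):
    """Return the href of the first <a> whose text is exactly `name`, or None."""
    matches = soup.find_all(string=name)  # matching NavigableString objects
    if not matches:
        return None
    # The matched string's parent is the enclosing <a> element.
    return matches[0].find_parent().attrs.get('href')

html = '<p><a href="known_by_Montgomery.html">Montgomery</a></p>'
soup = BeautifulSoup(html, "html.parser")
print(href_for_name(soup, "Montgomery"))  # known_by_Montgomery.html
```

Note that find_parent is the modern spelling of findParent; both work in current BeautifulSoup.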

One way of going through a loop at most four times:

count = 0
for tag in tags:
    # do something with tag
    count += 1
    if count >= 4:
        break
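Slicing is a more compact alternative: tags[:4] yields at most the first four items, so no counter is needed. A quick sketch with a made-up page (the href values here are mine, for illustration):

```python
from bs4 import BeautifulSoup

html = "".join(f'<a href="page{i}.html">link {i}</a>' for i in range(10))
soup = BeautifulSoup(html, "html.parser")
tags = soup('a')

# Iterate over at most the first four tags; this also works
# if the page has fewer than four links.
first_four = [tag.get('href') for tag in tags[:4]]
print(first_four)  # ['page0.html', 'page1.html', 'page2.html', 'page3.html']
```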
