
Using a link data path to access a link that is web-scraped using Python

The output seems to be a data path for a link, but I am not sure, as I am very new to web scraping and Python. How do I use the output to access the link that it points to, using Python?

Page source information


import urllib
import urllib.request
from bs4 import BeautifulSoup


def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'lxml')
    return soupdata


soup = make_soup(
    "https://www.bahamas.gov.bs/wps/portal/public/!ut/p/b1/vZTbkqo4FIafpR-AJuHsJQKikASQBIUby7OiNAooh6cf9lRXTe2e6nYu9pB1lcq_6sv610r4hF_yycf6eT6uq3P-sb7-2ifKSgQ21nVJw55AR2AGDai4TAaaqfSCuBeAb5YOfs-3ZaCAGfOR7qtTwXMFfsEvp3GBjBzXlm6snHJf38UsiZuJizgnQS1y8nMmmcvlaipAv65rE2utc9vm3kVzTgdDHZ2kKJ44682y3chysjMemFM2N3pD4-cKJQwzAe8uz8k4bh60Kq14mnjdoyDciCqWYqQm2IDDBLqkfVzePuv54cJ_12N7ODD6Y02RfvlBycy0fNEW4df8r4ZJ_zH_e8GLfiz45EfLJfAp-KllryAvBBJPpnm25-Nepn4rGwOe8ksgrcK0vc26SzdPQePiLpoR02oBYnXIOkao1bePkDINWkAjJ7zITsgaTGggVBX1d9GcjXVjfJXm6c_AECnDArUxHBqo_u9AVxB7IBIR0QwY2uLAQO3PD43DJ-dN9l5vs3fwro1EYSSrQJM0KEMA-SiNlcYsZ7VlkdVpzy42ybcbZDRabV8NUNCrNXe310lLt8-D1d6yxqbqfjOeOJxie5Gk68nuut6LxwVg-63PHgk24rUebB1WcfNAyF2ke8gvSGHUSmCtqKSV482OMTLlan3ZhPfd5Bmt0706D65GgWLiHXarnE63DWbX2JdULTufmpbQZ-TeMA2jBEwTuGDhIS-5c5oQM8EOKg57Rd0WGVe11aN5MUbmv8eoCwCmvck07mDH6vLSLXqTRVKtITVjCDasxqwEuMNi1c0J3U0_TTZ989g-Xr1MaVigRpWhgeLAQBEMDRSGBsoDA52hp1T88-_wt89OBhAKItRkQVQVVeCjHhWkWp27YVazNApIeHyi0QFRXEShpvoRXeQtFh0XxtA1UpdCC4bWpQzD9oasU0RY6nVt5XThc57Ws869FH05ATn44k66gZN0z-2MGYprcHBs0DJf3I8fsve460eduI8Q1m9v_C1jT9dFytw6mMvun9ivua8R_wVxbY3o/dl4/d5/L2dBISEvZ0FBIS9nQSEh/")

link = soup.find('div', {"class": "module"}).findAll('a')[0]
url_req = link.get('href')
print(url_req)

Output:

?1dmy&urile=wcm%3apath%3a/mof_content/internet/moh/government/news+and+press+release/covid-19+report+update+227

When scraping an href, the initial link needs to be combined with the output:

import urllib
import urllib.request
from bs4 import BeautifulSoup


def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'lxml')
    return soupdata


soup = make_soup(
    "https://www.bahamas.gov.bs/wps/portal/public/!ut/p/b1/vZTbkqo4FIafpR-AJuHsJQKikASQBIUby7OiNAooh6cf9lRXTe2e6nYu9pB1lcq_6sv610r4hF_yycf6eT6uq3P-sb7-2ifKSgQ21nVJw55AR2AGDai4TAaaqfSCuBeAb5YOfs-3ZaCAGfOR7qtTwXMFfsEvp3GBjBzXlm6snHJf38UsiZuJizgnQS1y8nMmmcvlaipAv65rE2utc9vm3kVzTgdDHZ2kKJ44682y3chysjMemFM2N3pD4-cKJQwzAe8uz8k4bh60Kq14mnjdoyDciCqWYqQm2IDDBLqkfVzePuv54cJ_12N7ODD6Y02RfvlBycy0fNEW4df8r4ZJ_zH_e8GLfiz45EfLJfAp-KllryAvBBJPpnm25-Nepn4rGwOe8ksgrcK0vc26SzdPQePiLpoR02oBYnXIOkao1bePkDINWkAjJ7zITsgaTGggVBX1d9GcjXVjfJXm6c_AECnDArUxHBqo_u9AVxB7IBIR0QwY2uLAQO3PD43DJ-dN9l5vs3fwro1EYSSrQJM0KEMA-SiNlcYsZ7VlkdVpzy42ybcbZDRabV8NUNCrNXe310lLt8-D1d6yxqbqfjOeOJxie5Gk68nuut6LxwVg-63PHgk24rUebB1WcfNAyF2ke8gvSGHUSmCtqKSV482OMTLlan3ZhPfd5Bmt0706D65GgWLiHXarnE63DWbX2JdULTufmpbQZ-TeMA2jBEwTuGDhIS-5c5oQM8EOKg57Rd0WGVe11aN5MUbmv8eoCwCmvck07mDH6vLSLXqTRVKtITVjCDasxqwEuMNi1c0J3U0_TTZ989g-Xr1MaVigRpWhgeLAQBEMDRSGBsoDA52hp1T88-_wt89OBhAKItRkQVQVVeCjHhWkWp27YVazNApIeHyi0QFRXEShpvoRXeQtFh0XxtA1UpdCC4bWpQzD9oasU0RY6nVt5XThc57Ws869FH05ATn44k66gZN0z-2MGYprcHBs0DJf3I8fsve460eduI8Q1m9v_C1jT9dFytw6mMvun9ivua8R_wVxbY3o/dl4/d5/L2dBISEvZ0FBIS9nQSEh/")

link = soup.find('div', {"class": "module"}).findAll('a')[0]
temp = link.get('href')
print(temp)

url_req = "https://www.bahamas.gov.bs/wps/portal/public/!ut/p/b1/vZTbkqo4FIafpR-AJuHsJQKikASQBIUby7OiNAooh6cf9lRXTe2e6nYu9pB1lcq_6sv610r4hF_yycf6eT6uq3P-sb7-2ifKSgQ21nVJw55AR2AGDai4TAaaqfSCuBeAb5YOfs-3ZaCAGfOR7qtTwXMFfsEvp3GBjBzXlm6snHJf38UsiZuJizgnQS1y8nMmmcvlaipAv65rE2utc9vm3kVzTgdDHZ2kKJ44682y3chysjMemFM2N3pD4-cKJQwzAe8uz8k4bh60Kq14mnjdoyDciCqWYqQm2IDDBLqkfVzePuv54cJ_12N7ODD6Y02RfvlBycy0fNEW4df8r4ZJ_zH_e8GLfiz45EfLJfAp-KllryAvBBJPpnm25-Nepn4rGwOe8ksgrcK0vc26SzdPQePiLpoR02oBYnXIOkao1bePkDINWkAjJ7zITsgaTGggVBX1d9GcjXVjfJXm6c_AECnDArUxHBqo_u9AVxB7IBIR0QwY2uLAQO3PD43DJ-dN9l5vs3fwro1EYSSrQJM0KEMA-SiNlcYsZ7VlkdVpzy42ybcbZDRabV8NUNCrNXe310lLt8-D1d6yxqbqfjOeOJxie5Gk68nuut6LxwVg-63PHgk24rUebB1WcfNAyF2ke8gvSGHUSmCtqKSV482OMTLlan3ZhPfd5Bmt0706D65GgWLiHXarnE63DWbX2JdULTufmpbQZ-TeMA2jBEwTuGDhIS-5c5oQM8EOKg57Rd0WGVe11aN5MUbmv8eoCwCmvck07mDH6vLSLXqTRVKtITVjCDasxqwEuMNi1c0J3U0_TTZ989g-Xr1MaVigRpWhgeLAQBEMDRSGBsoDA52hp1T88-_wt89OBhAKItRkQVQVVeCjHhWkWp27YVazNApIeHyi0QFRXEShpvoRXeQtFh0XxtA1UpdCC4bWpQzD9oasU0RY6nVt5XThc57Ws869FH05ATn44k66gZN0z-2MGYprcHBs0DJf3I8fsve460eduI8Q1m9v_C1jT9dFytw6mMvun9ivua8R_wVxbY3o/dl4/d5/L2dBISEvZ0FBIS9nQSEh/" + temp
print(url_req)

You can use urllib.parse.urljoin to get the absolute link from the relative link:

import urllib.parse
import urllib.request
from bs4 import BeautifulSoup


def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'lxml')
    return soupdata

web_url = "https://www.bahamas.gov.bs/wps/portal/public/!ut/p/b1/vZTbkqo4FIafpR-AJuHsJQKikASQBIUby7OiNAooh6cf9lRXTe2e6nYu9pB1lcq_6sv610r4hF_yycf6eT6uq3P-sb7-2ifKSgQ21nVJw55AR2AGDai4TAaaqfSCuBeAb5YOfs-3ZaCAGfOR7qtTwXMFfsEvp3GBjBzXlm6snHJf38UsiZuJizgnQS1y8nMmmcvlaipAv65rE2utc9vm3kVzTgdDHZ2kKJ44682y3chysjMemFM2N3pD4-cKJQwzAe8uz8k4bh60Kq14mnjdoyDciCqWYqQm2IDDBLqkfVzePuv54cJ_12N7ODD6Y02RfvlBycy0fNEW4df8r4ZJ_zH_e8GLfiz45EfLJfAp-KllryAvBBJPpnm25-Nepn4rGwOe8ksgrcK0vc26SzdPQePiLpoR02oBYnXIOkao1bePkDINWkAjJ7zITsgaTGggVBX1d9GcjXVjfJXm6c_AECnDArUxHBqo_u9AVxB7IBIR0QwY2uLAQO3PD43DJ-dN9l5vs3fwro1EYSSrQJM0KEMA-SiNlcYsZ7VlkdVpzy42ybcbZDRabV8NUNCrNXe310lLt8-D1d6yxqbqfjOeOJxie5Gk68nuut6LxwVg-63PHgk24rUebB1WcfNAyF2ke8gvSGHUSmCtqKSV482OMTLlan3ZhPfd5Bmt0706D65GgWLiHXarnE63DWbX2JdULTufmpbQZ-TeMA2jBEwTuGDhIS-5c5oQM8EOKg57Rd0WGVe11aN5MUbmv8eoCwCmvck07mDH6vLSLXqTRVKtITVjCDasxqwEuMNi1c0J3U0_TTZ989g-Xr1MaVigRpWhgeLAQBEMDRSGBsoDA52hp1T88-_wt89OBhAKItRkQVQVVeCjHhWkWp27YVazNApIeHyi0QFRXEShpvoRXeQtFh0XxtA1UpdCC4bWpQzD9oasU0RY6nVt5XThc57Ws869FH05ATn44k66gZN0z-2MGYprcHBs0DJf3I8fsve460eduI8Q1m9v_C1jT9dFytw6mMvun9ivua8R_wVxbY3o/dl4/d5/L2dBISEvZ0FBIS9nQSEh"
soup = make_soup(web_url)

link = soup.find('div', {"class": "module"}).findAll('a')[0]
url_req = link.get('href')

absolute_link = urllib.parse.urljoin(web_url, url_req)
print(absolute_link)
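
For reference, a minimal sketch (with a hypothetical base URL in place of the long portal URL above) of how urljoin resolves a query-only href like the scraped one:

```python
from urllib.parse import urljoin

# Hypothetical base URL standing in for the long portal URL above
base = "https://example.com/portal/page/"

# A query-only relative href, like the "?1dmy&urile=..." value scraped above
href = "?foo=bar"

# urljoin keeps the base URL's scheme, host, and path,
# and attaches the relative reference's query string
print(urljoin(base, href))  # https://example.com/portal/page/?foo=bar
```

This is why urljoin is safer than plain string concatenation: it follows the standard URL-resolution rules regardless of whether the href is query-only, path-relative, or absolute.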
