
How to retrieve information from a website using Beautiful Soup?

I have come across a task where I have to retrieve information from a website using a crawler (url: https://www.onepa.gov.sg/cat/adventure ).

The website has multiple products. Each product contains a link that directs to the webpage of that individual product, and I want to collect all of those links.

[screenshot of the webpage]

[screenshot of the HTML code]

For example, one of the products is named KNOTTY STUFF, and I expect to get the href /class/details/c026829364.

import requests
from bs4 import BeautifulSoup


def get_soup(url):
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, features="html.parser")
    return soup

url = "https://www.onepa.gov.sg/cat/adventure"
soup = get_soup(url)
for i in soup.findAll("a", {"target": "_blank"}):
    print(i.get("href"))

The output is https://tech.gov.sg/report_vulnerability and https://www.pa.gov.sg/feedback, which does not include what I was looking for: /class/details/c026829364.

I appreciate any help or assistance, thank you!

It happens because the page uses dynamic JavaScript to prepare the links inside the spans, so you won't be able to accomplish it using normal requests.
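For instance, a quick check like the following (a minimal sketch, assuming the href from the question) should show that the link never appears in the raw HTML that requests receives:

import requests

# The expected href should not appear in the raw HTML,
# since the links are only injected by JavaScript after the page loads.
html = requests.get("https://www.onepa.gov.sg/cat/adventure").text
print("/class/details/c026829364" in html)  # expected: False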

Instead you should use selenium with a webdriver to load all the links before scraping.

You can try downloading the ChromeDriver executable here. If you place it in the same folder as your script, you can run:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import WebDriverException
import os

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--window-size=1920x1080")
chrome_options.add_argument("--headless")
chrome_driver = os.getcwd() + "\\chromedriver.exe"  # CHANGE THIS PATH IF NOT SAME FOLDER
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)

url = "https://www.onepa.gov.sg/cat/adventure"
driver.get(url)

try:
    # Wait for the links to be ready
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, ".gridTitle > span > a"))
    )
except WebDriverException:
    print("Page offline")  # Added this because page is really unstable :(

elements = driver.find_elements_by_css_selector(".gridTitle > span > a")
links = [elem.get_attribute('href') for elem in elements]
print(links)
driver.quit()  # close the headless browser when finished

The website is loaded dynamically, so requests alone won't get the links. However, the links are available by sending a POST request to:

https://www.onepa.gov.sg/sitecore/shell/WebService/Card.asmx/GetCategoryCard

Try searching for the links using the built-in re (regex) module:

import re
import requests


URL = "https://www.onepa.gov.sg/sitecore/shell/WebService/Card.asmx/GetCategoryCard"

headers = {
    "authority": "www.onepa.gov.sg",
    "accept": "application/json, text/javascript, */*; q=0.01",
    "x-requested-with": "XMLHttpRequest",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
    "content-type": "application/json; charset=UTF-8",
    "origin": "https://www.onepa.gov.sg",
    "sec-fetch-site": "same-origin",
    "sec-fetch-mode": "cors",
    "sec-fetch-dest": "empty",
    "referer": "https://www.onepa.gov.sg/cat/adventure",
    "cookie": "visid_incap_2318972=EttdbbMDQMeRolY+XzbkN8tR5l8AAAAAQUIPAAAAAAAjkedvsgJ6Zxxk2+19JR8Z; SC_ANALYTICS_GLOBAL_COOKIE=d6377e975a10472b868e47de9a8a0baf; _sp_ses.075f=*; ASP.NET_SessionId=vn435hvgty45y0fcfrold2hx; sc_pview_shuser=; __AntiXsrfToken=30b776672938487e90fc0d2600e3c6f8; BIGipServerpool_PAG21PAPRPX00_443=3138016266.47873.0000; incap_ses_7221_2318972=5BC1VKygmjGGtCXbUiU2ZNRS5l8AAAAARKX8luC4fGkLlxnme8Ydow==; font_multiplier=0; AMCVS_DF38E5285913269B0A495E5A%40AdobeOrg=1; _sp_ses.603a=*; SC_ANALYTICS_SESSION_COOKIE=A675B7DEE34A47F9803ED6D4EC4A8355|0|vn435hvgty45y0fcfrold2hx; _sp_id.603a=d539f6d1-732d-4fca-8568-e8494f8e584c.1608930022.1.1608930659.1608930022.bfeb4483-a418-42bb-ac29-42b6db232aec; _sp_id.075f=5e6c62fd-b91d-408e-a9e3-1ca31ee06501.1608929756.1.1608930947.1608929756.73caa28b-624c-4c21-9ad0-92fd2af81562; AMCV_DF38E5285913269B0A495E5A%40AdobeOrg=1075005958%7CMCIDTS%7C18622%7CMCMID%7C88630464609134511097093602739558212170%7CMCOPTOUT-1608938146s%7CNONE%7CvVersion%7C4.4.1",
}

data = '{"cat":"adventure", "subcat":"", "sort":"", "filter":"[filter]", "cp":"[cp]"}'

response = requests.post(URL, data=data,  headers=headers)
print(re.findall(r"<Link>(.*)<", response.content.decode("unicode_escape")))

Output:

['/class/details/c026829364', '/interest/details/i000027991', '/interest/details/i000009714']
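The returned hrefs are relative paths; if absolute URLs are needed, they can be joined with the site's base URL, for example with urllib.parse.urljoin (a small usage sketch, assuming https://www.onepa.gov.sg as the base):

from urllib.parse import urljoin

# Join the relative hrefs with the site's base URL
# (assumption: https://www.onepa.gov.sg is the correct base for these paths).
base = "https://www.onepa.gov.sg"
links = ['/class/details/c026829364', '/interest/details/i000027991', '/interest/details/i000009714']
print([urljoin(base, link) for link in links])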
