How to retrieve information from a website using Beautiful Soup?
I have come across a task where I have to retrieve information from a website using a crawler. (URL: https://www.onepa.gov.sg/cat/adventure )
The website has multiple products. Each product contains a link that directs us to the webpage of that individual product, and I want to collect all of those links.
[screenshot of the HTML code]
For example, one of the products is named KNOTTY STUFF, and I expect to get the href of /class/details/c026829364
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, features="html.parser")
    return soup

url = "https://www.onepa.gov.sg/cat/adventure"
soup = get_soup(url)
for i in soup.findAll("a", {"target": "_blank"}):
    print(i.get("href"))
The output is:
https://tech.gov.sg/report_vulnerability
https://www.pa.gov.sg/feedback
which does not include what I was looking for: /class/details/c026829364
I appreciate any help or assistance, thank you!
It happens because the page uses dynamic JavaScript to prepare the span links, so you won't be able to accomplish it using plain requests.
Instead you should use Selenium with a webdriver to load all the links before scraping.
You can try downloading the ChromeDriver executable here. If you place it in the same folder as your script, you can run:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import WebDriverException
import os

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--window-size=1920x1080")
chrome_options.add_argument("--headless")
chrome_driver = os.getcwd() + "\\chromedriver.exe"  # CHANGE THIS PATH IF NOT SAME FOLDER
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)

url = "https://www.onepa.gov.sg/cat/adventure"
driver.get(url)

try:
    # Wait for the links to be ready
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, ".gridTitle > span > a"))
    )
except WebDriverException:
    print("Page offline")  # Added this because page is really unstable :(

elements = driver.find_elements_by_css_selector(".gridTitle > span > a")
links = [elem.get_attribute('href') for elem in elements]
print(links)
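As a side note, once the wait above has completed, the rendered HTML in driver.page_source can also be handed back to Beautiful Soup, so parsing code like the one in the question keeps working. A minimal sketch, using a hard-coded snippet in place of the live page source (the markup here is hypothetical, modeled only on the `.gridTitle > span > a` selector above):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after WebDriverWait completes;
# this markup is hypothetical, shaped after the CSS selector above.
page_source = """
<div class="gridTitle"><span>
  <a href="/class/details/c026829364" target="_blank">KNOTTY STUFF</a>
</span></div>
"""

soup = BeautifulSoup(page_source, "html.parser")
# soup.select accepts the same CSS selector used with Selenium
links = [a.get("href") for a in soup.select(".gridTitle > span > a")]
print(links)  # ['/class/details/c026829364']
```

This way the Selenium step is only responsible for rendering, and the extraction logic stays in one place.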
The website is loaded dynamically, therefore requests won't support it. However, the links are available by sending a POST request to:
https://www.onepa.gov.sg/sitecore/shell/WebService/Card.asmx/GetCategoryCard
Try searching for the links using the built-in re (regex) module:
import re
import requests

URL = "https://www.onepa.gov.sg/sitecore/shell/WebService/Card.asmx/GetCategoryCard"

headers = {
    "authority": "www.onepa.gov.sg",
    "accept": "application/json, text/javascript, */*; q=0.01",
    "x-requested-with": "XMLHttpRequest",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
    "content-type": "application/json; charset=UTF-8",
    "origin": "https://www.onepa.gov.sg",
    "sec-fetch-site": "same-origin",
    "sec-fetch-mode": "cors",
    "sec-fetch-dest": "empty",
    "referer": "https://www.onepa.gov.sg/cat/adventure",
    "cookie": "visid_incap_2318972=EttdbbMDQMeRolY+XzbkN8tR5l8AAAAAQUIPAAAAAAAjkedvsgJ6Zxxk2+19JR8Z; SC_ANALYTICS_GLOBAL_COOKIE=d6377e975a10472b868e47de9a8a0baf; _sp_ses.075f=*; ASP.NET_SessionId=vn435hvgty45y0fcfrold2hx; sc_pview_shuser=; __AntiXsrfToken=30b776672938487e90fc0d2600e3c6f8; BIGipServerpool_PAG21PAPRPX00_443=3138016266.47873.0000; incap_ses_7221_2318972=5BC1VKygmjGGtCXbUiU2ZNRS5l8AAAAARKX8luC4fGkLlxnme8Ydow==; font_multiplier=0; AMCVS_DF38E5285913269B0A495E5A%40AdobeOrg=1; _sp_ses.603a=*; SC_ANALYTICS_SESSION_COOKIE=A675B7DEE34A47F9803ED6D4EC4A8355|0|vn435hvgty45y0fcfrold2hx; _sp_id.603a=d539f6d1-732d-4fca-8568-e8494f8e584c.1608930022.1.1608930659.1608930022.bfeb4483-a418-42bb-ac29-42b6db232aec; _sp_id.075f=5e6c62fd-b91d-408e-a9e3-1ca31ee06501.1608929756.1.1608930947.1608929756.73caa28b-624c-4c21-9ad0-92fd2af81562; AMCV_DF38E5285913269B0A495E5A%40AdobeOrg=1075005958%7CMCIDTS%7C18622%7CMCMID%7C88630464609134511097093602739558212170%7CMCOPTOUT-1608938146s%7CNONE%7CvVersion%7C4.4.1",
}

data = '{"cat":"adventure", "subcat":"", "sort":"", "filter":"[filter]", "cp":"[cp]"}'

response = requests.post(URL, data=data, headers=headers)
print(re.findall(r"<Link>(.*)<", response.content.decode("unicode_escape")))
Output:
['/class/details/c026829364', '/interest/details/i000027991', '/interest/details/i000009714']
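If you would rather avoid regex, the <Link> elements can also be extracted with the standard-library XML parser once the response has been decoded. A minimal sketch on a hard-coded fragment (the surrounding element names here are hypothetical; only the <Link> tags are taken from the answer above):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the decoded service response;
# in practice this text would come from response.content.decode("unicode_escape")
sample = """<Cards>
  <Card><Link>/class/details/c026829364</Link></Card>
  <Card><Link>/interest/details/i000027991</Link></Card>
</Cards>"""

root = ET.fromstring(sample)
# iter("Link") walks the whole tree and yields every <Link> element
links = [link.text for link in root.iter("Link")]
print(links)
```

A real XML parser is more robust than a greedy regex here: `r"<Link>(.*)<"` can over-match if several links ever appear on one line.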