
How to loop through each page of a website for web scraping with BeautifulSoup

I am scraping job posting data from a website using BeautifulSoup. I have working code that does what I need, but it only scrapes the first page of job postings. I am having trouble figuring out how to iteratively update the URL to scrape each page. I am new to Python and have looked at a few different solutions to similar questions, but have not figured out how to apply them to my particular URL. I think I need to iteratively update the URL or somehow click the next button and then loop my existing code through each page. I appreciate any solutions.

URL: https://jobs.utcaerospacesystems.com/search-jobs

First, BeautifulSoup doesn't have anything to do with GETting web pages - you fetch the page yourself, then feed the HTML to bs4 for processing.
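To make that split concrete, here is a minimal sketch, not your actual code: `fetch_html` and `extract_links` are names I made up, the fetch uses the standard library's `urllib` (any HTTP client such as `requests` would do), and pulling out every `<a>` tag is only a stand-in for whatever your existing parsing does.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page. BeautifulSoup is not involved in this step."""
    # Hypothetical helper: some sites reject requests without a User-Agent.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html: str) -> list:
    """Parse already-downloaded HTML and return (text, href) for each link."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]
```

Because the parser only ever sees a string, you can loop over as many URLs as you like and pass each response through the same parsing function.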

The problem with the page you linked is that its content is rendered by JavaScript - it only renders correctly in a browser (or any other JavaScript runtime).

@Fabricator is on the right track - you'll need to watch the developer console and see which AJAX requests the JavaScript sends to the server. In this case, also take a look at the query string params, which include one called CurrentPage - that's probably the one you want to focus on.
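Once you have found the AJAX endpoint, the loop could look roughly like the sketch below. Only the `CurrentPage` parameter name comes from the site; the endpoint URL, the other query params, and the helper names (`page_url`, `scrape_all_pages`) are placeholders you would replace with what you see in the developer console. It also assumes that a page past the last one yields no items, which you should verify.

```python
from typing import Callable, List, Optional
from urllib.parse import urlencode


def page_url(endpoint: str, page: int, extra_params: Optional[dict] = None) -> str:
    """Build the URL for one results page by setting CurrentPage."""
    params = dict(extra_params or {})
    params["CurrentPage"] = page
    return f"{endpoint}?{urlencode(params)}"


def scrape_all_pages(
    endpoint: str,
    fetch: Callable[[str], str],
    parse: Callable[[str], list],
    max_pages: int = 100,
    extra_params: Optional[dict] = None,
) -> List:
    """Fetch page 1, 2, 3, ... and stop at the first page with no items.

    fetch: takes a URL, returns the response body (your HTTP client).
    parse: takes a response body, returns a list of items (your bs4 code).
    max_pages: safety cap so a misbehaving server cannot loop forever.
    """
    results = []
    for page in range(1, max_pages + 1):
        items = parse(fetch(page_url(endpoint, page, extra_params)))
        if not items:  # assumption: an empty page means there are no more results
            break
        results.extend(items)
    return results
```

Passing `fetch` and `parse` in as arguments keeps your existing first-page code unchanged: it becomes the `parse` function. If the endpoint returns JSON with an HTML fragment inside it, pull that fragment out in `parse` before handing it to BeautifulSoup.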


 