Scrape a Dynamic Website using Java with Selenium?

I'm trying to scrape https://www.rspca.org.uk/findapet#onSubmitSetHere to get a list of all pets for adoption.

I've built web scrapers before using crawler4j, but the websites were static.

Since https://www.rspca.org.uk/findapet#onSubmitSetHere is not a static website, how can I scrape it? Is it possible? What technologies should I use and how?

Update:

When you fill in the search form (Select type of pet and Enter postcode/town or county) in the UI, the results are then displayed below the search box.

[Screenshot: the search form with adoption results listed below it]

In the screenshot, the area highlighted in red is the search bar and the area highlighted in black is the results.

I'm trying to scrape the results and also the content of each result.

I've had a look at the request the browser makes to retrieve results, but from Chrome dev tools it isn't obvious what request is being made.

You could use Selenium to extract information from the DOM once a browser has rendered it, but I think a simpler solution is to use "developer tools" to find the request that the browser makes when the "search" button is clicked, and try to reproduce that.
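
For completeness, the Selenium route could look roughly like the sketch below (Selenium 4, Java). The field names, CSS selectors, and search values are assumptions for illustration only and would need to be checked against the real page.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.Select;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.time.Duration;
import java.util.List;

public class RspcaSeleniumSketch {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://www.rspca.org.uk/findapet");

            // Fill in the search form. "animalType", "location" and the submit
            // selector are placeholders, not verified against the actual markup.
            new Select(driver.findElement(By.name("animalType"))).selectByValue("DOG");
            driver.findElement(By.name("location")).sendKeys("London");
            driver.findElement(By.cssSelector("button[type='submit']")).click();

            // Wait for the results to be rendered, then read them from the DOM.
            WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
            List<WebElement> results = wait.until(
                ExpectedConditions.presenceOfAllElementsLocatedBy(
                    By.cssSelector(".search-result"))); // assumed result wrapper
            for (WebElement result : results) {
                System.out.println(result.getText());
            }
        } finally {
            driver.quit();
        }
    }
}
```

That said, reproducing the underlying request (described next) avoids driving a browser at all.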

In this case, clicking "search" makes a POST to https://www.rspca.org.uk/findapet?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search

The body of the POST request contains a lot of parameters, including animalType and location. The content-type of the request is application/x-www-form-urlencoded.

To see these parameters, go to the "Network" tab in Chrome dev tools, click on the "findapet" request (it's the first one in the list when I do this), and click on the "Payload" tab to see the query string parameters and the form parameters (which contain animalType and location).
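
As a sketch of reproducing that request, something like the following (Java 11+ java.net.http) could work. The animalType and location names come from the payload described above; the values used here are made up, and the real form almost certainly has more fields, so copy the full set from the "Payload" tab in dev tools.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

public class RspcaSearchRequest {

    // Encode a map of form fields as an application/x-www-form-urlencoded body.
    static String formBody(Map<String, String> fields) {
        return fields.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) throws Exception {
        String url = "https://www.rspca.org.uk/findapet"
            + "?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets"
            + "&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view"
            + "&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search";

        // Example values only; the exact field names and values should be taken
        // from the request the browser actually sends.
        String body = formBody(Map.of(
            "animalType", "DOG",
            "location", "London"));

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode());
        System.out.println(response.body()); // HTML containing the search results
    }
}
```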

The response contains HTML.

I would try making a request to that endpoint and then parsing the HTML in the response.
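
For the parsing step, a library such as jsoup would do; here is a minimal sketch, assuming a hypothetical .search-result wrapper and h3 heading around each listing (inspect the returned markup in dev tools to find the real elements).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseResults {
    public static void main(String[] args) {
        String html = "...";  // response.body() from the POST sketch above

        // Parse the returned HTML; the base URI lets relative links be resolved.
        Document doc = Jsoup.parse(html, "https://www.rspca.org.uk/");

        // ".search-result" and "h3" are assumptions about the markup, not verified.
        for (Element result : doc.select(".search-result")) {
            String name = result.select("h3").text();
            String link = result.select("a").attr("abs:href");
            System.out.println(name + " -> " + link);
        }
    }
}
```

Following each extracted link and repeating the fetch-and-parse step would then give the content of each individual result.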

