Scrape a Dynamic Website using Java with Selenium?

I'm trying to scrape https://www.rspca.org.uk/findapet#onSubmitSetHere to get a list of all pets for adoption.

I've built web scrapers before using crawler4j, but the websites were static.

Since https://www.rspca.org.uk/findapet#onSubmitSetHere is not a static website, how can I scrape it? Is it possible? What technologies should I use and how?

Update:

When you fill in the search form (Select type of pet and Enter postcode/town or county) in the UI, the results are then displayed below the search box.

[Screenshot: the search form with adoption results listed below it]

In the screenshot, the area highlighted in red is the search bar and the area highlighted in black is the results.

I'm trying to scrape the results and also the content of each result.

I've had a look at the request the browser makes to retrieve results, but from Chrome dev tools it isn't obvious what request is being made.

You could use Selenium to extract information from the DOM once a browser has rendered it, but I think a simpler solution is to use "developer tools" to find the request that the browser makes when the "search" button is clicked, and try to reproduce that.
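
For completeness, the Selenium route could look roughly like the sketch below (Selenium 4, Java). The field names, CSS selectors, and search values are assumptions for illustration only and would need to be checked against the real page.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.Select;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.time.Duration;
import java.util.List;

public class RspcaSeleniumSketch {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://www.rspca.org.uk/findapet");

            // Fill in the search form. "animalType", "location" and the submit
            // selector are placeholders, not verified against the actual markup.
            new Select(driver.findElement(By.name("animalType"))).selectByValue("DOG");
            driver.findElement(By.name("location")).sendKeys("London");
            driver.findElement(By.cssSelector("button[type='submit']")).click();

            // Wait for the results to be rendered, then read them from the DOM.
            WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
            List<WebElement> results = wait.until(
                ExpectedConditions.presenceOfAllElementsLocatedBy(
                    By.cssSelector(".search-result"))); // assumed result wrapper
            for (WebElement result : results) {
                System.out.println(result.getText());
            }
        } finally {
            driver.quit();
        }
    }
}
```

That said, reproducing the underlying request (described next) avoids driving a browser at all.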

In this case, clicking "search" makes a POST to https://www.rspca.org.uk/findapet?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search

The body of the POST request contains a lot of parameters, including animalType and location. The content-type of the request is application/x-www-form-urlencoded.

To see these parameters, go to the "Network" tab in Chrome dev tools, click on the "findapet" request (it's the first one in the list when I do this), and click on the "Payload" tab to see the query string parameters and the form parameters (which contain animalType and location).
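
As a sketch of reproducing that request, something like the following (Java 11+ java.net.http) could work. The animalType and location names come from the payload described above; the values used here are made up, and the real form almost certainly has more fields, so copy the full set from the "Payload" tab in dev tools.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

public class RspcaSearchRequest {

    // Encode a map of form fields as an application/x-www-form-urlencoded body.
    static String formBody(Map<String, String> fields) {
        return fields.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) throws Exception {
        String url = "https://www.rspca.org.uk/findapet"
            + "?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets"
            + "&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view"
            + "&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search";

        // Example values only; the exact field names and values should be taken
        // from the request the browser actually sends.
        String body = formBody(Map.of(
            "animalType", "DOG",
            "location", "London"));

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode());
        System.out.println(response.body()); // HTML containing the search results
    }
}
```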

The response contains HTML.

I would try making a request to that endpoint and then parsing the HTML in the response.
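
For the parsing step, a library such as jsoup would do; here is a minimal sketch, assuming a hypothetical .search-result wrapper and h3 heading around each listing (inspect the returned markup in dev tools to find the real elements).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseResults {
    public static void main(String[] args) {
        String html = "...";  // response.body() from the POST sketch above

        // Parse the returned HTML; the base URI lets relative links be resolved.
        Document doc = Jsoup.parse(html, "https://www.rspca.org.uk/");

        // ".search-result" and "h3" are assumptions about the markup, not verified.
        for (Element result : doc.select(".search-result")) {
            String name = result.select("h3").text();
            String link = result.select("a").attr("abs:href");
            System.out.println(name + " -> " + link);
        }
    }
}
```

Following each extracted link and repeating the fetch-and-parse step would then give the content of each individual result.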

