
Chrome/Firefox web browser automation for collecting data

Original post: 2013-06-05 00:24:44. Tags: java, javascript, google-chrome, firefox, automation

I would like to browse a website automatically to collect some data.

There's a page with a form. The form consists of a select element and a submit button. Selecting one of the options and clicking the submit button leads to another page that contains some tables with related data.

I need to collect this data for each option and save it to a file. I will probably need to go back to the first page to repeat the task for each option. The catch is that I don't know the exact number of options in advance.

My idea is to do this, preferably, with Firefox or Chrome. I think the only way to do it is programmatically.

Could someone point me to an easy and fast way to do this? I know a little Java, JavaScript and Python.

3 answers

You might want to google for a "web browser automation" tool like Selenium. Although it is not an exact fit for this purpose, I think it can be used to implement your requirement.
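As a rough illustration of the Selenium route, here is a minimal Python sketch using the Selenium 4 bindings. It assumes things the question does not state: that the page contains exactly one select element, that the submit control carries `type="submit"`, and that saving the raw HTML of each result page is good enough. The names `output_path`, `scrape_all_options` and the `out` directory are made up for this example. It counts the options at run time, so the number does not need to be known beforehand.

```python
import re
from pathlib import Path


def output_path(out_dir, index, label):
    """Build a file name such as out/003_Some_Option.html from an option label."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", label).strip("_") or "option"
    return Path(out_dir) / f"{index:03d}_{slug}.html"


def scrape_all_options(form_url, out_dir="out"):
    """Submit the form once per option and save each result page."""
    # Imported here so output_path() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    driver = webdriver.Firefox()  # or webdriver.Chrome()
    try:
        driver.get(form_url)
        # Assumption: the page has exactly one <select>.
        count = len(Select(driver.find_element(By.TAG_NAME, "select")).options)
        for i in range(count):
            # Reload the form each time instead of going back,
            # so element references are never stale.
            driver.get(form_url)
            select = Select(driver.find_element(By.TAG_NAME, "select"))
            label = select.options[i].text
            select.select_by_index(i)
            # Assumption: the submit control has type="submit".
            driver.find_element(By.CSS_SELECTOR, "[type=submit]").click()
            path = output_path(out_dir, i, label)
            path.write_text(driver.page_source, encoding="utf-8")
    finally:
        driver.quit()


# Example call (placeholder URL):
# scrape_all_options("http://example.com/form")
```

If the result page loads its tables with JavaScript after the click, an explicit wait would probably be needed before reading `page_source`.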

Since the task is relatively well constrained, I would avoid Selenium (it's a little brittle) and instead try this approach:

1. Get a comprehensive list of the options from the first page and record it in a text file.
2. Using a network monitoring tool like Fiddler, capture the traffic that is sent when you submit the first page. See exactly what is submitted to the server, and how (POST vs GET, parameter encoding, etc.).
3. Use a tool like curl to replay the request in the exact format that you captured in step 2. Then write a batch script (using bash or Python) that runs through all the values in the text file from step 1 and calls curl for each value in the dropdown. Save the curl output to files.
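The three steps above can also be sketched in a single Python script that uses only the standard library in place of curl. This is a sketch under assumptions: the form submits one field, named by the hypothetical `field_name` parameter, to a hypothetical `action_url`, with ordinary URL encoding. The capture from step 2 is what tells you the real field name, URL, method, and whether extra hidden fields or cookies are required. Options without a `value` attribute are skipped here for simplicity.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen


class OptionCollector(HTMLParser):
    """Collect the value attribute of every <option> in a page."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            value = dict(attrs).get("value")
            if value:  # skip placeholder options with an empty or missing value
                self.values.append(value)


def option_values(html):
    """Step 1: list the option values found in the form page."""
    collector = OptionCollector()
    collector.feed(html)
    return collector.values


def build_request(action_url, field_name, value, method="POST"):
    """Step 3: build the request for one option, as a GET query or a POST body."""
    encoded = urlencode({field_name: value})
    if method.upper() == "GET":
        return Request(f"{action_url}?{encoded}")
    return Request(action_url, data=encoded.encode("ascii"))


def fetch_all(form_url, action_url, field_name, method="POST"):
    """Replay the submission for every option and save each response to a file."""
    with urlopen(form_url) as resp:
        page = resp.read().decode("utf-8", errors="replace")
    for i, value in enumerate(option_values(page)):
        request = build_request(action_url, field_name, value, method)
        with urlopen(request) as resp, open(f"result_{i:03d}.html", "wb") as out:
            out.write(resp.read())


# Example call (placeholder URLs and field name):
# fetch_all("http://example.com/form", "http://example.com/results", "option")
```

Because the option list is read from the page on every run, the script does not depend on knowing how many options there are.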