
Chrome/Firefox web browser automation for collecting data

I would like to browse a website automatically to collect some data.

There's a page with a form. The form consists of a select element and a submit button. Selecting one of the select's options and clicking the submit button leads to another page that contains several tables of related data.

I need to collect this data for each option and save it to a file. I will probably need to go back to the first page to repeat the task for each option. The catch is that I don't know the exact number of options in advance.

My idea is to do this, preferably, with Firefox or Chrome. I think the only way to do it is programmatically.

Could someone point me to an easy and fast way to do this? I know a little Java, JavaScript and Python.

You might want to search for a "web browser automation" tool such as Selenium. Although it is not designed exactly for this purpose, I think it can be used to meet your requirement.

Since the task is relatively well constrained, I would avoid Selenium (it's a little brittle), and instead try this approach:

  • Get a comprehensive list of options from the first page and record it in a text file.
  • Using a network monitoring tool like Fiddler, capture the traffic that is sent when you submit the first page. See exactly what is submitted to the server, and how (POST vs GET, parameter encoding, etc).
  • Use a tool like curl to replay the request in the exact format that you captured in step 2. Then write a batch script (in bash or Python) that loops over the values in the text file from step 1 and runs curl for each value in the dropdown. Save the curl output to files.
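
The three steps above could also be folded into a single Python script that uses only the standard library in place of curl. The following is a minimal sketch under stated assumptions, not a definitive implementation: the URLs, the field name, and the helper names (`extract_option_values`, `build_request`, `scrape_all`) are all hypothetical, and the real method and parameter encoding should be taken from the traffic captured in step 2.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


class OptionCollector(HTMLParser):
    """Collects the value attribute of every <option> inside one named <select>."""

    def __init__(self, select_name):
        super().__init__()
        self.select_name = select_name
        self.inside = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("name") == self.select_name:
            self.inside = True
        elif tag == "option" and self.inside and attrs.get("value"):
            # Assumption: every real option has a non-empty value attribute;
            # placeholders such as <option value=""> are skipped.
            self.values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self.inside = False


def extract_option_values(html, select_name):
    """Step 1: list the option values, however many there are."""
    parser = OptionCollector(select_name)
    parser.feed(html)
    return parser.values


def build_request(action_url, field, value, method="POST"):
    """Step 3: rebuild the form submission for a single option value."""
    payload = urlencode({field: value})
    if method.upper() == "GET":
        return Request(f"{action_url}?{payload}")
    return Request(action_url, data=payload.encode("ascii"), method="POST")


def scrape_all(form_url, action_url, field, out_dir="results", method="POST"):
    """Fetch the form page, then save one result page per option value."""
    with urlopen(form_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    Path(out_dir).mkdir(exist_ok=True)
    for value in extract_option_values(html, field):
        with urlopen(build_request(action_url, field, value, method)) as response:
            # Percent-encode the value so it is safe to use as a file name.
            Path(out_dir, quote(value, safe="") + ".html").write_bytes(response.read())
```

A call such as `scrape_all("http://example.com/form", "http://example.com/report", "region")` (placeholder URLs and field name) would write one HTML file per option into `results/`. This sketch assumes the site needs no cookies, hidden fields, or JavaScript; if the captured traffic shows any of those, they would have to be added to the request.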

I found a solution to my problem. It's called HtmlUnit:

http://htmlunit.sourceforge.net/gettingStarted.html

HtmlUnit is a "GUI-Less browser for Java programs".

It allows web browsing and data collection from Java, and it's very simple and easy to use.

Not exactly what I asked for, but it's better. At least for me.

