
Open and save a web page with Selenium in Java

I need to get the content of some web pages, such as "http://www.ncbi.nlm.nih.gov/nuccore/NM_007002", for my project. The problem is that I need to open the page in a browser and save it to get the full content (if I use the URL and BufferedReader classes, I get the "frame" of the page but not the text I need). My professor told me to use Selenium to open and download the pages I need, and then read and parse the relevant information.

Unfortunately, I can't find an example of Java code that opens and saves a web page. Can anyone explain to me how to do this?

I want to SAVE the page to my computer, not copy the source and save it to a file. Not all of the information appears in the source! It's hidden.

In Selenium you can do this:

SafariDriver driver = new SafariDriver(); // any driver works, e.g. ChromeDriver or FirefoxDriver
driver.get("your link");
String pageSource = driver.getPageSource(); // now you have the page source
// save pageSource to a file or do whatever you want with it
driver.quit(); // close the browser when you are done

See the getPageSource docs for details.
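The snippet above leaves the saving step as a comment. Here is a minimal sketch of that step using only the Java standard library. The class name PageSaver, the method savePage, and the output file name are made up for illustration, and the placeholder string stands in for whatever driver.getPageSource() returned.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class PageSaver {
    // Writes the given page source to the target file as UTF-8,
    // creating missing parent directories and overwriting an existing file.
    static Path savePage(String pageSource, Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.write(target, pageSource.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder: in real use this would be driver.getPageSource()
        String pageSource = "<html><body>placeholder</body></html>";
        Path saved = savePage(pageSource, Paths.get("NM_007002.html"));
        System.out.println("Saved " + Files.size(saved) + " bytes to " + saved);
    }
}
```

With a live driver, the call would look like savePage(driver.getPageSource(), Paths.get("NM_007002.html")). Writing the bytes explicitly as UTF-8 avoids depending on the platform default encoding.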

If you want to get data from a specific tag, for example body, you can do this:

String pageSource=driver.findElement(By.tagName("body")).getText();

Keep in mind that Selenium is meant for web page automation, that is, for interacting with pages automatically. If the source is all you really need, you can use JSoup, a really solid Java HTML parser. In two lines of code, you should have your source:

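import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

try {
    // fetch and parse the page; the timeout is in milliseconds
    Document doc = Jsoup.connect("http://www.ncbi.nlm.nih.gov/nuccore/NM_007002").userAgent("Mozilla/5.0").timeout(30000).get();
    System.out.println(doc.toString());
} catch (IOException e) {
    e.printStackTrace();
}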


 