
Selenium, PhantomJS and Geb headless scraping

I need to scrape data from a website on a weekly basis. The data becomes visible only after a click on the page, which calls a JS function. It is loaded into a table that can be found by id. The script will run on a server without browser support. This is my code with Geb:

    @Grab("org.gebish:geb-core:0.13.1")
    @Grab("org.seleniumhq.selenium:selenium-firefox-driver:2.52.0")
    @Grab("org.seleniumhq.selenium:selenium-support:2.52.0")
    @GrabExclude('org.codehaus.groovy:groovy-all')  

    import geb.Browser

    Browser.drive {
        // driver.webClient.javaScriptEnabled = true
        go "mysite"
        // call the page's own JS function that loads the data
        js.loadWeekData()
        println $("div.data-listing").text()
    }

I've searched a lot on this topic, but nothing I found worked for headless scraping with JS support. This is the recording from Selenium IDE:

driver.findElement(By.linkText("Next")).click();

I was not able to make PhantomJS work with Geb.

Edit 1: This is the error from PhantomJS: java.lang.NoClassDefFoundError: org/openqa/selenium/browserlaunchers/Proxies. I've read that it is a known problem with versions, but I was not able to resolve it.

@Grab("org.gebish:geb-core:0.13.1")
@Grab("org.seleniumhq.selenium:selenium-firefox-driver:2.52.0")
@Grab("org.seleniumhq.selenium:selenium-support:2.52.0")
@Grab("com.codeborne:phantomjsdriver:1.3.0")
import org.openqa.selenium.By
import org.openqa.selenium.WebDriver
import org.openqa.selenium.WebElement
import org.openqa.selenium.phantomjs.PhantomJSDriver

WebDriver driver = new PhantomJSDriver();

// Load Google.com
driver.get("http://www.google.com");
// Locate the Search field on the Google page
WebElement element = driver.findElement(By.name("q"));

In short, I need to run the first script in headless mode (if possible without installing Xvfb). A Groovy or Java solution is preferred.

In the end I'll use HtmlUnit, with code like this:

This code needs some cleanup, but in general it works. The main problem with HtmlUnit, its flood of warnings and errors, is solved by turning its logging off.

@Grab(group='net.sourceforge.htmlunit', module='htmlunit', version='2.21')

import com.gargoylesoftware.htmlunit.AlertHandler;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlButton;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.util.logging.Level;

java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);

WebClient webClient = new WebClient();
webClient.waitForBackgroundJavaScriptStartingBefore(10000);
HtmlPage currentPage = webClient.getPage("mysite");
/* HtmlButton button = (HtmlButton) currentPage.getElementById("tomorrow");
button.click(); */

//String javaScriptCode = "loadTomorrowTrain();";
String javaScriptCode = "loadYesterdayTrain();";

def result = currentPage.executeJavaScript(javaScriptCode);
//def result = page.executeJavaScript(javaScriptCode);
webClient.waitForBackgroundJavaScriptStartingBefore(10000);
println result.getJavaScriptResult();
println "result: " + result

def newpage = result.getNewPage()
def table = result.getNewPage().getElementById("training-days");
println table
def spans = currentPage.getByXPath("//div[@training-days]");
println spans
def spans1 = newpage.getByXPath("//div[@class='training-days']//a");
println spans1
def spans2 = currentPage.getByXPath("//div[@class='training-days']//a");
println spans2
def spans3 = currentPage.getByXPath("//table[@id='training']");
println spans3
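
The logging line in the script above relies on how java.util.logging resolves levels: a logger without its own level inherits the level of its nearest named ancestor, so switching off "com.gargoylesoftware" also silences every logger below it. The following is a minimal Java sketch of that behaviour using only the JDK. The class name QuietHtmlUnitLogging and the child logger name are made up for illustration, and it assumes HtmlUnit's output actually reaches java.util.logging (HtmlUnit logs through Apache Commons Logging, which typically falls back to java.util.logging when no other backend such as Log4j is on the classpath).

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietHtmlUnitLogging {
    // Keep a strong reference: java.util.logging may garbage-collect a logger
    // that nobody references, and the level set on it would be lost with it.
    private static final Logger HTMLUNIT_ROOT = Logger.getLogger("com.gargoylesoftware");

    /** Turns off all logging under the "com.gargoylesoftware" namespace. */
    static void silence() {
        HTMLUNIT_ROOT.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence();

        // Hypothetical child logger name, chosen only to show inheritance.
        Logger child = Logger.getLogger("com.gargoylesoftware.htmlunit.javascript");
        // The child has no level of its own, so it inherits OFF: prints false.
        System.out.println(child.isLoggable(Level.SEVERE));

        // Loggers outside that namespace are unaffected
        // (true under the default java.util.logging configuration).
        System.out.println(Logger.getLogger("org.example").isLoggable(Level.SEVERE));
    }
}
```

Holding the logger in a static field is a deliberate choice here: a bare `Logger.getLogger(...).setLevel(Level.OFF)` call usually works in a short script, but the setting can disappear if the logger is collected before HtmlUnit first uses it.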


 