Crawler4j not working for https urls

I am developing a Grails app using crawler4j.

I know this is an old question, and I came across this solution here.

I tried the solution provided, but I am not sure where to put the additional fetcher and mock SSL Java files.

Also, I am not sure how these two classes get called for URLs containing https://...

Thanks in advance.

The solution works fine. Maybe you are having trouble working out where to put the code. Here is how I use it:

When creating the crawler, you will have something like this in your main class, as shown in the official documentation:

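import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root"; // example path, use any writable folder

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        // The custom fetcher only installs its SSL factory when HTTPS pages are enabled.
        config.setIncludeHttpsPages(true);

        /*
         * Instantiate the controller for this crawl.
         * The custom fetcher replaces the usual new PageFetcher(config).
         */
        PageFetcher pageFetcher = new MockSSLSocketFactory(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        ....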

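Here you use MockSSLSocketFactory, which is defined as shown in the link you posted. Despite its name, it extends PageFetcher, which is why it can be passed to RobotstxtServer and CrawlController in place of the default fetcher: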

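import org.apache.http.conn.scheme.Scheme;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;

public class MockSSLSocketFactory extends PageFetcher {

    public MockSSLSocketFactory(CrawlConfig config) {
        super(config);

        if (config.isIncludeHttpsPages()) {
            try {
                // Replace the default "https" scheme with one backed by SimpleSSLSocketFactory.
                httpClient.getConnectionManager().getSchemeRegistry().unregister("https");
                httpClient.getConnectionManager().getSchemeRegistry()
                        .register(new Scheme("https", 443, new SimpleSSLSocketFactory()));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}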

As you can see, it registers the class SimpleSSLSocketFactory for the https scheme, so the HTTP client uses it for any URL that starts with https://. That class can be defined as shown in the example in the link:

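import java.io.IOException;
import java.security.KeyManagementException;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;
import java.security.UnrecoverableKeyException;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

import javax.net.ssl.SSLException;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;

import org.apache.http.conn.ssl.SSLSocketFactory;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.conn.ssl.X509HostnameVerifier;

public class SimpleSSLSocketFactory extends SSLSocketFactory {

    // Accepts every host name: all verify methods are no-ops.
    private static final X509HostnameVerifier hostnameVerifier = new X509HostnameVerifier() {
        @Override
        public void verify(String host, SSLSocket ssl) throws IOException {
            // Do nothing
        }

        @Override
        public void verify(String host, String[] cns, String[] subjectAlts) throws SSLException {
            // Do nothing
        }

        @Override
        public boolean verify(String s, SSLSession sslSession) {
            return true;
        }

        @Override
        public void verify(String host, X509Certificate cert) throws SSLException {
            // Do nothing
        }
    };

    // Trusts every certificate chain, including self-signed and expired ones.
    private static final TrustStrategy trustStrategy = new TrustStrategy() {
        @Override
        public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
            return true;
        }
    };

    public SimpleSSLSocketFactory() throws NoSuchAlgorithmException, KeyManagementException,
            KeyStoreException, UnrecoverableKeyException {
        super(trustStrategy, hostnameVerifier);
    }
}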

As you can see, I am only copying code from the official documentation and the link you posted, but I hope that seeing it all together makes it clearer for you.
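
If it helps to see the idea without the crawler4j and HttpClient classes, the same trust-everything setup can be sketched with JDK classes only. This is a minimal sketch and not part of the linked solution: the names TrustAllSsl, TRUST_ALL, ALLOW_ALL_HOSTS and trustAllContext are made up for illustration. Like the classes above, it accepts every certificate and every host name, so it only makes sense where certificate validation does not matter.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper (an assumption for illustration), not a crawler4j or HttpClient class.
final class TrustAllSsl {

    // Plays the role of hostnameVerifier above: accepts any host name.
    static final HostnameVerifier ALLOW_ALL_HOSTS = (hostname, session) -> true;

    // Plays the role of trustStrategy above: never rejects a certificate chain.
    static final X509TrustManager TRUST_ALL = new X509TrustManager() {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // Do nothing
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // Do nothing
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    };

    // Builds an SSLContext whose only trust manager is TRUST_ALL.
    static SSLContext trustAllContext() {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new TrustManager[] { TRUST_ALL }, new SecureRandom());
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not build trust-all SSLContext", e);
        }
    }

    private TrustAllSsl() {
    }
}
```

With plain HttpsURLConnection you would pass trustAllContext().getSocketFactory() to HttpsURLConnection.setDefaultSSLSocketFactory and ALLOW_ALL_HOSTS to HttpsURLConnection.setDefaultHostnameVerifier before opening connections.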
