Can't connect to URLs ending with .tv using Jsoup

I tried to parse web pages on domains ending in .tv and .mobi, but every attempt fails with the same error. Jsoup easily parses websites ending in .com, .org, .in, etc., but not .tv or .mobi.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.jsoup.nodes.Element;

public class sample {

  public static void main(String[] args) throws IOException {

    // Fetch the page and print its <title>; this is the call that fails with the 403.
    Document doc = Jsoup.connect("http://www.xmovies8.tv").get();
    String title = doc.title();
    System.out.println(title);

  }

}

Stack trace:

Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://www.xmovies8.tv
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:598)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:548)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:235)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:224)
    at eric.sample.main(sample.java:30)
    /home/azeem/.cache/netbeans/8.1/executor-snippets/run.xml:53: Java returned: 1
    BUILD FAILED (total time: 3 seconds)

It also failed to parse:

Is there any solution in Jsoup so that I can connect to websites ending with .mobi, .tv, .xyz, etc.?

Your problem has nothing to do with the TLD of the domain you're trying to scrape. In fact, it has nothing to do with the domain name at all, or even with Jsoup.

If you read your stack trace, you will see that you're getting this response code:

HTTP 403 Forbidden, which according to the HTTP specification means that the web server received and understood your request and deliberately refused it.
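To read the status in code rather than from the stack trace, one option is to catch Jsoup's HttpStatusException. The following is a minimal sketch, not a definitive implementation: the class name StatusCheck and the describe helper are hypothetical names introduced here for illustration, and the hint strings are my own wording.

```java
import java.io.IOException;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StatusCheck {

    // Hypothetical helper (not part of Jsoup): maps a status code to a short hint.
    static String describe(int status) {
        if (status == 401) return "authentication required";
        if (status == 403) return "request understood but refused";
        if (status >= 400 && status < 500) return "client-side error";
        if (status >= 500 && status < 600) return "server-side error";
        return "not an error status";
    }

    public static void main(String[] args) throws IOException {
        try {
            Document doc = Jsoup.connect("http://www.xmovies8.tv").get();
            System.out.println(doc.title());
        } catch (HttpStatusException e) {
            // Jsoup throws this for HTTP error statuses such as 403,
            // unless ignoreHttpErrors(true) was set on the connection.
            int status = e.getStatusCode();
            System.err.println(status + " from " + e.getUrl() + ": " + describe(status));
        }
    }
}
```

Because HttpStatusException extends IOException, any other I/O failure (for example a DNS or timeout problem) still propagates out of main, which keeps the two kinds of failure easy to tell apart.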

Now, this could be for a number of reasons, all of which depend on the website you're trying to scrape.

It could be that the website sees that you are trying to scrape it and has explicitly gone out of its way to prevent that.

It could also be that the page requires a permission you don't have, or that you need to be logged in.

I also noticed that this particular domain uses Cloudflare, so it could be that Cloudflare is intercepting your request before it even reaches the website itself.

I would first make sure that scraping isn't against the website's policy. If it isn't, try changing the User-Agent header of your scraper from the default Java one to a normal browser User-Agent and see if that works.
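The User-Agent change can be sketched with Jsoup's userAgent method. This is only an illustration under stated assumptions: the class name BrowserLikeFetch, the newConnection helper, the 10-second timeout, and the exact browser string are all choices made here, and a site behind Cloudflare may still refuse the request.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BrowserLikeFetch {

    // Assumption: an example desktop browser User-Agent string; any current browser string could be used.
    static final String BROWSER_USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // Hypothetical helper: builds a connection that sends a browser-like User-Agent header.
    // Nothing is sent over the network until get() or execute() is called.
    static Connection newConnection(String url) {
        return Jsoup.connect(url)
                .userAgent(BROWSER_USER_AGENT)
                .timeout(10_000); // milliseconds
    }

    public static void main(String[] args) throws IOException {
        Document doc = newConnection("http://www.xmovies8.tv").get();
        System.out.println(doc.title());
    }
}
```

If the request is still refused with this header set, the block is probably not based on the User-Agent alone, and the other causes listed above become more likely.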
