Can't connect to URLs ending with .tv using Jsoup

Question

I tried to parse web pages on .tv and .mobi domains, but every attempt ends with the same error. Jsoup can easily parse websites ending with .com, .org, .in, etc., but not .tv or .mobi.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.jsoup.nodes.Element;


public class sample {

  public static void main(String[] args) throws IOException {

    Document doc = Jsoup.connect("http://www.xmovies8.tv").get();
    String title = doc.title();
    System.out.println(title);

  }

}

Stack trace:

Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://www.xmovies8.tv
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:598)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:548)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:235)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:224)
    at eric.sample.main(sample.java:30)
    /home/azeem/.cache/netbeans/8.1/executor-snippets/run.xml:53: Java returned: 1
    BUILD FAILED (total time: 3 seconds)

It also failed to parse:

http://www.xmovies8.tv
www.fztvseries.mobi

Is there any solution in Jsoup so that I can connect to websites ending with .mobi, .tv, .xyz, etc.?

Answer 1

Your problem has nothing to do with the TLD of the domain you're attempting to scrape. In fact, it has nothing to do with the domain name at all, or even with Jsoup.

If you read your stack trace, you will see you're getting a response code of:

HTTP 403 Forbidden, which according to the HTTP specification means the web server received your request and deliberately refused it.

Now, this could be for a number of reasons that all depend on the website you're trying to scrape.

It could be that the website detects that you are scraping and has explicitly gone out of its way to prevent it.

It could also be that the page requires a permission you don't have, or that you need to be logged in.

I also noticed that this particular domain uses CloudFlare, so it could be that CloudFlare is intercepting your request before it even reaches the website itself.

I would first make sure that scraping isn't against the website's policy. If it isn't, try changing the User-Agent header of your scraper from the Java default to that of a normal browser and see if it works.
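
As a rough sketch of that last suggestion, something like the following sets a browser-like User-Agent through Jsoup's Connection API and prints the status code instead of throwing. It assumes the 403 is triggered by the default Java User-Agent, which may not be the case (for example, if CloudFlare is serving a JavaScript challenge, this alone won't help). The class name, the browserLike helper, the timeout, and the User-Agent string are placeholders chosen for illustration, not anything specific to this site.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BrowserLikeFetch {

    // Assumption: any realistic desktop browser string will do; this one is only an example.
    static final String BROWSER_USER_AGENT =
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        + "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36";

    // Hypothetical helper: builds a connection that identifies itself as a browser.
    static Connection browserLike(String url) {
        return Jsoup.connect(url)
                .userAgent(BROWSER_USER_AGENT) // replaces the default Java User-Agent
                .timeout(10_000)               // milliseconds; arbitrary choice
                .ignoreHttpErrors(true);       // return 4xx/5xx responses instead of throwing
    }

    public static void main(String[] args) throws IOException {
        Connection.Response response = browserLike("http://www.xmovies8.tv").execute();

        // Print the status so a remaining 403 is visible rather than hidden.
        System.out.println("Status: " + response.statusCode() + " " + response.statusMessage());

        if (response.statusCode() == 200) {
            Document doc = response.parse();
            System.out.println(doc.title());
        }
    }
}
```

If the status is still 403 with a browser User-Agent, the block is probably based on something else (cookies, a login, or a CloudFlare check), and changing headers in Jsoup is unlikely to be enough.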

Answer 1: 0 ACCEPTED 2016-12-12 19:51:10
