Jsoup page not giving correct HTML

Yes, I have researched this countless times. I have been trying to write a scraper for whitepages as a test, to show how easy it is to collect public information.

My current mess of code:

package whitescraper;

import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.Connection.Method;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class whitescraper {
public static void main(String args[]){

    try {
    /*  Document doc = Jsoup.connect("https://www.whitepages.com/phone/1-314-677-6077").ignoreHttpErrors(true).maxBodySize(0).get();
        Elements elements = doc.select(".phone-details");
        System.out.println(elements);
    */

        Connection.Response res = Jsoup.connect("https://www.whitepages.com/")
                //.data("email", "oldemailetc", "pass", "passwordarea")
                .method(Method.POST)
                .execute();

        // Reuse whatever cookies the first response set.
        Map<String, String> cookies = res.cookies();

        Document doc2 = Jsoup.connect("https://www.whitepages.com/phone/1-314-677-6077")
                    .cookies(cookies)
                    .ignoreHttpErrors(true)
                    .maxBodySize(0)
                    //.cookie("_whitepages_session", "OUFKdExxR2JEUUdiZCtXM3JsZ2o566bushdid911N2b1h0VVI3S08wdUx2dDVBcGZSNDVRZlBKMG1DZXZyNFVxdDhaQjZIcVFPUGh5TUZuczJxalg5Q1NJL0xibVdYcTBsQmRMbjZpcWdXZi9vZmNoMmtJT0xMbW9jaFpRKzhRNGhHR0N5aVhxVkJEQzVtYzRwejdKZ3k4SWEzYXNRU0I2TnMwWXBsNDBCZVV6SnlyOFJ0bzNCd3FlRmtBaTZ2SDJRZERKQzNGVTA5NlU5azNubVg2VmtmMDdPb3p2dEZNMD0tLWFNbyt0dTJWQ1F4ano5OHEwVHVIY3c9PQ%3D%3D--4b35f34b72d3b1dd978dc8580749c41dc93e0d7a")
                    .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                    .referrer("http://www.google.com")  
                    //.timeout(10000)
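                    .get();

        // Jsoup.parse(String) treats its argument as HTML markup, not as a URL, so this attempt could not fetch the page.
        //Document doc2 = Jsoup.parse("http://www.whitepages.com/phone/1-314-677-6077");

        System.out.println(doc2.html());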
    }catch(Exception e){
        e.printStackTrace();
    }
}
}

If I just request the page over http, without any cookies or other settings, the HTML I get back is different from the page I want.

I've tried these:

Jsoup, http error 416, parsing HTML

https://groups.google.com/forum/#!topic/jsoup/54X6vcbdEUg

My current code is a mix of 50+ different attempts. At first I thought I was parsing the page wrong and looking for a class that doesn't exist, but then I tried the same thing on Try jsoup and it worked perfectly. If anyone could clarify the problem, I'd be very grateful.

Possible problems?

- Missing the correct cookies
- Using http where https is needed (or the other way around)
- Not selecting the class correctly

Please help. I haven't used jsoup in over a year and it's kicking my butt.
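One way to narrow down the possibilities listed above is to look at the raw response before any parsing happens. Note that `.ignoreHttpErrors(true)` tells jsoup not to throw on 4xx/5xx responses, so an error or block page can be parsed as if it were the real one. The sketch below is only an illustration: it uses the JDK's `java.net.http.HttpClient` (Java 11+) instead of jsoup so that it runs without extra dependencies, and the class name `FetchDiagnostics` and the helper `summarize` are made up for this example. It reports the status code, the final URL after redirects, and whether the text `phone-details` appears anywhere in the raw body.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class, not part of jsoup or of the original code.
class FetchDiagnostics {

    // Builds a one-line summary: status code, final URL, and whether the marker text is in the raw body.
    static String summarize(int status, String finalUrl, String body, String marker) {
        boolean found = body != null && body.contains(marker);
        return "status=" + status + " url=" + finalUrl + " marker=" + (found ? "present" : "missing");
    }

    // Fetches the URL with a browser-like User-Agent and summarizes the response.
    static String fetchAndSummarize(String url, String marker) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return summarize(response.statusCode(), response.uri().toString(), response.body(), marker);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchAndSummarize("https://www.whitepages.com/phone/1-314-677-6077", "phone-details"));
    }
}
```

A non-200 status, or a final URL that differs from the requested one, would suggest the site is rejecting or redirecting the request rather than the selector being wrong. A 200 status with the marker missing would point more towards content that is filled in by JavaScript or served differently to non-browser clients.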

I also have similar code for FB where I actually log in and view a page correctly (that's why I tried "logging in" to whitepages as a test, even though there is no login page). But because of the limit on the number of requests and the slowness there, I decided to try whitepages instead.

So the first part, the one I had commented out, did work as code, but for some reason the site wouldn't give me access to the page. Literally all I had to do was change .com to their .ca domain. 2 characters for 3 hrs of error solving, fml. I'd still love for someone to find a way to use the .com domain, though. Working code below.

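Document doc = Jsoup.connect("http://www.whitepages.ca/phone/1-314-677-6077").ignoreHttpErrors(true).maxBodySize(0).get();
// first() returns null when nothing matches the selector, so guard before calling text()
Element element = doc.select(".phone-details").first();
System.out.println(element != null ? element.text() : "no .phone-details element found");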
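Since the fix came down to requesting the same path on a different host, a small helper can make it easier to try alternative domains without retyping URLs. This is only a sketch using the JDK's `java.net.URI`. The class `HostSwap` and the method `withHost` are names invented for this example, and it keeps the scheme unchanged (the working code above also happens to use http rather than https).

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class, not part of jsoup or of the original code.
class HostSwap {

    // Returns the same URL with only the host replaced; scheme, port, path, query and fragment are kept.
    static String withHost(String url, String newHost) {
        try {
            URI u = new URI(url);
            return new URI(u.getScheme(), u.getUserInfo(), newHost, u.getPort(),
                    u.getPath(), u.getQuery(), u.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        String com = "https://www.whitepages.com/phone/1-314-677-6077";
        // Prints https://www.whitepages.ca/phone/1-314-677-6077
        System.out.println(withHost(com, "www.whitepages.ca"));
    }
}
```

The resulting string can be passed straight to `Jsoup.connect(...)`, which makes it easy to compare what each domain returns for the same path.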