Yes, I have tried researching this countless times. I've been trying to make a scraper for whitepages as a test to show how easy it is to collect public information.
My current mess of code:
package whitescraper;

import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.Connection.Method;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class whitescraper {
    public static void main(String[] args) {
        try {
            /* Document doc = Jsoup.connect("https://www.whitepages.com/phone/1-314-677-6077").ignoreHttpErrors(true).maxBodySize(0).get();
            Elements elements = doc.select(".phone-details");
            System.out.println(elements);
            */
            Connection.Response res = Jsoup.connect("https://www.whitepages.com/")
                    //.data("email", "oldemailetc", "pass", "passwordarea")
                    .method(Method.POST)
                    .execute();
            Map<String, String> cookies = res.cookies();
            Document doc2 = Jsoup.connect("https://www.whitepages.com/phone/1-314-677-6077")
                    .cookies(cookies)
                    .ignoreHttpErrors(true)
                    .maxBodySize(0)
                    //.cookie("_whitepages_session", "OUFKdExxR2JEUUdiZCtXM3JsZ2o566bushdid911N2b1h0VVI3S08wdUx2dDVBcGZSNDVRZlBKMG1DZXZyNFVxdDhaQjZIcVFPUGh5TUZuczJxalg5Q1NJL0xibVdYcTBsQmRMbjZpcWdXZi9vZmNoMmtJT0xMbW9jaFpRKzhRNGhHR0N5aVhxVkJEQzVtYzRwejdKZ3k4SWEzYXNRU0I2TnMwWXBsNDBCZVV6SnlyOFJ0bzNCd3FlRmtBaTZ2SDJRZERKQzNGVTA5NlU5azNubVg2VmtmMDdPb3p2dEZNMD0tLWFNbyt0dTJWQ1F4ano5OHEwVHVIY3c9PQ%3D%3D--4b35f34b72d3b1dd978dc8580749c41dc93e0d7a")
                    .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                    .referrer("http://www.google.com")
                    //.timeout(10000)
                    .get();
            //Document doc2 = Jsoup.parse("http://www.whitepages.com/phone/1-314-677-6077");
            System.out.println(doc2.html());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
If I just request the page over http, without any cookies or anything else, I get different HTML from the page I want.
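One way to narrow down why the HTML differs is to look at the raw response before parsing it. The following is only a debugging sketch, not a confirmed fix: the class name `ResponseProbe` and the helper `summarize` are made up for illustration, and the user agent is simply the one from my code above. It prints the status code and the final URL, which should show whether the site is redirecting or returning an error page.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ResponseProbe {

    // Hypothetical helper: formats the response details so a redirect or error page is easy to spot.
    static String summarize(String requestedUrl, int status, String finalUrl, int bodyLength) {
        boolean redirected = !requestedUrl.equals(finalUrl);
        return "status=" + status
                + " redirected=" + redirected
                + " finalUrl=" + finalUrl
                + " bodyLength=" + bodyLength;
    }

    public static void main(String[] args) throws IOException {
        String url = "https://www.whitepages.com/phone/1-314-677-6077";
        Connection.Response res = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                .ignoreHttpErrors(true)   // keep the body even when the status is 4xx/5xx
                .maxBodySize(0)           // do not truncate the body
                .method(Connection.Method.GET)
                .execute();

        // res.url() is the URL after any redirects jsoup followed.
        System.out.println(summarize(url, res.statusCode(), res.url().toString(), res.body().length()));
    }
}
```

If the status is not 200, or the final URL differs from the requested one, the selector is probably not the problem.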
I've tried these:
Jsoup, http error 416, parsing HTML
https://groups.google.com/forum/#!topic/jsoup/54X6vcbdEUg
My current code is a mix of 50+ different attempts. At first I thought I was parsing the page wrong and looking for a class that doesn't exist, but then I tried it on Try jsoup and it worked perfectly. If anyone could clarify the problem, I'd be very grateful.
Possible problems?
- Missing the correct cookies
- Using http where the site expects https, or the other way around
- Not selecting the class correctly
Please help; I haven't used jsoup in over a year and it's kicking my butt.
I also have similar code for FB where I actually log in and view a page correctly (that's why I tried a login-style request against whitepages as a test, even without a login page), but because of the limit on the number of requests and the slowness there, I decided to try whitepages instead.
So the first part I had commented out did work, but for some reason the .com site wouldn't give me access to the page. Literally all I had to do was change .com to their .ca domain. 2 characters for 3 hours of debugging, fml. I would still love for someone to find a way to use the .com domain, though. Working code below.
Document doc = Jsoup.connect("http://www.whitepages.ca/phone/1-314-677-6077").ignoreHttpErrors(true).maxBodySize(0).get();
Element element = doc.select(".phone-details").first();
System.out.println(element.text());
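One caveat with the snippet above: `first()` returns null when nothing matches `.phone-details`, so a blocked or different page would throw a NullPointerException on `element.text()`. Below is a minimal null-safe sketch. The class name `PhoneDetails`, the method `extract`, and the sample markup are made up for illustration; the markup only stands in for the real page, whose structure I am not reproducing here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PhoneDetails {

    // Returns the text of the first .phone-details element, or "" if the page has none.
    static String extract(Document doc) {
        Element element = doc.select(".phone-details").first();
        return element == null ? "" : element.text();
    }

    public static void main(String[] args) {
        // Made-up markup, parsed offline, to check the selector without hitting the site.
        Document withDetails = Jsoup.parse("<div class=\"phone-details\">Landline</div>");
        Document withoutDetails = Jsoup.parse("<p>blocked</p>");

        System.out.println(extract(withDetails));
        System.out.println(extract(withoutDetails).isEmpty());
    }
}
```

Parsing a saved copy of the page this way also separates "the selector is wrong" from "the server sent a different page".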