I am parsing pages for email data. How can I get a hidden email address that is generated using JavaScript? This is the page I am parsing: a page. If you look at the HTML source (using Firebug or something else), you will see that the address is a link tag generated inside a div named sobi2Details_field_email, which is set to display:none. This is my code so far; the problem is with the email:
doc = Jsoup.connect(strLine).get();
Element e5 = doc.getElementById("sobi2Details_field_email");
if (e5 != null)
{
    emaildata = e5.child(1).absUrl("href");
}
System.out.println(emaildata);
You need several steps, because Jsoup cannot execute JavaScript. I reverse engineered the script, and this is what came out:
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HiddenEmail
{
    public static void main(final String[] args) throws IOException
    {
        final String url = "http://poslovno.com/kategorije.html?sobi2Task=sobi2Details&catid=71&sobi2Id=20001";
        final Document doc = Jsoup.connect(url).get();
        final Element e5 = doc.getElementById("sobi2Details_field_email");
        System.out.println("--- this is how we start");
        System.out.println(e5 + "\n\n\n\n");

        // decode the XML/HTML entities
        System.out.println("--- Remove XML encoding\n");
        String email = org.jsoup.parser.Parser.unescapeEntities(e5.toString(), false);
        System.out.println(email + "\n\n\n\n");

        // remove the JavaScript string concatenation (every ' + ')
        System.out.println("--- Remove concatenation (all: ' + ')");
        email = email.replaceAll("' \\+ '", "");
        System.out.println(email + "\n\n\n\n");

        // keep only the lines that build the email address variable
        System.out.println("--- Remove useless lines");
        Matcher matcher = Pattern.compile("var addy.*var addy", Pattern.MULTILINE | Pattern.DOTALL).matcher(email);
        if (!matcher.find())
        {
            throw new IllegalStateException("email script not found");
        }
        email = matcher.group();
        System.out.println(email + "\n\n\n\n");

        // get the two strings enclosed in '' and concatenate them
        System.out.println("--- Extract the email address");
        matcher = Pattern.compile("'(.*)'.*'(.*)'", Pattern.MULTILINE | Pattern.DOTALL).matcher(email);
        if (!matcher.find())
        {
            throw new IllegalStateException("email parts not found");
        }
        email = matcher.group(1) + matcher.group(2);
        System.out.println(email);
    }
}
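The same idea can be tried offline, without jsoup or a network call. The following is only a sketch under assumptions: it assumes the cloaking script assigns single-quoted fragments to a variable named addy followed by digits (inferred from the "var addy" pattern in the answer, not confirmed against the real page) and that the fragments use decimal character references such as &#64;. The class name CloakedEmailDecoder and the sample script in main are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CloakedEmailDecoder {

    // Decimal character references such as &#64; (hex references are not handled).
    private static final Pattern NUMERIC_ENTITY = Pattern.compile("&#(\\d+);");

    // A single-quoted JavaScript string literal without escaped quotes.
    private static final Pattern QUOTED = Pattern.compile("'([^']*)'");

    // Assumed shape: lines like "var addy123 = ...;" or "addy123 = ...;".
    private static final Pattern ADDY_ASSIGNMENT =
            Pattern.compile("^\\s*(?:var\\s+)?addy\\d+\\s*=(.*);\\s*$", Pattern.MULTILINE);

    // Replaces every decimal character reference with the character it encodes.
    public static String unescapeNumericEntities(String s) {
        Matcher m = NUMERIC_ENTITY.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            out.appendCodePoint(Integer.parseInt(m.group(1)));
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }

    // Joins all quoted fragments assigned to the addy variable, then decodes them.
    public static String decode(String script) {
        StringBuilder email = new StringBuilder();
        Matcher assignment = ADDY_ASSIGNMENT.matcher(script);
        while (assignment.find()) {
            Matcher literal = QUOTED.matcher(assignment.group(1));
            while (literal.find()) {
                email.append(literal.group(1));
            }
        }
        return unescapeNumericEntities(email.toString());
    }

    public static void main(String[] args) {
        // Hypothetical sample script, not copied from the page in the question.
        String script = String.join("\n",
                "var prefix = 'm&#97;&#105;lt&#111;:';",
                "var path = 'hr' + 'ef' + '=';",
                "var addy123 = '&#105;nf&#111;' + '&#64;';",
                "addy123 = addy123 + '&#101;x&#97;mpl&#101;' + '&#46;' + 'c&#111;m';",
                "document.write('<a ' + path + '\\'' + prefix + addy123 + '\\'>');");
        System.out.println(decode(script)); // prints info@example.com
    }
}
```

In practice the input to decode would be the script text inside the sobi2Details_field_email element. Collecting the quoted fragments first and decoding afterwards means an entity cannot introduce a stray quote that confuses the fragment matching. If the real script differs from the assumed shape, the patterns need adjusting.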
If something is generated dynamically with JavaScript on the client side after the response from the server is complete, then there is no other way than: