简体   繁体   中英

Jsoup get hidden email

I am parsing pages for email data . How would I get a hidden email - which is generated using JavaScript .This is the page I am parsing a page If you would take a look on the html source(using firebug or something else) you would see that it is a link tag generated inside div named sobi2Details_field_email and set to be display:none . This is my code for now , but the problem is with email

 doc = Jsoup.connect(strLine).get();
 Element e5=doc.getElementById("sobi2Details_field_email");

if(e5!=null)
 {
 emaildata=e5.child(1).absUrl("href").toString();

 }
  System.out.println (emaildata);

You need to do several steps because Jsoup doesn't allow you to execute JavaScript. I reverse engineered it and this is what came out:

public static void main(final String[] args) throws IOException
{
    final String url = "http://poslovno.com/kategorije.html?sobi2Task=sobi2Details&catid=71&sobi2Id=20001";
    final Document doc = Jsoup.connect(url).get();
    final Element e5 = doc.getElementById("sobi2Details_field_email");

    System.out.println("--- this is how we start");
    System.out.println(e5 + "\n\n\n\n");

    // remove the xml encoding
    System.out.println("---Remove XML encoding\n");
    String email = org.jsoup.parser.Parser.unescapeEntities(e5.toString(), false);
    System.out.println(email + "\n\n\n\n");

    // remove the concatunation with ' + '
    System.out.println("--- Remove concatunation (all: ' + ')");
    email = email.replaceAll("' \\+ '", "");
    System.out.println(email + "\n\n\n\n");

    // extract the email address variables
    System.out.println("--- Remove useless lines");
    Matcher matcher = Pattern.compile("var addy.*var addy", Pattern.MULTILINE + Pattern.DOTALL).matcher(email);
    matcher.find();

    email = matcher.group();
    System.out.println(email + "\n\n\n\n");

    // get the to string enclosed by '' and concatunate
    System.out.println("--- Extract the email address");
    matcher = Pattern.compile("'(.*)'.*'(.*)'", Pattern.MULTILINE + Pattern.DOTALL).matcher(email);
    matcher.find();

    email = matcher.group(1) + matcher.group(2);
    System.out.println(email);

}

If something is generated dynamicly with javascript on client side after response from server is complete, that there is no other way than:

  1. Reverse engineering - figure out what does server side script do, and try to implement same behaviour
  2. Download javascript from processed page, and use java's javascript processor to execute such script and get result (yeah, it is possible, and i was forced to do such thing). Here you have basic example showing how to evaluate javascript in java.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM