
<input> values missing when parsing HTML with Jsoup?

I am trying to scrape web pages with Jsoup, but Jsoup doesn't seem to capture the same <input> elements that Chrome shows.

It is missing values such as these:

<input type="hidden" id="fileId" value="3168935269">
<input type="hidden" id="secondsLeft" value="20">

Using Jsoup I extracted these elements:

<input type="hidden" class="jsItemDirId" value="yRg1N-QP" />

<input type="hidden" class="jsItemFileId" value="i-EbooI0" />

<input type="hidden" id="fbAppId" value="255519317820035" />

<input type="hidden" id="sPrefix" value="http://search.4shared.com" />

<input type="hidden" class="sLink file" value="/q/CCAD/1" />

<input type="hidden" class="sLink video" value="/q/CCQD/1/video" />

<input type="hidden" class="sLink music" value="/q/CCQD/1/music" />

<input type="hidden" class="sLink photo" value="/q/CCQD/1/photo" />

<input type="hidden" class="sLink games" value="/q/CCQD/1/game" />

<input type="hidden" class="sLink book" value="/q/CCQD/1/books_office" />

<input type="hidden" class="sLink featured_videos" value="/q/CCQD/1/video" />

<input type="hidden" id="sBreadcrumbsPhrase" value="Searching" />

<input type="text" id="searchQuery" placeholder="Search files" />

<input type="hidden" id="interval" value="600000" />

<input type="hidden" id="archiveReadyDownload" value="Your file is ready for download:" />

<input type="hidden" id="defAvatar" value="http://static.4shared.com/images/user2.png?ver=2906097813" />

<input type="hidden" id="zipAvatar" value="http://static.4shared.com/icons/32x32/zip.png?ver=655479399" />

<input type="hidden" id="b1Avatar" value="http://static.4shared.com/icons/32x32/b1.png?ver=703417425" />

<input type="hidden" id="torrentAvatar" value="http://static.4shared.com/icons/32x32/torrent.png?ver=1628575404" />

<input type="hidden" id="contactRequestText" value="Your friend $[p1] just joined 4shared." />

<input type="button" value="Ok" onclick="checkAndStartDownload(event);" style="width:80px" />

<input type="button" value="Cancel" onclick="hideTermsOfUse();" />

<input type="hidden" id="startTitle" value="Share" />

<input type="hidden" id="sharingFolderTitle" value="Share folder" />

<input type="hidden" id="sharingFileTitle" value="Share file" />

<input type="hidden" id="placeHolderEnterEmailAdresses" value="Enter names or e-mail addresses" />

<input type="hidden" id="dLinkPay" value="Direct link is available only for Premium Users.&lt;br&gt; Sign Up to premium account to get all 4shared Premium Features." />

<input type="hidden" id="premiumRequired" value="Premium account required!" />

<input type="hidden" id="hosted" value="Hosted at" />

<input type="hidden" id="fbInviteFolderTitle" value="I've shared a folder with you on 4shared. Find out what it is!" />

<input type="hidden" id="fbInviteFileTitle" value="I've shared a file with you on 4shared. Find out what it is!" />

<input type="hidden" id="contacts" value="Contacts" />

<input type="hidden" id="fb_share_folder_img" value="http://static.4shared.com/images/facebook/share_folder.png?ver=2422162001" />

<input type="hidden" id="fb_share_file_img" value="http://static.4shared.com/images/facebook/share_file.png?ver=1565381062" />

<input type="hidden" id="fb_redir_param" value="https://www.4shared.com/servlet/signin/facebook?fp=https://www.4shared.com/account/home.jsp" />

<input type="hidden" id="fileSuccessfullSent" value="Your file was successfully sent" />

<input type="hidden" id="folderSuccessfullSent" value="Your folder was successfully sent" />

<input type="hidden" id="fbRequestSharedText" value="I'd like to share $[p0] with you" />

<input type="hidden" id="fbSharingOff" value="null" />

<input type="hidden" id="fbInviteText" value="4shared.com - free web-based file sharing and storage." />

<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />

<input type="radio" class="writeFlag" name="permissions" value="write" />

<input class="lucida dark-gray selectable" id="simpleViewLink" type="text" readonly="readonly" />

<input type="text" id="emails" class="lucida f12 dark-gray tags gaClick" data-element="shF-2-1" name="emails" tabindex="3" />

<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />

<input type="radio" class="writeFlag" name="permissions" value="write" />

<input type="text" id="downloadFileLink" class="lucida f12 selectable" name="" tabindex="3" />

<input type="text" class="lucida f12 dark-gray selectable" name="" tabindex="4" value="" id="premiumDirectLink" />

<input type="text" class="lucida f12 selectable" id="fileHTMLembed" name="" tabindex="3" />

<input type="text" id="fileForumEmbed" class="lucida f12 selectable" name="" tabindex="4" />

<input type="text" class="lucida f12 selectable" id="fileEmbed" tabindex="5" />

<input class="lucida f12 dark-gray selectable" id="searchFriendsInput" type="text" placeholder="Search by name or e-mail address" />

<input id="tags_2" type="text" class="tags" />

<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />

<input type="radio" class="writeFlag" name="permissions" value="write" />

<input type="radio" class="readFlag" name="permissions" value="read" checked="checked" />

<input type="radio" class="writeFlag" name="permissions" value="write" />

<input type="text" class="lucida f12 ffshadow dark-gray" name="" tabindex="4" value="" id="subdomainInput" />

<input type="text" class="lucida f12 ffshadow dark-gray" name="" tabindex="3" value="" id="subdomainValue" readonly="true" />

<input type="hidden" id="allreadyPasswordProtectedMess" value="You can't set password for this folder, because the parent folder '$[1]' is password protected." />

<input type="hidden" id="passwordChangeConfirmTitle" value="Password Change" />

<input type="hidden" id="passwordChangeConfirmBody" value="Some child directory already password protected. &lt;br/&gt; Changing password of current directory will cause password overwrite on children's " />

<input type="hidden" id="confirmButtonMsg" value="Change" />

<input type="hidden" id="cancelButtonMsg" value="Cancel" />

<input type="text" class="passInput lucida f12" name="" tabindex="4" value="" id="passwordInput" />

<input type="password" class="passInput lucida f12" name="" tabindex="4" value="" id="changePasswordInput" readonly="true" />

<input type="hidden" id="previewLinkForEmbed" />

<input type="hidden" id="previewLinkForWidget" />

<input class="lucida f12 dark-gray" id="widget_width" type="text" style="width:30px;" />

<input class="lucida f12 dark-gray" id="widget_height" type="text" style="width:30px;" />

<input type="text" class="lucida f12 dark-gray selectable" name="" tabindex="3" id="htmlEmbed" />

<input type="text" class="lucida f12 dark-gray selectable" name="" tabindex="4" id="forumEmbed" />

<input type="text" value="http://www.4shared.com/android/i-EbooI0/batman_hd.html" readonly="readonly" onclick="this.focus();this.select()" class="field1 gaClick" data-element="16" dir="ltr" />

<input type="text" value="&lt;a href=&quot;http://www.4shared.com/android/i-EbooI0/batman_hd.html&quot; target=_blank&gt;batman hd.apk&lt;/a&gt;" readonly="readonly" onclick="this.focus();this.select()" class="field1 gaClick" data-element="17" dir="ltr" />

<input type="text" value="[URL=http://www.4shared.com/android/i-EbooI0/batman_hd.html]batman hd.apk[/URL]" readonly="readonly" onclick="this.focus();this.select()" class="field1 gaClick" data-element="18" dir="ltr" />

<input type="hidden" name="showComments" value="true" />

<input type="hidden" name="showPart" value="commentList" />

<input type="hidden" name="replyId" value="" />

<input type="hidden" id="norecaptcha" name="norecaptcha" value="" />

<input type="hidden" name="start" value="0" />

<input id="submitCommBtn" type="submit" value="Add New Comment" class="gaClick floatLeft f11 marginT10 round4 lucida no-line sendCommentButton" data-element="32" />

<input type="text" class="input-gray-big wide round4" id="recaptcha_response_field" name="recaptcha_response_field" style="width:250px" />

<input class="field2" id="submitCommBtn" type="submit" value="Confirm" />

<input type="text" name="fileName" value="4shared" class="xBox" />

<input type="hidden" name="newValue" value="" />

<input type="hidden" name="mode" value="" />

<input type="hidden" name="fid" value="3168935269" />

<input type="hidden" name="mode" value="3" />

<input type="hidden" name="fid" value="3168935269" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />

<input type="hidden" name="mode" value="3" />

<input type="hidden" name="fid" value="3168935269" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />

<input type="hidden" name="mode" value="3" />

<input type="hidden" name="fid" value="3168935269" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />

<input type="text" name="newValue" class="xBox" style="width:200px" />

<input type="hidden" name="mode" value="2" />

<input type="hidden" name="fid" value="3168935269" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12" onclick="quickEditCancel(1)" />

<input type="text" name="newValue" class="xBox" style="width:330px" onkeypress="return quickEditIsValidCharForFileName(event);" />

<input type="hidden" name="mode" value="10" />

<input type="hidden" name="fid" value="3168935269" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatRight marginL10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatRight" onclick="quickEditCancel();" />

<input type="hidden" name="mode" value="3" />

<input type="hidden" name="did" value="0" />

<input type="submit" value="Save" class="bluePopupButton marginT15 round5 f12 floatLeft marginR10" />

<input type="button" value="Cancel" class="grayPopupButton marginT15 round5 f12 floatLeft" onclick="quickEditCancel(1)" />

<input type="text" name="searchName" style="width:250px;padding:1px 0" class="ajax-suggestion field gaClick" data-element="fs1" autocomplete="off" />

<input type="submit" name="submitButton" value="Search" class="button gaClick" data-element="fs3" />

<input type="hidden" name="searchmode" value="2" />

Using try.jsoup.com did not yield these inputs the way Chrome does either, which suggests that the problem is not my code but Jsoup itself.

Reading through other threads suggests that JavaScript may be changing the HTML after the page loads. None of them offered a viable fix.

What am I doing wrong and how do I fix it?

This is my code for getting the full html page:

Document doc = Jsoup.connect("http://www.4shared.com/get/i-EbooI0/batman_hd.html").timeout(0).get();
System.out.println(doc.toString() + "\n\n\n\n");
Elements links = doc.select("input[type=hidden]");
for (org.jsoup.nodes.Element link : links) {
    System.out.println(link);
}

View Screenshot of needed values here


SOLUTION

Connection.Response response = Jsoup.connect("myUrl")
    .method(Connection.Method.GET)
    .execute();

Document homePage = Jsoup.connect("myUrl")
    .cookies(response.cookies())
    .get();

This is a modified version of the code described in "Jsoup Cookies for HTTPS scraping". It fetches the cookies, as suggested by Niranjan, and then reconnects to your URL with them.

Jsoup cleans up the HTML while parsing, and it can handle markup that is not well-formed. Try dumping the HTML after parsing, i.e. Document.html(), and check whether the missing elements appear in the dump and match your select clause.

UPDATE

Here you go, try this out. I'll explain what is going on if it works!

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The class name is only a wrapper so that the snippet compiles on its own.
public class HiddenInputs
{
    public static void main(String[] args)
    {
        try
        {
            // Cookies copied from a browser session against the site (see the explanation below).
            Map<String, String> cookieMap = new HashMap<String, String>();
            cookieMap.put("day1host", "h");
            cookieMap.put("d1.loginity.mark", "1");
            cookieMap.put("hostid", "-1314014314");
            cookieMap.put("__qca", "P0-2042580316-1371938383086");
            cookieMap.put("cd1v", "OOhB");
            cookieMap.put("c29", "1");
            cookieMap.put("__utma", "210074320.280144312.1371938377.1371938377.1371938377.1");
            cookieMap.put("__utmb", "210074320.4.10.1371938377");
            cookieMap.put("__utmc", "210074320");
            cookieMap.put("__utmz", "210074320.1371938377.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)");

            // Send the request the way the browser does: same user agent, same cookies.
            Document document = Jsoup.connect("http://www.4shared.com/get/i-EbooI0/batman_hd.html")
                    .userAgent("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36")
                    .followRedirects(true)
                    .cookies(cookieMap)
                    .get();
            //System.out.println(document.html());
            //System.out.println("====================================================================");

            // Print every hidden input found in the parsed page.
            Elements elements = document.select("input[type=hidden]");
            for (Element element : elements)
            {
                System.out.println(element);
            }
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

EXPLANATION

I'm not sure whether the pattern below is the same for all the URLs you are trying.

This is how the site responds:

  1. There is a redirect from /get/i-EbooI0/batman_hd.html to /android/i-EbooI0/batman_hd.html. While redirecting, the site sends 2 cookies in the response to the 1st request.

    [Screenshot: first request]

  2. A few more cookies arrive on the 2nd request.

    [Screenshot: second request]

    There are no hidden fields in the <body> yet. You can confirm this in the Elements tab.

  3. Now request http://www.4shared.com/get/i-EbooI0/batman_hd.html in the browser.

    [Screenshot: third request]

    Now you have the required hidden fields in the <body>.

I'm performing step 3 directly in the code.
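If you want to reproduce the trace above without the browser's developer tools, something along these lines might help. It is only a sketch and it swaps Jsoup for the JDK's own java.net.http.HttpClient (Java 11+), with redirects switched off so that each hop and the cookies it sets are printed separately. The class name RedirectTraceSketch, the helper names and the five-hop limit are my own choices, not anything the site or Jsoup prescribes, and the cookie parsing is deliberately naive (it ignores attributes such as Path and Expires).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

final class RedirectTraceSketch {

    /** Reduces raw Set-Cookie header values to name/value pairs, dropping attributes. */
    static Map<String, String> parseSetCookie(List<String> headerValues) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String header : headerValues) {
            String pair = header.split(";", 2)[0].trim(); // "name=value" comes before the first ';'
            int eq = pair.indexOf('=');
            if (eq > 0) {
                cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return cookies;
    }

    /** Renders the collected cookies as the value of a Cookie request header. */
    static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        cookies.forEach((name, value) -> joiner.add(name + "=" + value));
        return joiner.toString();
    }

    public static void main(String[] args) throws Exception {
        String url = "http://www.4shared.com/get/i-EbooI0/batman_hd.html";
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NEVER) // show every hop instead of hiding it
                .build();
        Map<String, String> jar = new LinkedHashMap<>();

        // Assumption: five hops is enough; stop earlier when there is no Location header.
        for (int hop = 1; hop <= 5 && url != null; hop++) {
            HttpRequest.Builder request = HttpRequest.newBuilder(URI.create(url)).GET();
            if (!jar.isEmpty()) {
                request.header("Cookie", toCookieHeader(jar)); // send back what we have so far
            }
            HttpResponse<Void> response =
                    client.send(request.build(), HttpResponse.BodyHandlers.discarding());
            Map<String, String> received =
                    parseSetCookie(response.headers().allValues("set-cookie"));

            System.out.println(hop + ". HTTP " + response.statusCode() + " " + url);
            System.out.println("   cookies set: " + received.keySet());

            jar.putAll(received);
            String location = response.headers().firstValue("location").orElse(null);
            url = (location == null) ? null : URI.create(url).resolve(location).toString();
        }
    }
}
```

The output depends entirely on what the site returns at the time you run it, so treat it as a diagnostic aid rather than a fix.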


Conclusion:

If you observe the same behavior for other URLs as well, you will have to write code that captures the cookies from each response and passes them on the subsequent request until you get the desired hidden fields.
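A rough sketch of that loop, assuming nothing beyond the JDK so that it can be tried without the network: the HTTP call sits behind a small Fetcher interface, and the loop keeps a cookie jar, replays it on every attempt, and stops once a caller-supplied check finds what it is looking for. CookieRetrySketch, Page, Fetcher and fetchUntil are hypothetical names of mine, not Jsoup API. The commented-out adapter shows how Jsoup could plausibly be plugged in through Connection.cookies(...), execute(), Response.cookies() and Response.body(); I have not run that part against the site.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

final class CookieRetrySketch {

    /** One HTTP exchange: the cookies the server set and the body it returned. */
    static final class Page {
        final Map<String, String> cookies;
        final String html;

        Page(Map<String, String> cookies, String html) {
            this.cookies = cookies;
            this.html = html;
        }
    }

    /** Hypothetical seam so the loop does not depend on a particular HTTP library. */
    interface Fetcher {
        Page fetch(String url, Map<String, String> cookies);
    }

    /** Re-requests url, carrying all cookies seen so far, until ready accepts the body. */
    static Page fetchUntil(Fetcher fetcher, String url, Predicate<String> ready, int maxAttempts) {
        Map<String, String> jar = new HashMap<>();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Page page = fetcher.fetch(url, new HashMap<>(jar)); // pass a copy of the jar
            jar.putAll(page.cookies);                           // remember what this response set
            if (ready.test(page.html)) {
                return page;
            }
        }
        throw new IllegalStateException("Not ready after " + maxAttempts + " attempts: " + url);
    }

    // Possible Jsoup adapter (assumption, untested against the site):
    //
    // Fetcher jsoup = (u, c) -> {
    //     try {
    //         Connection.Response r = Jsoup.connect(u).cookies(c).execute();
    //         return new Page(r.cookies(), r.body());
    //     } catch (IOException e) {
    //         throw new UncheckedIOException(e);
    //     }
    // };
    // Page page = fetchUntil(jsoup, url, h -> h.contains("id=\"fileId\""), 3);
    // Document doc = Jsoup.parse(page.html, url);
}
```

The check on the raw body is a plain substring match, which is crude; once the loop returns you would parse the body with Jsoup and use select as before. Cap the number of attempts, as here, so that a page which never yields the fields does not loop forever.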
