
Java HtmlUnit download pdf file

I want to download a PDF file from a website using HtmlUnit, but I haven't been able to get it working. The download is triggered by clicking the icon in this form:

<form name="form" action="ADIR_24046/civil/documentos/docuN.php" method="post" target="w1">

    <input type="hidden" name="dtaDoc" value="7F547EA1167820365C20BA632B62A44E0B8F37564FCB3369284927C9763DE47F23DF398C061062F1">

    <i class="fa fa-file-pdf-o fa-lg" aria-hidden="true" style="color:#ab5659; cursor:pointer;" onclick="$(this).closest(&quot;form&quot;).submit();"></i>

</form>

So far, every file I download is reported as corrupt when I try to open it. My code for downloading the files is:

public void getFile(HtmlTableRow row, String folio) throws IOException {        
    HtmlPage pdfPage = (HtmlPage) frame.executeJavaScript("document.getElementById('historiaCiv').children[0].children[0].children[" + 
    row.getIndex() + "].children[1].children[0].children[1].children[0].closest('form').submit()").getNewPage();

    ReadableByteChannel rbc = Channels.newChannel(pdfPage.getWebResponse().getContentAsStream());
    FileOutputStream fos = new FileOutputStream(/* download path */, false);
    fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
}

Is there any good way of doing this?

Without more details and the real page to test against, I can only offer some hints for solving the problem.

Split your problem into two:

  1. click the correct element and make sure HtmlUnit downloads the pdf
  2. get the pdf from your program and save/analyze it
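For the second half, one quick check can tell you whether the bytes you saved are a PDF at all: PDF files begin with the signature `%PDF-`. The sketch below uses only the JDK (Java 11 or later, because of `readNBytes`); the class and method names are made up for illustration. If the check fails and the file starts with `<`, you most likely saved an HTML page rather than the document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of HtmlUnit.
final class PdfSignatureCheck {

    // Every PDF file starts with these five ASCII bytes.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the file begins with the PDF signature. */
    public static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            // Reads up to five bytes; a shorter file yields a shorter array.
            byte[] head = in.readNBytes(PDF_MAGIC.length);
            return Arrays.equals(head, PDF_MAGIC);
        }
    }
}
```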

Before you start:

Make sure there are no JavaScript errors; an error may stop or break the processing. Use the simplest (default) setup of the WebClient, change the configuration only to solve a specific problem, and make sure you know what each change does. Also make sure you use the latest version available (a snapshot build if necessary).

Step 1:

HtmlUnit works like a browser driven by you (your program) instead of by a user clicking around. Normally there should be no need to inject JavaScript as you did in your sample. Find the control the user usually clicks and simply call click() on it. Because of AJAX, you might have to wait some time after the click for all the asynchronous work to finish. Use a web proxy like Charles (or enable HttpClient wire logging) to watch the network traffic. Clicking the right control should make the PDF download show up in Charles.
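A minimal sketch of this step, assuming HtmlUnit 3.x or later (package `org.htmlunit`; the 2.x releases use `com.gargoylesoftware.htmlunit` instead). The class name, the method name and the 10 second wait are illustrative choices, not something taken from your page:

```java
import java.io.IOException;

import org.htmlunit.Page;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlElement;
import org.htmlunit.html.HtmlPage;

// Hypothetical helper for illustration only.
final class ClickStepSketch {

    /** Clicks the element matched by the CSS selector and waits for background JavaScript. */
    public static Page clickAndWait(WebClient webClient, HtmlPage page, String cssSelector)
            throws IOException {
        // Look up the control a real user would click.
        HtmlElement control = page.querySelector(cssSelector);
        if (control == null) {
            throw new IllegalStateException("No element matches " + cssSelector);
        }

        // click() runs the onclick handler, just like a browser would.
        Page afterClick = control.click();

        // Give asynchronous scripts up to 10 seconds to finish (arbitrary limit).
        webClient.waitForBackgroundJavaScript(10_000);
        return afterClick;
    }
}
```

For the markup in the question, a selector such as `form[name='form'] i.fa-file-pdf-o` might match the icon; that selector is a guess from the snippet shown. If the form lives inside a frame, `page` has to be the page enclosed by that frame.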

Step 2:

From your description, I guess you are working with a page that does not do an ordinary HTML-based PDF download. Today there are many 'clever' JavaScript frameworks around that do strange things to make downloads more user friendly. This implies that the download happens asynchronously, so the result of the click operation is usually the HtmlPage itself rather than the PDF. If Step 1 was successful, you have to get the newly opened window from the WebClient and take the (PDF) content from that window.
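Reading the content from the new window could look roughly like this. It is a sketch under assumptions: HtmlUnit 3.x package names (`org.htmlunit`), and a window name you pass in yourself. The form in the question has `target="w1"`, so `"w1"` is the likely candidate, but I could not verify that against the real page. The class and method names are made up.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.htmlunit.Page;
import org.htmlunit.WebClient;
import org.htmlunit.WebResponse;
import org.htmlunit.WebWindow;

// Hypothetical helper for illustration only.
final class SaveWindowContentSketch {

    /** Saves the raw response of the named window and returns its content type. */
    public static String saveWindowContent(WebClient webClient, String windowName, Path target)
            throws IOException {
        // Throws WebWindowNotFoundException if no window with that name exists.
        WebWindow window = webClient.getWebWindowByName(windowName);

        // Use the generic Page type; a PDF response is not an HtmlPage.
        Page enclosed = window.getEnclosedPage();
        WebResponse response = enclosed.getWebResponse();

        // Copy the raw bytes and close the stream when done.
        try (InputStream in = response.getContentAsStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return response.getContentType();
    }
}
```

The returned content type is a useful first check: if it is `text/html` rather than `application/pdf`, the window holds a web page and the saved file will not open as a PDF.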

Hope that helps. If you need more help, you will have to provide more details (or maybe you can try a higher-level tool like Wetator, which does a lot of magic to deal with all these strange pages).


 