
Java HtmlUnit download pdf file

I want to download a PDF file from a website using HtmlUnit, but I haven't been able to get it working. The download is triggered by clicking the icon in this form:

<form name="form" action="ADIR_24046/civil/documentos/docuN.php" method="post" target="w1">

    <input type="hidden" name="dtaDoc" value="7F547EA1167820365C20BA632B62A44E0B8F37564FCB3369284927C9763DE47F23DF398C061062F1">

    <i class="fa fa-file-pdf-o fa-lg" aria-hidden="true" style="color:#ab5659; cursor:pointer;" onclick="$(this).closest(&quot;form&quot;).submit();"></i>

</form>

So far, every time I try, the downloaded files are reported as corrupt when I open them. My code for downloading the files is:

public void getFile(HtmlTableRow row, String folio) throws IOException {        
    HtmlPage pdfPage = (HtmlPage) frame.executeJavaScript("document.getElementById('historiaCiv').children[0].children[0].children[" + 
    row.getIndex() + "].children[1].children[0].children[1].children[0].closest('form').submit()").getNewPage();

    ReadableByteChannel rbc = Channels.newChannel(pdfPage.getWebResponse().getContentAsStream());
    FileOutputStream fos = new FileOutputStream(/* download path */, false);
    fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
}

Is there any good way of doing this?

Without more details and the real page to test against, I can only offer some hints for solving the problem.

Split your problem into two:

  1. click the correct element and make sure HtmlUnit downloads the PDF
  2. get the PDF from your program and save/analyze it

Before you start:

Make sure you have no JavaScript errors; an error may stop or break the processing. Use the simplest (default) setup of the WebClient. Change the config only to solve problems, and make sure you know what you are doing. And make sure you use the latest (snapshot) version available.

Step 1:

HtmlUnit works like a browser driven by you (your program) instead of a user clicking around. Normally there should be no need to inject JavaScript like you did in your sample. Find the control the user usually clicks and simply call click() on it. Because of AJAX, you might have to wait some time after the click for all the async work to finish. Use a web proxy like Charles (or enable HttpClient wire logging) to see the network traffic. Clicking the right control should make the PDF download visible in Charles.
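A minimal sketch of Step 1, not tested against the real site. It assumes HtmlUnit 3.x or later (package `org.htmlunit`; the 2.x releases use `com.gargoylesoftware.htmlunit` instead). The URL and the CSS selector are placeholders guessed from the posted markup, and the sketch assumes the form sits in the top-level page rather than inside a frame. After the click it lists every open window with the content type of its page, which is a quick way to see whether a PDF response arrived at all:

```java
import java.io.IOException;

import org.htmlunit.Page;
import org.htmlunit.WebClient;
import org.htmlunit.WebWindow;
import org.htmlunit.html.DomElement;
import org.htmlunit.html.HtmlPage;

public class ClickPdfIcon {
    public static void main(String[] args) throws IOException {
        // Hypothetical URL: replace it with the page that contains the form.
        String url = "https://example.com/page-with-the-form";

        // Default WebClient setup, as recommended above.
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = webClient.getPage(url);

            // Assumed selector for the icon the user clicks (taken from the posted markup).
            DomElement icon = page.querySelector("form[name='form'] i.fa-file-pdf-o");
            if (icon == null) {
                throw new IllegalStateException("PDF icon not found; check the selector");
            }

            // Click like a user would; the onclick handler submits the form.
            icon.click();

            // Give background JavaScript up to 10 seconds to finish.
            webClient.waitForBackgroundJavaScript(10_000);

            // List all windows and what they currently hold.
            for (WebWindow window : webClient.getWebWindows()) {
                Page enclosed = window.getEnclosedPage();
                if (enclosed != null) {
                    System.out.println("window '" + window.getName() + "' -> "
                            + enclosed.getWebResponse().getContentType());
                }
            }
        }
    }
}
```

If one of the windows reports `application/pdf` (the posted form targets a window named `w1`), the click worked and the remaining problem is only how to read that window's content.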

Step 2:

From your info I guess you are working with a page that does not do an ordinary HTML-based PDF download. Today there are many 'clever' JavaScript frameworks around that do strange things to make the download more user friendly. This implies that the download is done asynchronously, and for you the result of the click operation is usually the HtmlPage instead of the PDF. If Step 1 was successful, you have to get the newly opened window from the WebClient and take the (PDF) content from that one.
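Once you have the content stream of the new window (for example from `getEnclosedPage().getWebResponse().getContentAsStream()` on that window), it helps to check what you actually received before writing it to disk. A "corrupt" PDF is often just an HTML page saved under a .pdf name. The sketch below uses only the JDK (Java 11+ for `readNBytes`); the class and method names are made up for illustration. It relies on the fact that a well-formed PDF starts with the bytes `%PDF-`:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Hypothetical helper, not part of HtmlUnit.
class PdfSaver {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the stream starts with "%PDF-"; the read position is restored afterwards. */
    static boolean looksLikePdf(BufferedInputStream in) throws IOException {
        in.mark(PDF_MAGIC.length);
        byte[] head = in.readNBytes(PDF_MAGIC.length);
        in.reset();
        return Arrays.equals(head, PDF_MAGIC);
    }

    /** Writes the stream to target and returns the byte count; rejects content that is not a PDF. */
    static long savePdf(InputStream content, Path target) throws IOException {
        // try-with-resources closes the stream even if the check or the copy fails.
        try (BufferedInputStream in = new BufferedInputStream(content)) {
            if (!looksLikePdf(in)) {
                throw new IOException("Response is not a PDF (maybe an HTML page?)");
            }
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

If `savePdf` throws, you are reading from the wrong page or window, so go back to Step 1 instead of debugging the file writing.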

Hope that helps. If you need more help you have to provide more details (or maybe you can try a higher-level tool like Wetator, which does a lot of magic to deal with all these strange pages).
