
JAVA: how to download webpage dynamically created by servlet

I want to download the source of a webpage (i.e. the entire content, with all HTML markup) to a file (*.htm) from this URL:

http://isap.sejm.gov.pl/DetailsServlet?id=WDU20061831353

which works perfectly fine with the FileUtils.copyURLToFile method.

However, the said URL also contains some links, for instance one which I'm very interested in:

http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true

This link works perfectly fine if I open it in a regular browser, but when I try to download it in Java by means of FileUtils, I get only a page with no content and the single message "trwa ladowanie danych" (meaning "loading data..."); nothing further happens, and the target page is never loaded.

Could anyone help me with this? From the URL I can see that the page uses servlets - is there a special way to download pages created with servlets?

Regards --

This isn't a servlet issue - that just happens to be the technology used to implement the server, and generally clients don't need to care about it. I strongly suspect the server is simply responding with different data depending on the request headers (e.g. User-Agent). For example, I see a very different response when I fetch the page with curl compared to when I load it in Chrome.

I suggest you experiment with curl, making a request that looks as close as possible to a request from a browser, and then fiddle with it until you can find out exactly which headers are involved. You might want to use Wireshark or Fiddler to make it easy to see the exact requests and responses involved.
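Once curl has identified the relevant headers, the same request can be reproduced in Java. Below is a minimal sketch using the JDK's HttpURLConnection; the header values are assumptions copied from a typical desktop browser, not values confirmed to be the ones this particular server keys on:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class BrowserLikeFetch {

    // Configure the connection with headers a desktop browser would send.
    // Which of these the server actually inspects is an assumption to be
    // verified with curl, as described above.
    static HttpURLConnection browserLikeConnection(String pageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
        conn.setRequestProperty("Accept",
                "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        conn.setRequestProperty("Accept-Language", "pl,en;q=0.5");
        return conn;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = browserLikeConnection(
                "http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true");
        // Save the response body to a file, line by line.
        try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter("related.htm", "UTF-8")) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line);
            }
        }
    }
}
```

Note that openConnection() does not actually connect; the headers are only sent when the stream is first read, so they can all be set up front.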

Of course, even if you can fetch the original HTML correctly, there's still all the JavaScript: it would be entirely feasible for the HTML to contain none of the data, but to include JavaScript which does the actual data fetching. I don't believe that's the case for this particular page, but you may well find it happens for other pages.

Try using Selenium WebDriver on the main page:

import java.util.concurrent.TimeUnit;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

HtmlUnitDriver driver = new HtmlUnitDriver(true); // true enables JavaScript
driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
driver.get(baseUrl);

and then navigate to the link:

driver.findElement(By.name("name of link")).click();

UPDATE: I checked the following: if I turn off cookies in Firefox and then try to load my page:

http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true

then I get the incorrect result, just like in my Java app (i.e. the page with the "loading data" message instead of the proper content).

Now, how can I manage cookies in Java in order to download this page properly?
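The Firefox experiment above suggests the server sets a session cookie on the first request and only serves the real content when that cookie is sent back. A minimal sketch of that idea with the JDK's built-in CookieManager (the two-request sequence is an assumption based on the cookie experiment, not confirmed server behaviour):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CookieAwareDownload {

    public static void main(String[] args) throws IOException {
        // Install a process-wide cookie store, so any Set-Cookie header
        // returned by the first request is replayed on later requests.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);

        // First request: load the details page, which presumably
        // establishes the session and its cookie.
        readAll("http://isap.sejm.gov.pl/DetailsServlet?id=WDU20061831353");

        // Second request: now carries the stored cookie, so the server
        // should return the real content instead of the
        // "trwa ladowanie danych" stub.
        String html = readAll(
                "http://isap.sejm.gov.pl/RelatedServlet?id=WDU20061831353&type=9&isNew=true");
        try (PrintWriter out = new PrintWriter("related.htm", "UTF-8")) {
            out.print(html);
        }
    }

    // Read an entire URL into a string.
    static String readAll(String pageUrl) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(pageUrl).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }
}
```

Setting the CookieHandler once is enough: every subsequent URLConnection in the JVM goes through it, so the request-header approach from the earlier answer can be combined with this freely.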
