
In Java, save a webpage's source and its linked resources using jsoup

I am trying to save an entire webpage, including its linked stylesheets and JavaScript. I can save the page itself, but all scripting and styling is lost when I open the saved copy. I need to save those linked sources as well as the HTML, for example:

<link href="/thePage.css" rel="stylesheet" type="text/css">
<script language="Javascript" type="text/Javascript" src="/thePage.js"></script>

So far I have:

Document doc = Jsoup.connect("http://www.thePage.com").get();
logger.info(doc.html());

This should be possible with jsoup, but it takes some work. Once you have the Document, you can call select() with a jsoup CSS selector to retrieve the matching Elements. For example:

Elements media = doc.select("script[src]");
Elements links = doc.select("link[href]");

You can then iterate over the Elements found and download each resource. Something like the following downloads one file:

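// linkUrl is the absolute URL of one resource; URL_TO_PARSE is the URL of the page itself.
byte[] bytes = Jsoup.connect(linkUrl)
        .header("Accept-Encoding", "gzip, deflate")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
        .referrer(URL_TO_PARSE)
        .ignoreContentType(true)   // accept CSS, JavaScript and other non-HTML content types
        .maxBodySize(0)            // 0 = no limit on the response size
        .timeout(600000)           // milliseconds (10 minutes)
        .execute()
        .bodyAsBytes();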

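Make sure the URLs you pass to Jsoup.connect() are absolute. Script and stylesheet locations in a page are often relative paths (such as /thePage.css in the question), so resolve them against the page URL first, for example with element.absUrl("src") or element.attr("abs:href").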
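One way to construct those URLs is to let jsoup resolve each reference against the document's base URI. The following is a minimal sketch rather than a complete solution: the class name ResourceUrls and the method collect are hypothetical, and the page is parsed from a string so the example runs without network access. A Document fetched with Jsoup.connect(...).get() already carries its base URI.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, for illustration only.
class ResourceUrls {

    // Returns the absolute URLs of the external scripts and <link> resources in the page.
    static List<String> collect(String html, String baseUri) {
        // The base URI lets jsoup resolve relative paths such as "/thePage.css".
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element script : doc.select("script[src]")) {
            urls.add(script.absUrl("src"));
        }
        for (Element link : doc.select("link[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<link href=\"/thePage.css\" rel=\"stylesheet\" type=\"text/css\">"
                + "<script src=\"/thePage.js\"></script>";
        for (String url : collect(html, "http://www.thePage.com/")) {
            System.out.println(url);
        }
    }
}
```

Note that absUrl() returns an empty string when a reference cannot be resolved to an absolute URL, so it is worth skipping empty results before downloading.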

You then need to save the bytes to files, in a directory hierarchy that matches the references in the source HTML file. It can be quite a bit of work. Good luck.
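The saving step might look roughly like the sketch below, which uses only the Java standard library. The class ResourceSaver and its methods are hypothetical names, and the mapping is an assumption: it mirrors the URL path under an output directory, ignores the query string, and uses index.html for paths that end in a slash.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
class ResourceSaver {

    // Maps a resource URL to a file under outputDir that mirrors the URL path,
    // e.g. http://www.thePage.com/css/thePage.css -> <outputDir>/css/thePage.css.
    static Path localPathFor(Path outputDir, String resourceUrl) {
        String urlPath = URI.create(resourceUrl).getPath();
        if (urlPath == null || urlPath.isEmpty()) {
            urlPath = "/";
        }
        if (urlPath.endsWith("/")) {
            urlPath += "index.html"; // assumed default file name
        }
        String relative = urlPath.replaceFirst("^/+", "");
        Path root = outputDir.toAbsolutePath().normalize();
        Path target = root.resolve(relative).normalize();
        // Refuse paths containing ".." segments that would leave the output directory.
        if (!target.startsWith(root) || target.equals(root)) {
            throw new IllegalArgumentException("Unsafe resource path: " + resourceUrl);
        }
        return target;
    }

    // Writes the downloaded bytes, creating parent directories as needed.
    static Path save(Path outputDir, String resourceUrl, byte[] bytes) throws IOException {
        Path target = localPathFor(outputDir, resourceUrl);
        Files.createDirectories(target.getParent());
        return Files.write(target, bytes);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("saved-page");
        byte[] css = "body { margin: 0; }".getBytes(StandardCharsets.UTF_8);
        System.out.println(save(dir, "http://www.thePage.com/css/thePage.css", css));
    }
}
```

In practice the bytes would come from the bodyAsBytes() call shown earlier. Root-relative references such as /thePage.css will not resolve when the saved HTML is opened from the file system, so the attributes in the Document may also need rewriting to relative paths (for example with Element.attr(name, value)) before the HTML is written out.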
