
How to parse HTML source code without downloading the entire page

I am interested in extracting a particular div from the source code of a website. I am able to do this with JSoup by fetching the entire source code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.connect("http://example.com").get();
Element div = doc.getElementById("importantDiv");

However, the problem is that I need to do this about 20,000 times a day to catch every change that happens in the div. Downloading and building the whole document every time would use a lot of network bandwidth, which I would like to avoid. Is there a way to extract the required element without re-creating the entire document on the client side?

Note: The code snippet is only an example; the URL and ID are placeholders, not the actual ones I need to extract.

I don't believe you can request specific portions of a web page. JSoup is basically a web client, and a web client has no control over what the server sends it. The server dictates what is sent, so you can't request a segment of a web page without requesting the entire page.

Do you have access to this webpage, or is it an external website?

If you don't have control of the server side, you cannot do it. You will need to download the complete HTML. Note, however, that this is only the HTML, not the other resources such as stylesheets, images, and JavaScript files.

To save bandwidth, you would need to install some code on the server so that it serves just the bits of information required.
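If you do control the server, the idea above could look roughly like the following. This is only a minimal sketch using the JDK's built-in com.sun.net.httpserver.HttpServer; the class name FragmentServer, the path /fragment/importantDiv, and the currentFragment() helper are hypothetical names made up for illustration, not part of any existing site.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

class FragmentServer {

    // Hypothetical helper: a real application would render the current
    // content of the div here instead of returning a fixed string.
    static String currentFragment() {
        return "<div id=\"importantDiv\">latest content</div>";
    }

    // Starts a server that answers /fragment/importantDiv with only the div,
    // so a polling client never has to download the full page.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/fragment/importantDiv", exchange -> {
            byte[] body = currentFragment().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

The client can then point Jsoup.connect(...) at that path and receive a response of a few dozen bytes instead of the whole page.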

Take a look at the URLConnection class. You can use it to open a connection to a URL, get the connection's input stream, and read only as many bytes as you need, so you do not have to download the entire document. Unfortunately, you generally cannot start downloading from an offset: unless the server honors HTTP Range requests for that page, you always have to read the document from its beginning.
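A rough sketch of that approach, under some assumptions: the page is UTF-8, and there is some string that reliably appears right after the div you care about. The marker "<!--end-->" below is a made-up placeholder, and the class and method names (PartialFetch, readUntil, fetchPrefix) are hypothetical. The code reads from the stream until the marker shows up or a size cap is hit, then closes the connection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class PartialFetch {

    // Reads from 'in' until the accumulated text contains endMarker or
    // maxChars characters have been read, and returns the text up to and
    // including the marker (or everything read, if the marker never appears).
    static String readUntil(Reader in, String endMarker, int maxChars) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while (sb.length() < maxChars
                && (n = in.read(buf, 0, Math.min(buf.length, maxChars - sb.length()))) != -1) {
            // Start searching slightly before the new chunk in case the
            // marker straddles two reads.
            int searchFrom = Math.max(0, sb.length() - endMarker.length() + 1);
            sb.append(buf, 0, n);
            int idx = sb.indexOf(endMarker, searchFrom);
            if (idx >= 0) {
                sb.setLength(idx + endMarker.length());
                break;
            }
        }
        return sb.toString();
    }

    // Opens the URL and stops reading once the marker has been seen.
    static String fetchPrefix(String url, String endMarker, int maxChars) throws IOException {
        URLConnection conn = URI.create(url).toURL().openConnection();
        try (InputStream raw = conn.getInputStream();
             Reader in = new BufferedReader(new InputStreamReader(raw, StandardCharsets.UTF_8))) {
            return readUntil(in, endMarker, maxChars);
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory demo so the example runs without network access.
        String page = "<html><body><div id=\"importantDiv\">hello</div><!--end-->"
                + "<p>rest of page</p></body></html>";
        System.out.println(readUntil(new StringReader(page), "<!--end-->", 10_000));
    }
}
```

The truncated prefix can be passed to Jsoup.parse(String), which tolerates unclosed tags, and then queried with getElementById as before. How much bandwidth this saves depends on how early the div appears in the page, and some data beyond the marker may already have been sent into network buffers before the connection is closed.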
