Parsing an HTML file using XPath in Java

I have Java code that reads the source of a URL and saves it to a file (source.html). Now I want to extract some values from the saved page using XPath. Suppose I want to read the price: //div[@itemprop='price']//text()

How should I proceed? Can I run XPath directly against the saved HTML page, or should I first convert it to an XML file and then use XPath? I have heard about HTML cleaners/parsers; should I use one here? Please do not just point to another website for answers; if you do, point me to a direct and simple tutorial. Modifying the code below would be very helpful.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class jSoupContentRead {
    public static void main(String[] args) throws IOException {
        // Fetch the page and parse it into a jsoup Document.
        Document doc = Jsoup.connect(
                "http://itunes.apple.com/us/book/a-way-home/id982665320?mt=11")
                .get();

        // try-with-resources flushes and closes the writer, so the HTML
        // is actually written to source.html.
        try (PrintWriter op = new PrintWriter(new FileWriter("source.html"))) {
            op.write(doc.toString());
        }
        System.out.println(doc.toString());
    }
}

Generally (across languages), XPath is applied to a DOM structure. In PHP there is a standard procedure:

  1. Get the HTML
  2. Make it valid XML (this step may be optional)
  3. Create a DOMDocument instance from it
  4. Create a DOMXPath instance from the DOMDocument
  5. Apply the XPath query to the DOMXPath instance

I think there should be something similar in Java.
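
For illustration, here is a minimal sketch of steps 3 to 5 using only the JDK's built-in javax.xml classes. It assumes the markup has already been made well-formed (step 2), for example by an HTML cleaner, since the standard XML parser rejects typical real-world HTML. The class name XPathPriceExample, the helper extractPrice, and the sample markup with its price are made up for this example; only the XPath expression comes from the question.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathPriceExample {

    // Hypothetical helper: parses well-formed markup and evaluates the
    // XPath expression from the question against it.
    static String extractPrice(String wellFormedMarkup) {
        try {
            // Step 3: build a W3C DOM document from the markup.
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(
                    new InputSource(new StringReader(wellFormedMarkup)));

            // Step 4: create an XPath evaluator.
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Step 5: evaluate the query. The String overload returns the
            // text of the first matching node, or "" if nothing matches.
            return xpath.evaluate("//div[@itemprop='price']//text()", dom).trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for the cleaned source.html.
        String sample =
                "<html><body><div itemprop=\"price\">$4.99</div></body></html>";
        System.out.println(extractPrice(sample)); // prints $4.99
    }
}
```

To use this with the saved page, the content of source.html would first have to pass through a cleaning step that produces well-formed XML; feeding raw HTML to DocumentBuilder will usually fail with a parse error.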
