Java - Getting the HTML source of a page for XPath queries

Question

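I'm trying to do something simple (at least I thought it was simple): fetch the HTML of a web page and build a DOM from it so that I can run XPath queries against it.

I've found countless examples showing how to use XPath on a local XML file in Java, but nothing on doing the same with source fetched from a website.

I've worked out how to do this in PHP with the following code...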

$url = 'pagehtmlhere';
$output = file_get_contents($url);
$doc = new DOMDocument();

libxml_use_internal_errors(true); //Suppress warnings for HTML5 conversion issue
$doc->loadHTML($output);
libxml_use_internal_errors(false); //Start Showing Errors

$xpath = new DOMXpath($doc);


$TitleString = "//h2[@class='title']/text()";
$BodyString = "//section[@id='body']/text()";
$ImageString = "//img[@id='iwi']/@src";



$titleQuery = $xpath->query($TitleString);
$title = $titleQuery->item(1)->nodeValue;

$bodyText = "";
$textQuery = $xpath->query($BodyString);

foreach ($textQuery as $text) {
    $bodyText .= $text->nodeValue . " ";
}


$imageQuery = $xpath->query($ImageString);
$imageSrc = $imageQuery->item(0)->nodeValue;

But I have absolutely no idea how to do this in Java.

I tried the following code...

            URL url = new URL(PageURL);
            URLConnection conn = url.openConnection();


            //FileInputStream file = new FileInputStream(new File("c:/employees.xml"));


            InputStream file = conn.getInputStream();
            DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();

            DocumentBuilder builder =  builderFactory.newDocumentBuilder();

            Document xmlDocument = builder.parse(file);

            XPath xPath =  XPathFactory.newInstance().newXPath();



           // System.out.println("*************************");
            String expression = "//div[contains(@class,\"carousel\")]/descendant-or-self::*[img]/img/@src";
            //System.out.println(expression);
            String email = xPath.compile(expression).evaluate(xmlDocument);
           // System.out.println(email);

            Log.d("email", email);

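But, of course, I get an error on the line [InputStream file = conn.getInputStream();], because this is obviously the wrong way to go about it.

Can anyone help me out with a working example? And please, absolutely no HTML parsers such as HTMLCleaner or similar nonsense. I spent hours trying to get HTML Cleaner to allow "Asset" XPath searches and it was a nightmare; I really don't want to deal with it, and I shouldn't have to depend on someone else's library at all.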

Answer 1

After a long search, I found the answer. I just needed to open an HTTP connection and take the input stream from that...

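        // Needs java.io.*, java.net.*, javax.xml.parsers.*, javax.xml.xpath.*,
        // org.w3c.dom.Document and org.xml.sax.InputSource
        URL url = new URL(PageURL);

        HttpURLConnection c = (HttpURLConnection) url.openConnection();
        c.setConnectTimeout(8000);  // milliseconds
        c.setReadTimeout(15000);    // milliseconds
        BufferedReader inn = new BufferedReader(new InputStreamReader(
                c.getInputStream()));
        // j is a variable from the surrounding code (not shown here)
        Log.d("TAG", "-----> Got response on Thread" + String.valueOf(j));
        StringBuilder sb = new StringBuilder();
        String l = null;
        while ((l = inn.readLine()) != null) {
            sb.append(l).append('\n');  // readLine() strips the line terminator
        }
        inn.close();

        // builder.parse(String) treats its argument as a URI, not as markup,
        // so the downloaded text is wrapped in an InputSource instead
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document xmlDocument = builder.parse(new InputSource(new StringReader(sb.toString())));

        XPath xPath =  XPathFactory.newInstance().newXPath();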
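
The snippet above stops once the XPath object is created. As a rough sketch of the remaining step, the class below parses markup held in a String and runs the three queries from the PHP version in the question. The class name XPathOnString, the helper names parse, first and all, and the sample page are made up for illustration. It assumes the page is well-formed XML (XHTML); the JDK parser throws on markup that is not, which is the case for much real-world HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathOnString {

    // Parse markup held in a String. DocumentBuilder.parse(String) would
    // interpret the argument as a URI, so it is wrapped in an InputSource.
    static Document parse(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    // Evaluate an expression and return the string value of the first match.
    static String first(Document doc, String expression) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        return xPath.evaluate(expression, doc);
    }

    // Evaluate an expression and return the value of every matching node.
    static List<String> all(Document doc, String expression) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Made-up, well-formed sample page standing in for the downloaded source.
        String page = "<html><body>"
                + "<h2 class=\"title\">Hello</h2>"
                + "<section id=\"body\">First <b>bold</b> second</section>"
                + "<img id=\"iwi\" src=\"pic.png\"/>"
                + "</body></html>";
        Document doc = parse(page);
        System.out.println(first(doc, "//h2[@class='title']/text()"));
        System.out.println(all(doc, "//section[@id='body']/text()"));
        System.out.println(first(doc, "//img[@id='iwi']/@src"));
    }
}
```

In the answer's code, the same parse call would take sb.toString() in place of the sample page. Requesting XPathConstants.NODESET corresponds to the foreach loop over query results in the PHP version, while the two-argument evaluate only returns the first match as a string.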

Solution 1
0 Accepted 2014-12-12 04:35:50
