
Java - Get HTML Page Source code for Xpath query

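I am trying to do something simple (at least I thought it was simple): pull the HTML code from a web page and build a DOM from it so that I can run XPath queries against it.

I have found countless examples of how to use XPath on local XML files in Java, but I have come up empty-handed when it comes to getting the source code from a website first.

I have learned how to do this in PHP with the following code...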

$url = 'pagehtmlhere';
$output = file_get_contents($url);
$doc = new DOMDocument();

libxml_use_internal_errors(true); // Suppress warnings caused by HTML5 markup
$doc->loadHTML($output);
libxml_use_internal_errors(false); // Start showing errors again

$xpath = new DOMXpath($doc);


$TitleString = "//h2[@class='title']/text()";
$BodyString = "//section[@id='body']/text()";
$ImageString = "//img[@id='iwi']/@src";



$titleQuery = $xpath->query($TitleString);
$title = $titleQuery->item(1)->nodeValue;

$bodyText = "";
$textQuery = $xpath->query($BodyString);

foreach($textQuery as $text){
    $bodyText .= $text->nodeValue . " ";
    }


$imageQuery = $xpath->query($ImageString);
$imageSrc = $imageQuery->item(0)->nodeValue;

But I have absolutely no idea how to do this in Java.

I tried the following code...

            URL url = new URL(PageURL);
            URLConnection conn = url.openConnection();


            //FileInputStream file = new FileInputStream(new File("c:/employees.xml"));


            InputStream file = conn.getInputStream();
            DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();

            DocumentBuilder builder =  builderFactory.newDocumentBuilder();

            Document xmlDocument = builder.parse(file);

            XPath xPath =  XPathFactory.newInstance().newXPath();



           // System.out.println("*************************");
            // The stray ') that used to end this expression is not valid XPath
            String expression = "//div[contains(@class,\"carousel\")]/descendant-or-self::*[img]/img/@src";
            //System.out.println(expression);
            String email = xPath.compile(expression).evaluate(xmlDocument);
           // System.out.println(email);

            Log.d("email", email);

But of course I get an error on the line [InputStream file = conn.getInputStream();], because this is apparently the wrong way to go about it.
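The XPath half of this attempt can be checked without any network access. The following is only a minimal sketch: the class name `CarouselImages`, the method `imageSources`, and the `SAMPLE` markup are hypothetical and not from the original post. It assumes the markup is well-formed XML, since the JDK's `DocumentBuilder` rejects anything else. It evaluates the carousel expression as a node set, so every matching `src` is returned instead of only the first one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CarouselImages {

    // Hypothetical sample markup; it has to be well-formed XML.
    static final String SAMPLE =
            "<html><body>"
          + "<div class=\"carousel slide\">"
          + "<a href=\"#\"><img src=\"a.png\"/></a>"
          + "<a href=\"#\"><img src=\"b.png\"/></a>"
          + "</div>"
          + "<img src=\"outside.png\"/>"
          + "</body></html>";

    /** Returns the src of every img found inside a div whose class contains "carousel". */
    static List<String> imageSources(String markup) {
        try {
            // Parse from memory: wrap the text in an InputSource, not a URI string.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));

            XPath xPath = XPathFactory.newInstance().newXPath();
            String expression =
                    "//div[contains(@class,'carousel')]/descendant-or-self::*[img]/img/@src";

            // NODESET yields all matches; the String overload yields only the first.
            NodeList nodes = (NodeList) xPath.evaluate(expression, doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(imageSources(SAMPLE));
    }
}
```

With the sample above, the image outside the carousel div is not matched, which is a quick way to confirm the expression selects what was intended before wiring in the download step.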

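Can anyone help me with a working example? And please do not suggest any HTML parsers such as HTMLCleaner or similar rubbish. I spent hours trying to get HTML Cleaner to allow "Asset" XPath searches, and it was a nightmare; I really don't want to deal with it, and I would rather not depend on someone else's library at all.

After a long search, I found the answer. I just needed to open an HTTP connection and read the input stream from that...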

        URL url = new URL(PageURL);

        HttpURLConnection c = (HttpURLConnection) url.openConnection();
        c.setConnectTimeout(8000);
        c.setReadTimeout(15000);
        BufferedReader inn = new BufferedReader(new InputStreamReader(
                c.getInputStream()));
        Log.d("TAG", "-----> Got response on Thread" + String.valueOf(j));
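        StringBuilder sb = new StringBuilder();
        String l = null;
        while ((l = inn.readLine()) != null) {
            // readLine() strips the line terminator, so put one back
            sb.append(l).append('\n');
        }
        inn.close();


        // builder is the DocumentBuilder created as in the question.
        // parse(String) treats its argument as a URI, so wrap the text in an
        // InputSource (org.xml.sax.InputSource, java.io.StringReader).
        // The markup must be well-formed XML for this parser to accept it.
        Document xmlDocument = builder.parse(
                new InputSource(new StringReader(sb.toString())));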

        XPath xPath =  XPathFactory.newInstance().newXPath();
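
For comparison with the PHP version in the question, here is a minimal sketch that puts the pieces together using only JDK classes. The class name `PageXPath`, its helper methods, and the `SAMPLE` markup are hypothetical and not from the original post. It assumes the page is UTF-8 and well-formed XML (XHTML); ordinary real-world HTML is often not, and the parser will then throw. `fetch` shows the download step and is not exercised in `main`, which runs the three queries from the question against the in-memory sample instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PageXPath {

    // Hypothetical stand-in for a downloaded page.
    static final String SAMPLE =
            "<html><body><h2 class=\"title\">Hello</h2>"
          + "<section id=\"body\">First<br/>second</section>"
          + "<img id=\"iwi\" src=\"pic.png\"/></body></html>";

    /** Downloads the page body as text (assumes UTF-8). */
    static String fetch(String pageUrl) throws IOException {
        HttpURLConnection c = (HttpURLConnection) new URL(pageUrl).openConnection();
        c.setConnectTimeout(8000);
        c.setReadTimeout(15000);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(c.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } finally {
            c.disconnect();
        }
        return sb.toString();
    }

    /** Parses markup held in memory; succeeds only for well-formed XML. */
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
    }

    /** String value of the first match, or "" when nothing matches. */
    static String first(Document doc, String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    /** Values of all matches joined with spaces, like the PHP foreach loop. */
    static String joined(Document doc, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                sb.append(nodes.item(i).getNodeValue()).append(' ');
            }
            return sb.toString().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE); // or parse(fetch(url)) for a live page
        System.out.println(first(doc, "//h2[@class='title']/text()"));
        System.out.println(joined(doc, "//section[@id='body']/text()"));
        System.out.println(first(doc, "//img[@id='iwi']/@src"));
    }
}
```

Keeping the download and the parsing in separate methods means the XPath expressions can be tried on a saved copy of the page first, so a parse error is not mistaken for a connection problem.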

