Java - Get HTML page source code for XPath query
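I'm trying to do something simple (at least I thought it was simple): pull the HTML from a web page and build a DOM from it so that I can run XPath queries against it.
I have found countless examples of using XPath on local XML files in Java, but nothing that covers doing the same after fetching the source from a website.
I learned how to do this in PHP with the following code...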
$url = 'pagehtmlhere';
$output = file_get_contents($url);
$doc = new DOMDocument();
libxml_use_internal_errors(true); // Suppress warnings for HTML5 conversion issue
$doc->loadHTML($output);
libxml_use_internal_errors(false); //Start Showing Errors
$xpath = new DOMXpath($doc);
$TitleString = "//h2[@class='title']/text()";
$BodyString = "//section[@id='body']/text()";
$ImageString = "//img[@id='iwi']/@src";
$titleQuery = $xpath->query($TitleString);
$title = $titleQuery->item(1)->nodeValue;
$bodyText = "";
$textQuery = $xpath->query($BodyString);
foreach ($textQuery as $text) {
    $bodyText .= $text->nodeValue . " ";
}
$imageQuery = $xpath->query($ImageString);
$imageSrc = $imageQuery->item(0)->nodeValue;
But I have no idea how to do this in Java.
I tried the following code...
URL url = new URL(PageURL);
URLConnection conn = url.openConnection();
//FileInputStream file = new FileInputStream(new File("c:/employees.xml"));
InputStream file = conn.getInputStream();
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(file);
XPath xPath = XPathFactory.newInstance().newXPath();
// System.out.println("*************************");
String expression = "//div[contains(@class,\"carousel\")]/descendant-or-self::*[img]/img/@src";
//System.out.println(expression);
String email = xPath.compile(expression).evaluate(xmlDocument);
// System.out.println(email);
Log.d("email", email);
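But, of course, I get an error on the line [InputStream file = conn.getInputStream();], because this is evidently the wrong way to do it.
Can anyone help me with a working example? And please, absolutely no HTML parsers such as HTMLCleaner or similar nonsense. I spent hours trying to get HTML Cleaner to allow an “Asset” XPath search and it was a nightmare; I really don't want to deal with it, and I shouldn't have to rely on someone else's library at all.
After a long search, I found the answer. I just needed to make an HTTP connection and read the input stream from that...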
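URL url = new URL(PageURL);
HttpURLConnection c = (HttpURLConnection) url.openConnection();
c.setConnectTimeout(8000);  // milliseconds
c.setReadTimeout(15000);    // milliseconds
// Read the whole response body into a string.
BufferedReader inn = new BufferedReader(new InputStreamReader(
        c.getInputStream()));
// j comes from the surrounding code (not shown here).
Log.d("TAG", "-----> Got response on Thread" + String.valueOf(j));
StringBuilder sb = new StringBuilder();
String l;
while ((l = inn.readLine()) != null) {
    sb.append(l).append('\n');
}
inn.close();
// builder.parse(String) treats its argument as a URI, not as markup,
// so wrap the text in an InputSource (java.io.StringReader, org.xml.sax.InputSource).
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document xmlDocument = builder.parse(new InputSource(new StringReader(sb.toString())));
XPath xPath = XPathFactory.newInstance().newXPath();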
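Once the page text is in a string, the parse-and-query step might look roughly like the sketch below. It is only an illustration under some assumptions: the class name XPathOnMarkup, the helper names parse and select, and the sample markup are all made up for this example, and the markup is well-formed XML. The JDK's DocumentBuilder is a strict XML parser, so a real page that is not well-formed would throw a SAXParseException at the parse step.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class XPathOnMarkup {

    // Parse markup held in memory. The text is wrapped in an InputSource
    // because DocumentBuilder.parse(String) expects a URI, not the content.
    public static Document parse(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    // Collect the text content of every node the expression matches,
    // similar to looping over $xpath->query(...) in the PHP version.
    public static List<String> select(Document doc, String expression) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Made-up, well-formed sample standing in for the downloaded page.
        String page = "<html><body>"
                + "<h2 class=\"title\">Hello</h2>"
                + "<div class=\"carousel\">"
                + "<a><img src=\"a.png\"/></a><a><img src=\"b.png\"/></a>"
                + "</div>"
                + "</body></html>";
        Document doc = parse(page);
        System.out.println(select(doc, "//h2[@class='title']/text()"));
        System.out.println(select(doc, "//div[contains(@class,'carousel')]//img/@src"));
    }
}
```

Requesting XPathConstants.NODESET rather than the default string result is what allows iterating over several matches; the single-argument evaluate used in the question only returns the string value of the first match.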