Crawl a website for href values from anchor tags using Java
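Baseurl = "test.com/url". I am trying to retrieve the hrefs from the anchor links of a website whose base page is test.com/url, so I need to recurse through all of the hrefs, collect every anchor tag value, and store them in an array. I have implemented it below, but it turns out to be an infinite loop, and I cannot figure out why the infinite loop happens. The href values are stored as "./jobs".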
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.*;

public class test {
    public static void main(String[] args) {
        value("http://www.test.com/urls");
    }

    public static int getIndexOf(String str, String c, int n) {
        int pos = str.indexOf(c, 0);
        while (n-- > 0 && pos != -1) {
            pos = str.indexOf(c, pos + 1);
        }
        return pos;
    }

    public static List<String> list = new ArrayList<String>();

    public static void value(String urladdr) {
        try {
            URL my_url = new URL(urladdr);
            System.out.println(urladdr);
            BufferedReader br = new BufferedReader(new InputStreamReader(my_url.openStream()));
            String strTemp = "";
            while (true) {
                try {
                    strTemp = br.readLine();
                } catch (NullPointerException e) {
                    br.close();
                    break;
                }
                if (strTemp.contains("<a href=\"/urls/")) {
                    if (!list.contains(compute(strTemp))) {
                        list.add(compute(strTemp));
                        System.out.println(list);
                    } else {
                        br.close();
                        break;
                    }
                }
            }
            br.close();
            for (int i = 0; i < list.size(); i++) {
                value("http://www.test.com" + list.get(i));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static String compute(String strTemp) {
        int n = getIndexOf(strTemp, "/urls", 0);
        String[] a = strTemp.substring(n).split(">");
        String url = a[0].replaceAll("\"", "");
        String value = a[1].replaceAll("</a", "");
        return url;
    }
}
If you read the Java documentation for BufferedReader.readLine, you will see that it returns:
A String containing the contents of the line, not including any line-termination
characters, or null if the end of the stream has been reached
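In other words, readLine() signals the end of the stream by returning null; it does not throw a NullPointerException, so the catch block in your loop never runs. Simply changing the code to test for null should help.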
strTemp = br.readLine();
if (strTemp == null) {
    break;
}
....
finally {
    br.close();
}
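Putting the null check together with the rest of the crawl, here is a minimal sketch rather than a drop-in fix. It makes several assumptions that are not in the original post: the class name CrawlSketch, the extractHrefs and crawl helpers, the regular expression used to find href values, and the fetch function (a stand-in for downloading a page, so the example runs without a network) are all hypothetical. It also adds a visited set, because even with the null check the original value() method restarts from the first entry of list on every call and may keep revisiting the same pages. Relative hrefs such as "./jobs" are resolved against the current page with URI.resolve.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrawlSketch {
    // Simplistic pattern for <a href="...">; an assumption for this sketch.
    // A real crawler would normally use an HTML parser instead.
    private static final Pattern HREF = Pattern.compile("<a\\s+href=\"([^\"]*)\"");

    // Reads lines until readLine() returns null and collects every href found.
    static List<String> extractHrefs(Reader source) {
        List<String> hrefs = new ArrayList<>();
        // try-with-resources closes the reader even if an exception is thrown
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) { // null means end of stream
                Matcher m = HREF.matcher(line);
                while (m.find()) {
                    hrefs.add(m.group(1));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }

    // Breadth-first crawl. fetch is a hypothetical stand-in that maps a URL
    // to its HTML, or to null when the page is unavailable.
    static Set<String> crawl(String start, Function<String, String> fetch) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already crawled, so skip it instead of looping forever
            }
            String html = fetch.apply(url);
            if (html == null) {
                continue;
            }
            for (String href : extractHrefs(new StringReader(html))) {
                // Resolve relative links such as "./jobs" against the current page
                String next = URI.create(url).resolve(href).toString();
                if (!visited.contains(next)) {
                    queue.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // In-memory "site" used only for demonstration
        Map<String, String> site = Map.of(
            "http://www.test.com/urls/",
            "<a href=\"./jobs\">Jobs</a>\n<a href=\"/urls/\">Home</a>",
            "http://www.test.com/urls/jobs",
            "<a href=\"/urls/\">Back</a>");
        System.out.println(crawl("http://www.test.com/urls/", site::get));
    }
}
```

To crawl a real site, the fetch argument could be replaced with a function that opens the URL and returns the page body as a string; that part is deliberately left out of the sketch.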