
How to extract links from a web page using JSP?

My requirement is to extract all links (the "a href" values) from a web page dynamically. I am using JSP. To be more specific, I am building a meta search engine in JSP, so when the user enters a query, I have to extract the links from the search result pages of Yahoo, Ask, Google, Mamma, etc. To fetch a page as a string, this is the code I am using right now:

    try {
        String sUrl_yahoo = "http://www.mamma.com/result.php?type=web&q=hai+bird&j_q=&l=";
        String nextLine;
        String webPage;
        java.net.URL siteURL = new java.net.URL(sUrl_yahoo);
        java.net.URLConnection siteConn = siteURL.openConnection();
        java.io.BufferedReader in = new java.io.BufferedReader(
                new java.io.InputStreamReader(siteConn.getInputStream()));
        StringBuffer wPage = new StringBuffer(30 * 1024);
        // readLine() strips line terminators, so add one back to keep lines separate
        while ((nextLine = in.readLine()) != null) {
            wPage.append(nextLine).append('\n');
        }
        in.close();
        webPage = wPage.toString();
        out.println(webPage);
    } catch (Exception e) {
        out.println("Error: " + e);
    }

Now, my request is: can you suggest some way to extract the links from the String webPage? Or is there some other way to extract those links? I would prefer doing it without using any external packages.

One quick solution would be to use a regex Matcher object to pull the URLs out:

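    // In a JSP, import these with:
    // <%@ page import="java.util.ArrayList, java.util.regex.Matcher, java.util.regex.Pattern" %>

    // Group 1 captures the URL; note A-Z (the original A-z also matched [ \ ] ^ _ `)
    Pattern p = Pattern.compile("<a +href=\"([a-zA-Z0-9\\:\\-\\/\\.]+)\">");
    Matcher m = p.matcher(webPage);

    ArrayList<String> foundUrls = new ArrayList<String>();

    while (m.find()) {
        foundUrls.add(m.group(1));
    }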

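As written, the pattern only matches anchors where a double-quoted href is the sole attribute, and its character class has no ?, &, = or %, so URLs with query strings are skipped. You will have to loosen the pattern to make it more airtight, but this is a quick and dirty solution without using external libraries.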
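
One possible way to loosen the pattern, offered as a rough sketch rather than a definitive implementation: allow other attributes before href, accept single-quoted and unquoted values, and ignore case. The class name LinkExtractor, the method extractLinks, and the sample HTML are all made up for illustration; only java.util and java.util.regex are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class LinkExtractor {

    // <a, optional other attributes, whitespace, href = value.
    // Group 1: double-quoted value, group 2: single-quoted, group 3: unquoted.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>\"']+))",
            Pattern.CASE_INSENSITIVE);

    // Returns the raw href values in document order (no entity decoding,
    // no resolution of relative URLs).
    static List<String> extractLinks(String html) {
        List<String> urls = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            if (url == null) url = m.group(2);
            if (url == null) url = m.group(3);
            urls.add(url.trim());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Invented sample input for demonstration only.
        String html = "<a href=\"http://example.com/a?x=1\">A</a> "
                + "<A class=\"r\" HREF='/rel/path'>B</A> "
                + "<a href=plain.html>C</a>";
        for (String url : extractLinks(html)) {
            System.out.println(url);
        }
    }
}
```

In the JSP from the question this would be called as LinkExtractor.extractLinks(webPage). It is still a regex, so it will not cope with every page: an attribute value containing ">" before the href breaks the match, entities such as &amp;amp; are left undecoded, and relative URLs are returned as-is.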


 