Reading an html page content and parsing the content in JSP

In this Java web application project I first try to read the content of a page with the getUrlContentString() method (which seems to work), and then display only the content between <a> tags using the proccessString() method. The second method does not behave as expected: it returns a blank page. What is causing the problem?

index.jsp

<%@page contentType="text/html" pageEncoding="UTF-8"%>
<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>JSP Page</title>
    </head>
    <body>
        <%= cookiePac.CookieJar.getUrlContentString("http://help.websiteos.com/"
                + "websiteos/example_of_a_simple_html_page.htm")%>
        <p>
            <%= cookiePac.CookieJar.proccessString()%>
        </p>

    </body>
</html>

CookieJar.java

package cookiePac;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CookieJar {
    private final List<String> cookies;
    private static String rawCookiesString = "";
    private static String rawCookiesString_1 = "";
    public CookieJar () {
        this.cookies = new ArrayList<>();
    }
    /* read the page, store into rawCookiesString */
    public static String getUrlContentString (String theUrl) {
        StringBuilder content = new StringBuilder();
        try {
            URL url = new URL(theUrl);
            URLConnection urlConnection = url.openConnection();
            BufferedReader bufferedReader = new BufferedReader(
                    new InputStreamReader(urlConnection.getInputStream()));
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                content.append(line + "\n");
            }
            bufferedReader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
         rawCookiesString = content.toString();
         return " ";
    }
    /* select the content between <a>  */

    public static String proccessString () {
        Pattern p = Pattern.compile("<a>(.*?)</a>");
        Matcher m = p.matcher(rawCookiesString);
        if (m.find()) {
           rawCookiesString_1 = m.group(1);
        }
        return rawCookiesString_1.toString();
    }
}

I created a project with your code and found several problems:

  1. First, the static HTML returned by the URL you specified (the raw markup without any scripts executed, not what you see in your browser's console window) does not contain any anchor tags. That is why you cannot get any content from this tag. Try, for example, http://www.cssdesignawards.com/ instead of your http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm .

  2. Second, you are trying to match the tag with "<a>(.*?)</a>" . This pattern only matches an anchor tag that has no attributes at all, which is rare in practice because anchors usually carry attributes such as href or a CSS class. Using "<a(.*?)</a>" instead of "<a>(.*?)</a>" increases the chance of a match, although the captured group then also contains the tag's attributes.

  3. Next, your getUrlContentString method is named as if it returns the HTML as a string, but it always returns a string containing a single space. Consider renaming the method or returning rawCookiesString .
  4. Moreover, you have a lot of static methods and static fields. Java is an object-oriented language, and it is better to use non-static methods for the primary logic of the application. In a web application, static fields are also shared between all requests.
  5. Finally, to parse HTML I recommend the jsoup library. It is not hard to get acquainted with, and it provides good tools for HTML parsing. For example, its cookbook covers extracting information from tags.
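
To illustrate points 2 to 4, here is a minimal sketch using only the standard library. The class name AnchorExtractor and the method anchorContents are hypothetical names chosen for this example, and the pattern is a stricter variation of the "<a(.*?)</a>" suggestion above: it tolerates attributes but captures only the text between the tags. The HTML is assumed to have been fetched already and passed in as a string. A regex remains fragile on real-world markup, so treat this as a demonstration rather than a replacement for a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: holds the HTML in an instance field instead of a static one.
class AnchorExtractor {
    // "\\b" stops tags like <abbr> from matching, "[^>]*" allows attributes,
    // and DOTALL lets the anchor content span several lines.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private final String html;

    AnchorExtractor(String html) {
        this.html = html;
    }

    // Returns the inner content of every anchor tag, in document order.
    List<String> anchorContents() {
        List<String> contents = new ArrayList<>();
        Matcher matcher = ANCHOR.matcher(html);
        while (matcher.find()) { // loop, so every anchor is collected, not just the first
            contents.add(matcher.group(1).trim());
        }
        return contents;
    }

    public static void main(String[] args) {
        String sample = "<p><a class=\"nav\" href=\"/home\">Home</a> "
                + "<abbr>n/a</abbr> <a href=\"/about\">About</a></p>";
        for (String content : new AnchorExtractor(sample).anchorContents()) {
            System.out.println(content);
        }
    }
}
```

In the JSP, the string returned by a corrected getUrlContentString would be passed to the constructor, so nothing is shared through static state between requests.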
