
Removing html tags

I have a professor who requires that we remove HTML tags (anything between < and >) without using the removeAll method.

I currently have this:

public static void main(String[] args)
        throws FileNotFoundException {
    Scanner input = new Scanner(new File("src/HTML_1.txt"));
    while (input.hasNext())
    {
        String html = input.next();
        System.out.println(stripHtmlTags(html));
    }

}

static String stripHtmlTags(String html)
{
    int i;
    String[] str = html.split("");
    String s = "";
    boolean tag = false;

    for (i = html.indexOf("<"); i < html.indexOf(">"); i++) 
    {
        tag = true;
    }

    if (!tag) 
    {
        for (i = 0; i < str.length; i++) 
        {
            s += str[i];
        }
    }
    return s;   
}

This is what is inside the file:

<html>
<head>
<title>My web page</title>
</head>
<body>
<p>There are many pictures of my cat here,
as well as my <b>very cool</b> blog page,
which contains <font color="red">awesome
stuff about my trip to Vegas.</p>


Here's my cat now:<img src="cat.jpg">
</body>
</html>

This is what the output should look like:

My web page


There are many pictures of my cat here,
as well as my very cool blog page,
which contains awesome
stuff about my trip to Vegas.


Here's my cat now:

String is immutable in Java + You never display anything

I recommend you close your Scanner when you are done with it (as a best practice), and read the HTML_1.txt file from the user's home directory. The simplest way to close it is a try-with-resources statement like

import java.io.File;
import java.util.Scanner;

public static void main(String[] args) {
    // try-with-resources closes the Scanner automatically
    try (Scanner input = new Scanner(new File(
            System.getProperty("user.home"), "HTML_1.txt"))) {
        while (input.hasNextLine()) {
            String html = stripHtmlTags(input.nextLine().trim());
            if (!html.isEmpty()) { // <-- removes empty lines.
                System.out.println(html);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Because String is immutable I would recommend a StringBuilder to remove the HTML tags like

static String stripHtmlTags(String html) {
    StringBuilder sb = new StringBuilder(html);
    int open;
    while ((open = sb.indexOf("<")) != -1) {
        int close = sb.indexOf(">", open + 1);
        if (close == -1) {
            break; // no closing '>' on this line: leave the rest as-is
        }
        sb.delete(open, close + 1); // remove the tag, brackets included
    }
    return sb.toString();
}

When I run the above I get

My web page
There are many pictures of my cat here,
as well as my very cool blog page,
which contains awesome
stuff about my trip to Vegas.
Here's my cat now:
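
To try the same approach without a file on disk, a Scanner can also read from a String. The following is only a sketch: the class name StripDemo, the helper stripAll, and the shortened sample text are made up for illustration, and the stripHtmlTags method is repeated here (with a guard for a missing >) so the example is self-contained.

```java
import java.util.Scanner;

class StripDemo {
    // Deletes every <...> span; stops if a '<' has no closing '>'.
    static String stripHtmlTags(String html) {
        StringBuilder sb = new StringBuilder(html);
        int open;
        while ((open = sb.indexOf("<")) != -1) {
            int close = sb.indexOf(">", open + 1);
            if (close == -1) {
                break; // unterminated tag: leave the rest as-is
            }
            sb.delete(open, close + 1);
        }
        return sb.toString();
    }

    // Hypothetical helper: strips tags line by line and skips empty lines.
    static String stripAll(String document) {
        StringBuilder out = new StringBuilder();
        try (Scanner input = new Scanner(document)) {
            while (input.hasNextLine()) {
                String line = stripHtmlTags(input.nextLine().trim());
                if (!line.isEmpty()) {
                    out.append(line).append('\n');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "<title>My web page</title>\n"
                + "as well as my <b>very cool</b> blog page,\n"
                + "\n"
                + "Here's my cat now:<img src=\"cat.jpg\">\n";
        System.out.print(stripAll(sample));
    }
}
```

This should print the three non-empty lines with their tags removed.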

Unless I'm going insane, you aren't printing anything. The result is returned and immediately discarded, since no function or variable receives the returned string.

Change

stripHtmlTags(html);

to

System.out.println(stripHtmlTags(html));

Also, you're setting tag to true or false and then applying that to the entire line. You need to keep track of whether you're inside a tag, and ignore those characters if you are.

So loop through each character of the string html. If it is a <, you know that a tag is starting; if it is a >, a tag is ending. If it is neither of these, check whether you're inside a tag (boolean tag), and if you aren't, add the character to the result string.
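
A minimal sketch of that loop, assuming every < has a matching > and that neither character appears in the visible text itself. The class name TagStateDemo is made up for illustration.

```java
class TagStateDemo {
    // Copies every character that lies outside a <...> pair.
    static String stripHtmlTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean tag = false; // true while we are between '<' and '>'
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                tag = true;       // a tag is starting
            } else if (c == '>') {
                tag = false;      // the tag has ended
            } else if (!tag) {
                out.append(c);    // ordinary text: keep it
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: as well as my very cool blog page,
        System.out.println(stripHtmlTags("as well as my <b>very cool</b> blog page,"));
    }
}
```

Because the flag is updated per character rather than per line, text before, between, and after tags on the same line is all kept.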

Like most things in life, there is more than one way to get this to work, but to the main problem...

for (i = html.indexOf("<"); i < html.indexOf(">"); i++) {
    tag = true;
}

if (!tag) {
    for (i = 0; i < str.length; i++) {
        s += str[i];
    }
}

The text starts with <html>, which means that when the first for-loop ends, i will equal 5 (the index of >) and tag will be true. That means it skips the if block and then exits the method, returning an empty String...
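
To make the effect concrete, here is a small sketch that runs the question's method as posted on two whitespace-delimited tokens, which is what input.next() hands it. The class name OriginalBehaviorDemo and the bracket markers in the output are only for illustration.

```java
class OriginalBehaviorDemo {
    // The method from the question, unchanged apart from formatting.
    static String stripHtmlTags(String html) {
        int i;
        String[] str = html.split("");
        String s = "";
        boolean tag = false;
        for (i = html.indexOf("<"); i < html.indexOf(">"); i++) {
            tag = true;
        }
        if (!tag) {
            for (i = 0; i < str.length; i++) {
                s += str[i];
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // Contains a tag, so the whole token is dropped, text included.
        System.out.println("[" + stripHtmlTags("<title>My") + "]"); // prints []
        // No '<' or '>' at all, so the token comes back untouched.
        System.out.println("[" + stripHtmlTags("page") + "]");      // prints [page]
    }
}
```

In other words, the flag decides the fate of the whole input at once, so any text that shares a token with a tag is lost.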

You need to keep looping until you run out of text...

The simplest solution might be to simply start at the start of the String and check each character, ignoring everything between <...>

StringBuilder sb = new StringBuilder(64);
boolean ignore = false;
for (int index = 0; index < text.length(); index++) {
    if (text.charAt(index) == '<') {
        ignore = true;
    } else if (text.charAt(index) == '>') {
        ignore = false;
    } else if (!ignore) {
        sb.append(text.charAt(index));
    }
}
return sb.toString();

Then make sure you print the result System.out.println(stripHtmlTags(html));

Another solution (which could be more efficient) would be to repeatedly trim content off the start of the String, keeping the plain text and discarding each <...>, until there is nothing left of the String...

StringBuilder html = new StringBuilder(text);
StringBuilder result = new StringBuilder(64);
while (html.length() > 0) {

    int startIndex = html.indexOf("<");
    // Look for the closing '>' only after the opening '<'
    int endIndex = (startIndex == -1) ? -1 : html.indexOf(">", startIndex);
    if (endIndex == -1) {
        // Only plain text (or an unterminated tag) remaining...
        result.append(html);
        html.setLength(0);
    } else {
        // Keep the text in front of the tag...
        result.append(html, 0, startIndex);
        // ...then drop that text and the tag itself
        html.delete(0, endIndex + 1);
    }

}
return result.toString();

I've used StringBuilder here as it's more efficient than trying to do String concatenation or assigning the results of String#substring back to another String.

And if you want to be "super", you could use a regular expression and String#split

// Split on every tag; what remains are the pieces of plain text
String[] parts = text.split("<(.*?)>");
StringBuilder sb = new StringBuilder(64);
for (String part : parts) {
    sb.append(part);
}
return sb.toString();

A small recursive method

static String stripHtmlTags2(String html)
{
    int startIndex = html.indexOf("<");
    if (startIndex == -1) {
        // No tag left: nothing more to strip
        return html;
    }
    // Search for the closing bracket only after the opening one
    int endIndex = html.indexOf(">", startIndex);
    if (endIndex == -1) {
        // Start tag without an end: return the rest unchanged
        return html;
    }
    // Cut out this tag, then recurse on what remains
    String strippedString = html.substring(0, startIndex)
            + html.substring(endIndex + 1);
    return stripHtmlTags2(strippedString);
}

Use it like this (in your main)

StringBuilder htmlFreeString = new StringBuilder();
while (input.hasNextLine())
{
    String html = input.nextLine();
    // nextLine() drops the line break, so add one back
    htmlFreeString.append(stripHtmlTags2(html)).append(System.lineSeparator());
}
System.out.print(htmlFreeString.toString());
