How to parse data from HTML tags using Java

I am reading a string from a website that looks something like <HTML CODE HERE>Text I want to get. I want to remove the angle brackets and the text inside them, but my end result is always null.

Here is what I have tried:

try {
        String desc = null;
        StringBuilder sb = new StringBuilder();
        BufferedReader r = new BufferedReader(new InputStreamReader(in));
        String line = null;
        boolean codeBlock;
        codeBlock = false;

        line = "<HTMLCODEHERE>Text I want to get";
        System.out.println("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! STARTING DESC: " + line);

        while((line = r.readLine()) != null) {
            if((line = r.readLine()) == "<") {
                codeBlock = true;
            }
            if((line = r.readLine()) == ">") {
                codeBlock = false;
            }
            if(!codeBlock) {
                sb.append(line);
                desc = sb.toString();
            }
        }

        System.out.println("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! ENDING DESC: " + desc);
        holder.txtContent.setText(desc);
    } catch (IOException e) {
        e.printStackTrace();
    }

Have a look at the Java API for BufferedReader, namely readLine():

Reads a line of text. A line is considered to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a linefeed.

https://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html#readLine()

Therefore your code here:

if((line = r.readLine()) == "<") {
    codeBlock = true;
}
if((line = r.readLine()) == ">") {
    codeBlock = false;
}

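Neither of these conditions will ever be true. readLine() returns a whole line, so it could only match "<" or ">" if a line consisted of exactly that character, and even then == compares object references rather than contents (equals() compares contents). Each of those calls also consumes a further line from the reader, so the line read in the while condition is discarded before you look at it. If the input has only one or two lines, the last call returns null and sb.append(line) appends the text "null", which would explain your result.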
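To see this in isolation, here is a minimal sketch that mirrors the loop from the question but reads from a StringReader instead of the website. The class name ReadLineDemo and the method name questionLoop are made up for illustration, and the Android-specific parts (holder.txtContent) are left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadLineDemo {
    // Mirrors the loop from the question: three readLine() calls per iteration.
    static String questionLoop(String input) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new StringReader(input))) {
            String line;
            boolean codeBlock = false;
            while ((line = r.readLine()) != null) {
                if ((line = r.readLine()) == "<") {   // consumes a second line; == compares references
                    codeBlock = true;
                }
                if ((line = r.readLine()) == ">") {   // consumes a third line
                    codeBlock = false;
                }
                if (!codeBlock) {
                    sb.append(line);                  // appends the text "null" when line is null
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(questionLoop("<HTMLCODEHERE>Text I want to get")); // prints "null"
        System.out.println(questionLoop("one\ntwo\nthree"));                  // prints "three"
    }
}
```

With a single-line input, the while condition uses up the only line and both following calls return null. With three lines, only the third one survives to the append.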

If I understand your question correctly, you want all the text that sits outside the HTML tags? You could use a library such as jsoup, or go for a simpler implementation:

String parse = "<HTMLCODE>My favourite pasta is spaghetti, followed by ravioli</HTMLCODE>";

final char TAG_START = '<';
final char TAG_END = '>';

StringBuilder sb = new StringBuilder();

char[] parseChars = parse.toCharArray();

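// Start outside a tag so that text before the first '<' is kept.
boolean inTag = false;
for (int i = 0; i < parseChars.length; i++) {
    if (parseChars[i] == TAG_START) {
        inTag = true;   // entering a tag: skip everything up to '>'
        continue;
    }
    else if (parseChars[i] == TAG_END) {
        inTag = false;  // tag closed: the following characters are text
        continue;
    }
    if (!inTag) {
        sb.append(parseChars[i]);
    }
}

// Prints: My favourite pasta is spaghetti, followed by ravioli
System.out.println(sb.toString());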
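Since the question reads from a stream rather than a fixed string, the same idea can be applied one character at a time with Reader.read(). The following is only a sketch: the class name TagStripper and the method name stripTags are hypothetical, and the naive scan does not handle '>' inside attribute values, comments, scripts or HTML entities, which is where a real parser such as jsoup would be the safer choice.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class TagStripper {
    // Copies every character that lies outside <...> into the result.
    static String stripTags(Reader in) {
        StringBuilder sb = new StringBuilder();
        boolean inTag = false;
        try {
            int c;
            while ((c = in.read()) != -1) {   // read() returns one char, or -1 at end of stream
                if (c == '<') {
                    inTag = true;
                } else if (c == '>') {
                    inTag = false;
                } else if (!inTag) {
                    sb.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Reader in = new StringReader("<HTMLCODEHERE>Text I want to get");
        System.out.println(stripTags(in)); // prints "Text I want to get"
    }
}
```

In the question's code you would pass the existing BufferedReader (r) to stripTags and hand the returned string to setText.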
