I want to create a safelist that removes all HTML tags from my data except head, body and i. To do that, I used the Safelist class from the jsoup library.
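import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Start from an empty safelist and allow only head, body and i.
Safelist safe_list = Safelist.none();
safe_list.addTags("head", "body", "i");
// A Java string literal cannot span lines, so the lines are concatenated.
String data = "<head>Title here</head>\n"
        + "<body>\n"
        + "<p><b> paragraph 1</b></p>\n"
        + "<p><i> paragraph 2</i></p>\n"
        + "</body>";
String cleaned_data = Jsoup.clean(data, safe_list);
System.out.println(cleaned_data);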
The result I expected was:
<head>
Title here
</head>
<body>
paragraph 1 <i>paragraph 2</i>
</body>
but the result I got was:
<body>
Title here paragraph 1 <i>paragraph 2</i>
</body>
Although the head tag is in the allowed list, it is removed from the data, unlike the body and i tags. What is the problem with the head tag, and what should I do to keep it in the data?
I found a solution. It may not be the exact solution, but it works in my case. The official jsoup website has the following information:
The cleaner and these safelists assume that you want to clean a body fragment of HTML (to add user supplied HTML into a templated page), and not to clean a full HTML document. If the latter is the case, either wrap the document HTML around the cleaned body HTML, or create a safelist that allows html and head elements as appropriate.
Because creating a safelist that allows the html and head elements did not work for me, I took the first suggestion: I cleaned only the body and then added the head back in front of the cleaned output:
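import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

Safelist safe_list = Safelist.none();
safe_list.addTags("body", "i");
// Only the body is passed to the cleaner; the head is handled separately.
String data = "<body>\n"
        + "<p><b> paragraph 1</b></p>\n"
        + "<p><i> paragraph 2</i></p>\n"
        + "</body>";
String cleaned_data = Jsoup.clean(data, safe_list);
// Add the (trusted) head back in front of the cleaned body.
cleaned_data = "<head>Title here</head>" + cleaned_data;
System.out.println(cleaned_data);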
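The wrapping step above can also be pulled into a small helper. This is only a sketch under stated assumptions: `HtmlWrapper` and `wrapDocument` are hypothetical names that are not part of jsoup, the head markup is assumed to be trusted (it is never cleaned), and the body argument is assumed to be the string already returned by Jsoup.clean.

```java
class HtmlWrapper {
    // Hypothetical helper, not a jsoup API: wraps a trusted head and an
    // already-cleaned body fragment in a full HTML document skeleton.
    static String wrapDocument(String trustedHeadHtml, String cleanedBodyHtml) {
        return "<html>\n"
                + "<head>" + trustedHeadHtml + "</head>\n"
                + "<body>\n" + cleanedBodyHtml + "\n</body>\n"
                + "</html>";
    }

    public static void main(String[] args) {
        // Stand-in for the output of Jsoup.clean(bodyHtml, safelist);
        // hard-coded here so the sketch runs without jsoup on the classpath.
        String cleanedBody = "paragraph 1 <i>paragraph 2</i>";
        System.out.println(wrapDocument("<title>Title here</title>", cleanedBody));
    }
}
```

Keeping the head out of the cleaner means it is never sanitized, so this only makes sense when the head comes from your own template rather than from user input.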
Because the correct structure of an HTML file is:
<html>
<head>
<title>Page Title</title>
</head>
<body>
</body>
</html>
your code should be written this way:
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

Safelist safe_list = Safelist.none();
safe_list.addTags("head", "body", "i");
// The title text is now wrapped in a <title> element inside <head>.
String data = "<head><title>Title here</title></head>\n"
        + "<body>\n"
        + "<p><b> paragraph 1</b></p>\n"
        + "<p><i> paragraph 2</i></p>\n"
        + "</body>";
String cleaned_data = Jsoup.clean(data, safe_list);
System.out.println(cleaned_data);
When you just use <head>Title here</head>, jsoup treats the text between the tags as a text node.