I'm trying to get the tag whose text matches the largest number of words from a given list. For example, consider this HTML:
<div id="productTitle" class="a-size-large">Hello world, good morning, have a happy day</div> <div id="productTitle2" class="a-size-large">Hello people of this planet!.</div>
Consider this Java code using the jsoup library:
String html = "<div id=\"productTitle\" class=\"a-size-large\">Hello world, good morning, have a happy day</div> <div id=\"productTitle2\" class=\"a-size-large\">Hello people of this planet!.</div>";
Document doc = Jsoup.parse(html);
List<String> words = new ArrayList<>(Arrays.asList("hello", "world", "morning"));
Element elmnt = doc.select("*:matchesOwn("+words+")");
System.out.println(elmnt.cssSelector());
Expected output: #productTitle
Unfortunately, there is no selector that does this. You can write a small algorithm that does it instead:
Use Document.getAllElements() to get a list of all elements in your document. To get the text that belongs directly to an element, use Element.ownText(). You can then split that text into words and count how many of them appear in your list:
String html = "<div id=\"productTitle\" class=\"a-size-large\">Hello world, good morning, have a happy day</div> <div id=\"productTitle2\" class=\"a-size-large\">Hello people of this planet!.</div>";
Document doc = Jsoup.parse(html);
List<String> words = Arrays.asList("hello", "world", "morning");
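// Map each match count to an element. The TreeMap sorts by count,
// so lastEntry() holds an element with the highest count.
Element elmnt = doc.getAllElements().stream()
        .collect(Collectors.toMap(e -> countWords(words, e.ownText()), Function.identity(), (e0, e1) -> e1, TreeMap::new))
        .lastEntry().getValue();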
This uses Java Streams and a TreeMap to map the number of matching words to the element. If two or more elements have the same number of matching words, the last one is used. If you would rather use the first, use (e0, e1) -> e0 as the merge function.
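The effect of the merge function can be seen in isolation with a minimal sketch that needs only the JDK. The class name TieBreakSketch and the helper best are hypothetical, and string length stands in for the match count, so this illustrates the tie-breaking only and is not the jsoup code itself:

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.stream.Collectors;

public class TieBreakSketch {
    // Hypothetical helper: returns the item with the highest score.
    // Assumption: the score is simply the string length (a stand-in for the match count).
    // onTie receives (earlier item, later item) whenever two items share a score.
    // Note: an empty list would make lastEntry() return null.
    static String best(List<String> items, BinaryOperator<String> onTie) {
        return items.stream()
                .collect(Collectors.toMap(String::length, Function.identity(), onTie, TreeMap::new))
                .lastEntry().getValue();
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("aa", "bb", "c");
        System.out.println(best(items, (e0, e1) -> e1)); // prints "bb" (last of the tied items)
        System.out.println(best(items, (e0, e1) -> e0)); // prints "aa" (first of the tied items)
    }
}
```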
To count the words given in a list you can also use Java Streams. You can use a method like this:
private long countWords(List<String> words, String text) {
    return Arrays.stream(text.split("[^\\w]+"))
            .map(String::toLowerCase)
            .filter(words::contains)
            .count();
}
This splits the text on every run of non-word characters. You can change the regular expression to fit your needs.
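As one possible adjustment, note that \w in Java matches only ASCII letters, digits and the underscore by default, so a word such as "café" would be cut at the accented letter. The following sketch, assuming you want Unicode letters and digits kept together, splits on [^\p{L}\p{N}]+ instead. The class name CountWordsSketch is hypothetical, and a Set replaces the List only to make lookups cheaper:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class CountWordsSketch {
    // Assumption: a separator is any run of characters that are
    // neither Unicode letters nor Unicode digits.
    private static final Pattern SEPARATORS = Pattern.compile("[^\\p{L}\\p{N}]+");

    // Counts how many tokens of text (lower-cased) appear in words.
    // Repeated tokens are counted each time they occur.
    static long countWords(Set<String> words, String text) {
        return SEPARATORS.splitAsStream(text)
                .map(String::toLowerCase)
                .filter(words::contains)
                .count();
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("hello", "world", "morning"));
        System.out.println(countWords(words, "Hello world, good morning, have a happy day")); // prints 3
        System.out.println(countWords(words, "Hello people of this planet!.")); // prints 1
    }
}
```

The words in the set are expected to be lower case already, since only the text is lower-cased before the lookup.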
The result of elmnt.cssSelector() for the HTML code you shared will be #productTitle.