
How to get the specific tag where the text matches most of the words in a given list of words using JSOUP?

I'm trying to get the tag whose text matches the most words from a given list of words. For example, consider this HTML:

 <div id="productTitle" class="a-size-large">Hello world, good morning, have a happy day</div> <div id="productTitle2" class="a-size-large">Hello people of this planet!.</div> 

Consider this Java code using the jsoup library:

String html = "<div id=\"productTitle\" class=\"a-size-large\">Hello world, good morning, have a happy day</div> <div id=\"productTitle2\" class=\"a-size-large\">Hello people of this planet!.</div>";
Document doc = Jsoup.parse(html);    
List<String> words = new ArrayList<>(Arrays.asList("hello", "world", "morning"));
Element elmnt = doc.select("*:matchesOwn("+words+")");
System.out.println(elmnt.cssSelector());

Expected output : #productTitle

Unfortunately there is no selector like this. You can create a little algorithm which does that instead:

Use Document.getAllElements() to get a list of all elements in your document. To get the text that belongs directly to an element, use Element.ownText(). You can then split that text into words and count how many of them appear in the list:

String html = "<div id=\"productTitle\" class=\"a-size-large\">Hello world, good morning, have a happy day</div> <div id=\"productTitle2\" class=\"a-size-large\">Hello people of this planet!.</div>";
Document doc = Jsoup.parse(html);
List<String> words = Arrays.asList("hello", "world", "morning");

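// Requires imports from java.util, java.util.function (Function) and java.util.stream (Collectors).
// countWords(...) is the helper method shown further down.
Element elmnt = doc.getAllElements().stream()
        .collect(Collectors.toMap(e -> countWords(words, e.ownText()), Function.identity(), (e0, e1) -> e1, TreeMap::new))
        .lastEntry().getValue();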

This uses Java Streams and a TreeMap to map the number of matching words to the element; lastEntry() then returns the entry with the highest count. If two or more elements have the same count, the last one is used. If you would rather use the first, use (e0, e1) -> e0 as the merge function.
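To see how the merge function decides ties, here is a small stand-alone sketch that uses plain strings in place of jsoup elements, with the string length standing in for the match count. The class name TieBreakDemo, the pick helper and the sample strings are made up for illustration; only the Collectors.toMap / TreeMap pattern is taken from the code above.

```java
import java.util.List;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.stream.Collectors;

public class TieBreakDemo {

    // Hypothetical helper: keys each text by its length (a stand-in for the
    // match count) and returns the text stored under the largest key.
    // onTie decides which text survives when two texts share a key.
    // Assumes a non-empty list; lastEntry() returns null for an empty map.
    static String pick(List<String> texts, BinaryOperator<String> onTie) {
        TreeMap<Integer, String> byKey = texts.stream()
                .collect(Collectors.toMap(String::length, Function.identity(), onTie, TreeMap::new));
        return byKey.lastEntry().getValue();
    }

    public static void main(String[] args) {
        List<String> texts = List.of("aa", "bbb", "ccc");
        System.out.println(pick(texts, (e0, e1) -> e1)); // keeps the later tied entry: ccc
        System.out.println(pick(texts, (e0, e1) -> e0)); // keeps the earlier tied entry: bbb
    }
}
```

In the merge function the first argument is the value already stored in the map and the second is the one arriving later in the stream, which is why swapping e1 for e0 switches between "last wins" and "first wins".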

To count the words given in a list you can also use Java Streams. You can use a method like this:

private long countWords(List<String> words, String text) {
    return Arrays.stream(text.split("[^\\w]+"))
            .map(String::toLowerCase)
            .filter(words::contains)
            .count();
}

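This splits the text on every run of non-word characters, so punctuation and whitespace both act as separators. By default \w matches only ASCII letters, digits and the underscore, and the words in the list must be lowercase because the text is lowercased before the comparison. You can change the pattern to fit your needs.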
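As a quick check of the counting step, the following stand-alone sketch runs it against the text of the two div elements from the question. The class name WordCountDemo and the countDistinctWords variant are assumptions added for illustration; countWords is the method above, made static so it can be called from main.

```java
import java.util.Arrays;
import java.util.List;

public class WordCountDemo {

    // Same logic as the method above: every occurrence of a listed word counts.
    static long countWords(List<String> words, String text) {
        return Arrays.stream(text.split("[^\\w]+"))
                .map(String::toLowerCase)
                .filter(words::contains)
                .count();
    }

    // Hypothetical variant: each listed word counts at most once per text.
    static long countDistinctWords(List<String> words, String text) {
        return Arrays.stream(text.split("[^\\w]+"))
                .map(String::toLowerCase)
                .filter(words::contains)
                .distinct()
                .count();
    }

    public static void main(String[] args) {
        List<String> words = List.of("hello", "world", "morning");
        System.out.println(countWords(words, "Hello world, good morning, have a happy day")); // 3
        System.out.println(countWords(words, "Hello people of this planet!."));               // 1
        System.out.println(countWords(words, "hello hello"));                                 // 2
        System.out.println(countDistinctWords(words, "hello hello"));                         // 1
    }
}
```

Whether repeated words should count more than once depends on what "matches most of the words" means for your data; the original method counts every occurrence.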

The result of elmnt.cssSelector() for the HTML code you shared will be #productTitle.
