简体   繁体   中英

Find most frequent words on a webpage (using Jsoup)?

In my project I have to count the most frequent words in a Wikipedia article. I found Jsoup for parsing HTML format, but that still leaves the problem of word frequency. Is there a function in Jsoup that count the freqeuncy of words, or any way to find which words are the most frequent on a webpage, using Jsoup ?

Thanks.

Yes, you could use Jsoup to get the text from the webpage, like this:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
String text = doc.body().text();

Then, you need to count the words and find out which ones are the most frequent ones. This code looks promising. We need to modify it to use our String output from Jsoup, something like this:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupWordCount {

   public static void main(String[] args) throws IOException {
        long time = System.currentTimeMillis();

        Map<String, Word> countMap = new HashMap<String, Word>();

        //connect to wikipedia and get the HTML
        System.out.println("Downloading page...");
        Document doc = Jsoup.connect("http://en.wikipedia.org/").get();

        //Get the actual text from the page, excluding the HTML
        String text = doc.body().text();

        System.out.println("Analyzing text...");
        //Create BufferedReader so the words can be counted
        BufferedReader reader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8))));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] words = line.split("[^A-ZÅÄÖa-zåäö]+");
            for (String word : words) {
                if ("".equals(word)) {
                    continue;
                }

                Word wordObj = countMap.get(word);
                if (wordObj == null) {
                    wordObj = new Word();
                    wordObj.word = word;
                    wordObj.count = 0;
                    countMap.put(word, wordObj);
                }

                wordObj.count++;
            }
        }

        reader.close();

        SortedSet<Word> sortedWords = new TreeSet<Word>(countMap.values());
        int i = 0;
        int maxWordsToDisplay = 10;

        String[] wordsToIgnore = {"the", "and", "a"};

        for (Word word : sortedWords) {
            if (i >= maxWordsToDisplay) { //10 is the number of words you want to show frequency for
                break;
            }

            if (Arrays.asList(wordsToIgnore).contains(word.word)) {
                i++;
                maxWordsToDisplay++;
            } else {
                System.out.println(word.count + "\t" + word.word);
                i++;
            }

        }

        time = System.currentTimeMillis() - time;

        System.out.println("Finished in " + time + " ms");
    }

    public static class Word implements Comparable<Word> {
        String word;
        int count;

        @Override
        public int hashCode() { return word.hashCode(); }

        @Override
        public boolean equals(Object obj) { return word.equals(((Word)obj).word); }

        @Override
        public int compareTo(Word b) { return b.count - count; }
    }
}

Output:

Downloading page...
Analyzing text...
42  of
24  in
20  Wikipedia
19  to
16  is
11  that
10  The
9   was
8   articles
7   featured
Finished in 3300 ms

Some notes:

  • This code can ignore some words, like "the", "and", "a" etc. You will have to customize it.

  • It seems to have problems with unicode characters sometimes. Although I don't experience this, someone in the comments did.

  • This could be done better and with less code.

  • Not well tested.

Enjoy !

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM