How can we split words from an HTML file using string operations in Java?

Question

I need to create a method that reads an HTML file and then displays the number of occurrences of each word.

For example: String [] words = {"happy", "nice", "good"};

The word happy was used 7 times.
The word nice was used 1 times.
The word good was used 2 times.

This is what I did:

public static void ReadWriteDisplay() {

    Path in = Paths.get("E:\\TextToHTML.html");
    Path out = Paths.get("E:\\HTMLToText.txt");
    String s = "";
    String str = "";
    try {
        InputStream input = new BufferedInputStream(Files.newInputStream(in));
        BufferedReader reader = new BufferedReader(new InputStreamReader(input));

        OutputStream output = new BufferedOutputStream(Files.newOutputStream(out, CREATE, WRITE, TRUNCATE_EXISTING));
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(output));

        s = reader.readLine();
        while (s != null) {
            str += s;
            writer.write(s);
            writer.newLine();
            s = reader.readLine();
        }
        reader.close();
        writer.close();

        String a[] = str.split(" ");
        System.out.println("str: " + str);
        String[] positive = {"happy", "nice", "good", "joy", "love"};
        int[] count = {0, 0, 0, 0, 0};
        for (int i = 0; i < a.length; i++) {
            if (positive[0].equalsIgnoreCase(a[i]))
                count[0]++;
            if (positive[1].equalsIgnoreCase(a[i]))
                count[1]++;
            if (positive[2].equalsIgnoreCase(a[i]))
                count[2]++;
            if (positive[3].equalsIgnoreCase(a[i]))
                count[3]++;
            if (positive[4].equalsIgnoreCase(a[i]))
                count[4]++;
        }

        for (int x = 0; x < 5; x++) {
            System.out.println("The word " + positive[x] + " was used " + count[x] + " times.");
        }

    } catch (Exception e) {
        System.err.println("Message: " + e);
    }
}

My method runs, but it does not report accurate occurrence counts. The reason is that some words in the HTML are enclosed in tags (<>), so <>Hello<> is stored in my string array instead of the word Hello.

Here is the sample output:

str: <!DOCTYPE html><html lang="en"><head>    <meta charset="utf-8">    <meta http-equiv="X-UA-Compatible" content="IE=edge">    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>    <meta http-equiv="content-language" content="en" />    <meta name="viewport" content="width=device-width, initial-scale=1">    <meta name="google-site-verification" content="rUp8isOBygjhxPJ2qyy6QtBi9vWRFhIboMXucJsCtrE" />    <title>JustPaste.it - Share Text &amp; Images the Easy Way</title>    <link rel="preload" href="/static/img/jp_logo_1_en_v4.png" as="image" />                <meta name="robots" content="noindex, nofollow" />        <meta name="googlebot" content="noindex, nofollow" />                                <link rel="preload" href="/build/global.395f53d0.css" as="style" />            <link rel="stylesheet" type="text/css"  href="/build/global.395f53d0.css" />                    <link rel="shortcut icon" href="/static/other/fav.ico" />             <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->        <!-- WARNING: Respond.js doesn't work if you view the page via file:// -->        <!--[if lt IE 9]>            <script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>            <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>        <![endif]-->        <script>      window.article = 
{"id":42017684,"url":"https:\/\/justpaste.it\/6fn9m","shortUrl":"https:\/\/jpst.it\/2wiek","pdfUrl":"https:\/\/justpaste.it\/6fn9m\/pdf","qrCodeData":"data:image\/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFcAAABXCAIAAAD+qk47AAAACXBIWXMAAA7EAAAOxAGVKw4bAAACCklEQVR4nO2by27DMAwEx0X\/\/5fTAwFdaNB8SEmB7BzjSDEWy4ikpOv1evH1\/Hz6Bf4FUgGkgiEVQCoYv\/6j67omM65FJzOPX6HWKD9PaebSj8oLIBWMm4hYlBIq79Jg+Pqyd3vpR4dvuJAXQCoYUUQsAi9lPOlt74dnloZzbygvgFQwUhExpJft9EKjh7wAUsF4R0QE+Bh5g\/898gJIBSMVEUNzDjOiDMN55AWQCkYUEcOWTqlrtL18KCEvgFQwbiJie7qSMXkpELa\/obwAUsFI7UcEpXHw397bmMh0cXtJVzBKXgCpYFyB3xYlT\/Ye3bzZ7q264EflBZAKRmqHLmPyYJR\/5IeXEqrt8SgvgFQwojoiY9feEpN5VCLo4maQF0AqGLVzTcM\/50UpEdpVj+sUxwNSAao7dJk6erHrhN65umYhL4BUMGoRUTJ56TsBw\/UoM0peAKlg1CrrRamgLnEu6VLW9IBUgLj7Ouz\/DJePHr16RF4AqWA096yDc92lCXs3hjzDyJIXQCoYB+\/Q9Q4vDS9cBPOojnhAKsDRO3R+nl3dp94uhrKmB6QCHL1Dlznp1GsWbUdeAKlgvOPGUK8juqt5mymx5QWQCsbBiCglS5+9KCEvgFQwDt6hO3djdHtfV14AqWAcvEO36B1M6mVNvQpFXgCpYNzs0H0h8gJIBUMqgFQwpALAH\/JvmLtnlWjnAAAAAElFTkSuQmCC"};      window.statsUrl = 'https\u003A\/\/stats.justpaste.it';      window.viewKey = 'x6ER';      window.barOptions = 
{"isLoggedIn":false,"hasPublicProfile":false,"displayOwnership":false,"isArticleOwner":false,"isPasswordProtected":false,"isCaptchaRequired":null,"isCaptchaEntered":false,"captchaSettings":null,"premiumUserData":null,"isPrivate":false,"isExpired":false,"expireAfterRead":false,"isShared":false,"defaultAvatar":"\/static\/img\/avatar60.jpg","createdText":"6h","showLastEdit":false,"modifiedText":"6h","isInTrash":false,"viewsText":"2","favouritesCount":0,"onlineText":"1","getFavouriteArticleUrl":"https:\/\/justpaste.it\/api\/account\/v1\/favourite-article\/42017684","addFavouriteArticleUrl":"https:\/\/justpaste.it\/api\/account\/v1\/favourite-article","removeFavouriteArticleUrl":"https:\/\/justpaste.it\/api\/account\/v1\/favourite-article-delete\/42017684","apiShowArticleDynamicUrl":"\/api\/v1\/article-dynamic","voteUrl":"\/api\/account\/v1\/vote","contentLang":"en","positiveVotes":0,"negativeVotes":0,"currentVote":"empty","linkSharingUrl":null,"linkSharingSecret":null};          </script>        <script src="/build/runtime.a1e5a72a.js" async></script>        <script src="/build/1676.2c557867.js" async></script>        <script src="/build/8452.a9a1e0c5.js" async></script>        <script src="/build/5936.ad26e56d.js" async></script>        <script src="/build/9412.4a605741.js" async></script>        <script src="/build/showarticlewidget.3bbca334.js" async></script>        </head><body marginwidth="0" dir="ltr" marginheight="0"><!-- Static navbar --><div class="navbar navbar-default navbar-static-top mainTableTopMiddle" role="navigation">    <div class="container">        <div class="navbar-header pull-left">            <a href="/"><img src="/static/img/jp_logo_1_en_v4.png" width="186px" height="54px" alt="JustPaste.it" /></a>        </div>        <div class="navbar-header pull-left">            <div class="nav navbar-nav mainTableTopMiddleRight hidden-xs hidden-sm">                <img src="/static/img/jp_logo_2_en_v5.png" width="390px" height="54px" />            </div> 
       </div>        <div class="navbar-header pull-right" style="padding-top:8px">            <div id="mainPanelButtons"></div>        </div>    </div><!--/.nav-collapse --></div><div id="headContainer" class="container" style="max-width: 960px">    <div class="row">        <div class="col-md-12">            <div id="mainTableContent">                <div style="max-width: 960px; vertical-align: top">            <div id="showArticleWidget"><div class="showArticleWidgetPlaceholder"></div></div>        <div id="articleContent">        <p>happy</p> <p>nice nice</p> <p>good good good</p> <p>joy Joy joy Joy joy</p> <p>Love love Love love Love</p>    </div>            <div id="showArticleBottomWidget"><div class="articleBottomWidgetPlaceholder"></div></div>    <span style="visibility:hidden" class="glyphicon glyphicon-link"></span></div>            </div>        </div>    </div> <!-- /row --></div> <!-- /container --><div id="footer" style="min-height: 30px;">    <div class="container" style="vertical-align: middle">        <div class="col-md-3 col-xs-5 col-sm-4 text-muted" style="font-size: 95%;" align="left">            &copy; 2021 <span class="hidden-xs">justpaste.it</span>        </div>        <div class="col-md-9 col-xs-7 col-sm-8 text-muted"  align="right">            <ul class="list-inline basePageFooterList">                <li class="hidden-xs">                    <a href="/login">Account</a>                </li>                <li class="hidden-xs">                    <a href="/terms">Terms</a>                </li>                <li class="hidden-xs">                    <a href="/privacypolicy">Privacy</a>                </li>                <li class="hidden-xs">                    <a href="/cookies">Cookies</a>                </li>                <li>                    <a href="/u/justpasteit">Blog</a>                </li>                <li>                    <a href="/about">About</a>                </li>            </ul>        </div>    </div></div>   
     <script>      window.mainPanelOptions = {        addArticleUrl: '/',        loginUrl: '/login',        logoutUrl: '/logout',        favouriteArticlesUrl: '/account/favourite',        subscribedArticlesUrl: '/account/subscribed',        sharedArticlesUrl: '/account/shared',        manageAccountUrl: '/account/manage',        messagesUrl: '/account/messages',        articlesStatsUrl: '/account/articles-stats',        premiumUrl: '/premium/subscription',        unreadMessagesUrl: 'https://msg.justpaste.it/api/v1/conversation/unread',        profileSettings: '/account/settings',        isLoggedIn: false,        userEmail: null,        userPermalink: null,        userProfileIsPublic: false,        userProfileLink: null      };          </script>        <script src="/build/mainpanelwidget.80530742.js" async></script>        </body></html>

    The word happy was used 0 times.
    The word nice was used 0 times.
    The word good was used 1 times.
    The word joy was used 3 times.
    The word love was used 3 times.

How do I properly split the text or count the number of occurrences? Thank you!
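The miscounts can be reproduced on a fragment of the article markup alone. The following is a minimal sketch, not part of the original post; the class name SplitDemo and its helper methods are illustrative. It shows which tokens split(" ") produces when tags touch the words:

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on single spaces, the same way the method above does.
    static String[] tokens(String html) {
        return html.split(" ");
    }

    // Counts tokens that equal the word, ignoring case.
    static int count(String[] tokens, String word) {
        int n = 0;
        for (String t : tokens) {
            if (word.equalsIgnoreCase(t)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String fragment = "<p>happy</p> <p>nice nice</p> <p>good good good</p>";
        String[] tokens = tokens(fragment);
        // Tags stay attached to the first and last word of each paragraph.
        System.out.println(Arrays.toString(tokens));
        for (String word : new String[] {"happy", "nice", "good"}) {
            System.out.println(word + ": " + count(tokens, word));
        }
    }
}
```

The tokens are <p>happy</p>, <p>nice, nice</p>, <p>good, good and good</p>, so only the middle "good" matches. That agrees with the counts in the sample output: a word is found only when it has a space on both sides.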

Answer 1

You can use jsoup, a Java HTML parser library, to extract all the text from the HTML structure.

Download the jar file from: https://jsoup.org/download

The code below counts the occurrences of each word:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Prints how often each target word occurs in the text of the <body>.
static void countOccurance(String htmlStructure) {
    String[] positive = { "happy", "nice", "good", "joy", "love" };
    // Parse the markup; body().text() returns the text content without tags.
    Document document = Jsoup.parse(htmlStructure);
    String[] text = document.body().text().split("\\s+");
    for (String word : positive) {
        int wordCount = countWord(text, word);
        System.out.println("The word " + word + " was used " + wordCount + " times.");
    }
}

// Counts case-insensitive matches of wordToFind among the tokens.
static int countWord(String[] documentText, String wordToFind) {
    int count = 0;
    for (int i = 0; i < documentText.length; i++) {
        if (wordToFind.equalsIgnoreCase(documentText[i]))
            count++;
    }
    return count;
}
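Once the body text has been extracted, for example with document.body().text() as above, the counting step can also be done in a single pass with a map instead of rescanning the array for every word. This is a rough sketch using only the standard library; the class WordTally and method tally are hypothetical names, and the text in main is a stand-in for the extracted text rather than real jsoup output. Unlike the answer's split("\\s+"), it splits on runs of non-letters, which is a design choice of this sketch so that "good," or "good." still counts as "good".

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class WordTally {
    // Returns one count per target word, in the order the targets were given.
    static Map<String, Integer> tally(String text, String... targets) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String target : targets) {
            counts.put(target.toLowerCase(Locale.ROOT), 0);
        }
        // Split on runs of non-letter characters (assumption of this sketch).
        for (String token : text.split("[^\\p{L}]+")) {
            String key = token.toLowerCase(Locale.ROOT);
            if (counts.containsKey(key)) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for the extracted article text; jsoup is not used here.
        String text = "happy nice nice good good good joy Joy joy Joy joy Love love Love love Love";
        tally(text, "happy", "nice", "good", "joy", "love").forEach((word, n) ->
                System.out.println("The word " + word + " was used " + n + " times."));
    }
}
```

For the article text in the question this reports 1, 2, 3, 5 and 5 for happy, nice, good, joy and love. Text from other parts of a real page (navigation, footer) would be counted too if it were included in the input.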

Answer 2

This removes special characters and keeps only alphabetic characters. For example, <>Hello<> becomes Hello.

String alphaOnly = input.replaceAll("[^a-zA-Z]+","");
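Note that this one-liner works for a token such as <>Hello<>, but a token like <p>happy</p> would become phappyp, and applying it to a whole document would fuse all the words together. If the task has to be solved with string operations only, one possible variant is to replace markup with spaces first and split afterwards. The sketch below is an assumption-laden illustration, not the answer author's code: TagStripper, stripTags and count are made-up names, and the regular expressions handle only simple markup (for instance, they do not decode entities such as &amp; and can be confused by a > inside an attribute value).

```java
class TagStripper {
    // Replaces script/style blocks, comments and tags with spaces so that
    // neighbouring words are not glued together.
    static String stripTags(String html) {
        return html
                .replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ")
                .replaceAll("(?s)<!--.*?-->", " ")
                .replaceAll("<[^>]*>", " ");
    }

    // Counts case-insensitive occurrences of word in the remaining text.
    static int count(String html, String word) {
        int n = 0;
        for (String token : stripTags(html).split("[^a-zA-Z]+")) {
            if (word.equalsIgnoreCase(token)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String html = "<html><head><script>var s = 'happy happy';</script></head>"
                + "<body><!-- nice --><p>happy</p> <p>nice nice</p><p>Good</p></body></html>";
        for (String word : new String[] {"happy", "nice", "good"}) {
            System.out.println("The word " + word + " was used " + count(html, word) + " times.");
        }
    }
}
```

Regular expressions are not a full HTML parser, so a library such as jsoup remains the more robust choice for arbitrary pages.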

2 answers

Answer 1
2 2021-05-28 19:00:42

Answer 2
0 Accepted 2021-05-28 18:53:42
