
How do I count the number of words (text) in an HTML source

I have some HTML documents for which I need to return the number of words in the document. This count should only include actual text (so no HTML tags, e.g. html, br, etc.).

Any ideas how to do this? Naturally, I would prefer to re-use some code.

Thanks,

Assaf

  • Strip out the HTML tags and get the text content; reuse Jsoup for this

  • Read the file line by line, keep a Map<String, Integer> wordToCountMap, and update the map as you read through
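
The second step above can be sketched roughly as follows. This is a minimal illustration, not the answerer's own code: the class name WordFrequency and the helper totalWords are hypothetical, the input is assumed to be text whose tags have already been stripped, and splitting on whitespace and lower-casing the tokens are assumptions about what counts as a "word".

```java
import java.io.BufferedReader;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, for illustration only.
final class WordFrequency {

    // Reads tag-free text line by line and tallies each whitespace-separated token.
    static Map<String, Integer> countWords(BufferedReader reader) {
        Map<String, Integer> wordToCountMap = new TreeMap<>();
        reader.lines().forEach(line -> {
            for (String token : line.trim().split("\\s+")) {
                // A blank line produces a single empty token; skip it.
                if (!token.isEmpty()) {
                    // Assumption: "The" and "the" are the same word.
                    wordToCountMap.merge(token.toLowerCase(Locale.ROOT), 1, Integer::sum);
                }
            }
        });
        return wordToCountMap;
    }

    // The total word count is the sum of the per-word counts.
    static int totalWords(Map<String, Integer> wordToCountMap) {
        int total = 0;
        for (int count : wordToCountMap.values()) {
            total += count;
        }
        return total;
    }
}
```

For a file on disk, the reader could come from java.nio.file.Files.newBufferedReader; if only the total is needed, a plain counter would do and the map could be dropped.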

Solution with jsoup

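import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Counts whitespace-separated tokens in the visible text of an HTML string.
private int countWords(String html) {
    Document dom = Jsoup.parse(html);
    String text = dom.text().trim();  // text() drops the tags and normalizes whitespace
    if (text.isEmpty()) {
        return 0;                     // splitting "" would otherwise yield one empty token
    }
    return text.split("\\s+").length;
}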

I would add an extra step to Jigar's answer:

  • Parse out the document text using JSoup, Jericho or Dom4j
  • Tokenise the resulting text. This depends on your definition of a "word". It is unlikely to be as simple as splitting on whitespace, and you'll need to deal with punctuation etc. So take a look at the various tokenisers available, e.g. from the Lucene or Stanford NLP projects. Here are some simple examples you will encounter:

    "Today I'm going to New York!" - Is "I'm" one word or two? What about "New York"?

    "We applied two meta-filters in the analysis" - Is "meta-filters" one word or two?

And what about badly formatted text, e.g. a missing space at the end of a sentence:

"So we went there.And on arrival..."

Tokenising is tricky...

  • Iterate through your tokens and count them up, e.g. using a HashMap.
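
To make the tokenising and counting steps concrete, here is a rough sketch that uses only the Java standard library rather than Lucene or Stanford NLP. The class SimpleTokenizer and its word rule are assumptions chosen for illustration: a "word" is taken to be a run of letters or digits, optionally joined by internal apostrophes or hyphens. Under that rule "I'm" and "meta-filters" each count as one word and "New York" as two; a different definition would need a different pattern or a real tokeniser.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokeniser, for illustration only.
final class SimpleTokenizer {

    // Assumed word rule: letters/digits, optionally joined by ' or -
    // (so "I'm" and "meta-filters" each stay in one piece).
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:['-][\\p{L}\\p{N}]+)*");

    // Returns the words found in the text, in order of appearance.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    // Counts how often each token occurs, using a HashMap as suggested above.
    static Map<String, Integer> countTokens(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```

With this rule, "So we went there.And on arrival..." yields seven tokens, because the full stop separates "there" from "And"; a plain whitespace split would report six, treating "there.And" as a single word.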
