
Fetching word count in a web page

This must be a very common question, but I have not come across any concrete or stable solution for it.

I just want to fetch the number of words in a web page, and it has to work across all browsers. My current implementation is:

var body = top.document.body;
if(body) {
    var content = body.innerText || body.textContent;
    content = content.replace(/\n/ig,' ');
    content = content.replace(/\s+/gi,' ');
    content = content.replace(/(^\s|\s$)/gi,'');
    if(!body.innerText) {
        content = content.replace(/<script/gi,'');
    }
    console.log(content);
    console.log(content.split(' ').length);
}

This works well, but not in some Firefox versions, because innerText is not supported there.

If I use textContent, the result also includes the contents of script tags, if present. For example, if the web page content is:

<body>
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
    <script type="text/javascript"> 
    console.log('Hellow World');
    var some = "some";
    var two = "two";
    var three = "three";
    </script>

    <h1 style="text-align:center">Static content from Nginx</h1>
    <div>
        This is a 
            static.
            <div>
                This is a 
                    static.
            </div>
    </div>
</body>

Then textContent will also contain the JS code, which gives me a wrong word count.

What is a concrete solution that works in any environment?

PS: No jQuery

OK, you have two problems there:

Cross-browser innerText

I'd go with:

var text = document.body[('innerText' in document.body) ? 'innerText' : 'textContent'];

That is, prefer innerText over textContent whenever the browser supports it.
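A small sketch of why the `in` check differs from the `||` fallback used in the question: `||` also falls back to textContent when innerText exists but is an empty string. The helper name `readText` is made up for this illustration, and plain objects stand in for DOM elements so the snippet runs without a browser.

```javascript
// Picks innerText when the property exists at all, otherwise textContent.
function readText(el) {
    return ('innerText' in el) ? el.innerText : el.textContent;
}

// Stand-ins for DOM elements (assumption: real elements behave the same
// way with respect to the `in` operator).
var modernLike = { innerText: '', textContent: 'var some = "some";' };
var legacyLike = { textContent: 'Static content from Nginx' };

console.log(readText(modernLike)); // '' (innerText is kept even though it is empty)
console.log(readText(legacyLike)); // 'Static content from Nginx'
```

In a page, the call would be readText(document.body).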

Stripping <script> tags from the result

dandavis offers a neat solution to that:

function noscript(strCode){
    var html = $(strCode.bold()); 
    html.find('script').remove();
    return html.html();
}

And a non-jQuery solution:

function noscript(strCode){
    // [\s\S] also matches line breaks, so multi-line scripts are removed as well
    return strCode.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, '');
}

The jQuery version turns the string into a "fake" HTML fragment, strips its script tags and returns the remaining markup. The regex version removes the script blocks directly from the string.

Of course, you can extend the function to also remove <style> tags and others.
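One way that extension might look, as a rough sketch: the function name `stripElements` and its `tagNames` parameter are invented here, and the tag names are assumed to be plain letters and digits (they are not regex-escaped). A regex is not an HTML parser, so unusual markup, such as a ">" inside an attribute value, can still slip through.

```javascript
// Removes whole <tag>...</tag> blocks for each of the given tag names.
function stripElements(html, tagNames) {
    var result = String(html);
    for (var i = 0; i < tagNames.length; i++) {
        var name = tagNames[i];
        // Opening tag with optional attributes, then the shortest possible
        // body (including line breaks), then the closing tag.
        var pattern = new RegExp(
            '<' + name + '\\b[^>]*>[\\s\\S]*?<\\/' + name + '\\s*>', 'gi');
        // Replace with a space so neighbouring words do not merge.
        result = result.replace(pattern, ' ');
    }
    return result;
}

var html = '<p>Hello</p><script type="text/javascript">\n' +
           'var a = "x";\n</script><style>p { color: red; }</style><p>World</p>';
console.log(stripElements(html, ['script', 'style']));
```

Note that this only does something useful on a string that still contains markup, such as innerHTML; innerText and textContent no longer contain any tags.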

Counting words

Your method does the job, but I think a simple regex would do it much better. You can count the words in a string using:

(str.match(/\S+/g) || []).length; // match() returns null when there are no words
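To make the difference from the split-based count concrete, here is a small sketch. The helper names `countWords` and `countBySplit` are made up for the comparison; `countBySplit` mirrors the split(' ') call from the question without its whitespace clean-up.

```javascript
// Counts runs of non-whitespace characters.
function countWords(str) {
    var words = String(str).match(/\S+/g);
    // match() returns null when nothing matches, e.g. for '' or '   '
    return words ? words.length : 0;
}

// The split-based count, for comparison.
function countBySplit(str) {
    return str.split(' ').length;
}

var sample = '  This is a \n static. ';
console.log(countWords(sample));   // 4
console.log(countBySplit(sample)); // larger, because empty pieces are counted too
console.log(countWords(''));       // 0
console.log(countBySplit(''));     // 1
```

The regex needs no prior normalisation of line breaks or repeated spaces, and an empty page yields 0 instead of 1.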

Finally

The final result should look like:

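var body = top.document.body;
if(body) {
    var content = body[('innerText' in body) ? 'innerText' : 'textContent'];
    // noscript() only has an effect if the string still contains <script> markup
    content = noscript(content);
    // match() returns null when there are no words at all
    alert((content.match(/\S+/g) || []).length);
}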

What about hidden/invisible/overlaid blocks? Do you want to count the words inside all of them? What about images (the alt attribute of an image)?

If you want to count everything, just strip the tags and count the text of all the remaining blocks. Something like $('body :not(script)').text()
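Since the question rules out jQuery, the same idea can be sketched in plain JavaScript by walking the node tree and skipping elements whose text is not page content. The names `collectText`, `countPageWords` and `SKIPPED` are invented for this sketch. It relies only on the standard nodeType, nodeName, nodeValue and childNodes properties. It does not detect hidden elements (for example display: none), and because the pieces are joined with a space, a word split across inline elements is counted as two.

```javascript
var ELEMENT_NODE = 1; // Node.ELEMENT_NODE in the DOM standard
var TEXT_NODE = 3;    // Node.TEXT_NODE in the DOM standard

// Elements whose text should not be counted as page content.
var SKIPPED = { SCRIPT: true, STYLE: true, NOSCRIPT: true };

// Recursively gathers the text below `node`, skipping the elements above.
function collectText(node) {
    if (node.nodeType === TEXT_NODE) {
        return node.nodeValue;
    }
    if (node.nodeType !== ELEMENT_NODE) {
        return ''; // comments, processing instructions, ...
    }
    var name = String(node.nodeName).toUpperCase();
    if (Object.prototype.hasOwnProperty.call(SKIPPED, name)) {
        return '';
    }
    var parts = [];
    var children = node.childNodes || [];
    for (var i = 0; i < children.length; i++) {
        parts.push(collectText(children[i]));
    }
    // Join with a space so words in adjacent blocks do not run together.
    return parts.join(' ');
}

function countPageWords(root) {
    var words = collectText(root).match(/\S+/g);
    return words ? words.length : 0;
}
```

In a browser this would be called as countPageWords(document.body). For the sample page in the question it counts the heading and the two "This is a static." blocks, but not the script contents.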

Thank you so much for such helpful answers. I found the approach below for the case where innerText is not defined in a browser. The result it produces is very similar to innerText, so I think it will be consistent across all browsers.

Please look into it and let me know whether this answer can be accepted, and whether you find any discrepancy in the method I am using.

function getWordCount() {
    try {
        var body = top.document.querySelector("body");
        if (body) {
            var content = body.innerText || getInnerText(top.document.body, top);
            // \S+ already treats line breaks as separators;
            // match() returns null for text without any words
            var words = content.match(/\S+/g);
            return words ? words.length : 0;
        }
    } catch (e) {
        processError("getWordCount", e);
    }
}


function getInnerText(el, win) {
    try {
        win = win || window;
        var doc = win.document,
            sel, range, prevRange, selString;
        if (win.getSelection && doc.createRange) {
            sel = win.getSelection();
            if (sel.rangeCount) {
                prevRange = sel.getRangeAt(0);
            }
            range = doc.createRange();
            range.selectNodeContents(el);
            sel.removeAllRanges();
            sel.addRange(range);
            selString = sel.toString();
            sel.removeAllRanges();
            prevRange && sel.addRange(prevRange);
        } else if (doc.body.createTextRange) {
            // Legacy IE: read the text straight from the TextRange
            range = doc.body.createTextRange();
            range.moveToElementText(el);
            selString = range.text;
        }
        return selString;
    } catch (e) {
        processError('getInnerText', e);
    }
}

The result I am getting is the same as that of innerText, and it is more accurate than using a regex or removing tags.

Please give me your views on this.
