
Python: how to count the number of words in HTML line by line

I want to perform simple tokenization to count the number of words in HTML line by line, excluding the words inside <a> tags; the words inside <a> tags should be counted separately.

Can NLTK do this, or is there another library that can?

For example, this is the HTML code:

<div class="side-article txt-article">
<p><strong>BATAM.TRIBUNNEWS.COM, BINTAN</strong> - Tradisi pedang pora mewarnai serah terima jabatan pejabat di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> <a href="http://batam.tribunnews.com/tag/bintan/" title="Bintan">Bintan</a>, Senin (3/10/2016).</p>
<p>Empat perwira baru Senin itu diminta cepat bekerja. Tumpukan pekerjaan rumah sudah menanti di meja masing masing.</p>
<p>Para pejabat tersebut yakni AKP Adi Kuasa Tarigan, Kasat Reskrim baru yang menggantikan AKP Arya Tesa Brahmana. Arya pindah sebagai Kabag Ops di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> Tanjungpinang.</p>

and I want the output to be:

WordsCount : 0 LinkWordsCount : 0
WordsCount : 21 LinkWordsCount : 2
WordsCount : 19 LinkWordsCount : 0
WordsCount : 25 LinkWordsCount : 2

WordsCount is the number of words in each line, excluding the text inside <a> tags. If a word appears twice, it is counted twice. LinkWordsCount is the number of words inside <a> tags.

So how do I count the words line by line, excluding the <a> tags, and count the words inside <a> tags separately?

Thank you.

Iterate over each line of the raw HTML and search for links in each line.

In the example below, I use a very naive way of getting the word count: split the line's text on whitespace (so "-" is counted as a word and BATAM.TRIBUNNEWS.COM counts as a single word). Note that WordsCount here covers all the text on the line, including the link text.

from bs4 import BeautifulSoup

html = """
<div class="side-article txt-article">
<p><strong>BATAM.TRIBUNNEWS.COM, BINTAN</strong> - Tradisi pedang pora mewarnai serah terima jabatan pejabat di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> <a href="http://batam.tribunnews.com/tag/bintan/" title="Bintan">Bintan</a>, Senin (3/10/2016).</p>
<p>Empat perwira baru Senin itu diminta cepat bekerja. Tumpukan pekerjaan rumah sudah menanti di meja masing masing.</p>
<p>Para pejabat tersebut yakni AKP Adi Kuasa Tarigan, Kasat Reskrim baru yang menggantikan AKP Arya Tesa Brahmana. Arya pindah sebagai Kabag Ops di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> Tanjungpinang.</p>
"""

for line in html.strip().split('\n'):
    # parse each line on its own so the counts are per line
    line_soup = BeautifulSoup(line.strip(), 'html.parser')

    # count the words inside <a> tags
    link_words = 0
    for link in line_soup.find_all('a'):
        link_words += len(link.text.split())

    # naive way to get the word count: split all of the line's text on whitespace
    words_count = len(line_soup.text.split())
    print('WordsCount : {0} LinkWordsCount : {1}'
          .format(words_count, link_words))

Output:

WordsCount : 0 LinkWordsCount : 0
WordsCount : 16 LinkWordsCount : 2
WordsCount : 17 LinkWordsCount : 0
WordsCount : 25 LinkWordsCount : 1
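
The output above includes link text in WordsCount. If link words should be left out of WordsCount, as the question asks, one possible approach is to track whether the parser is currently inside an <a> element. The following is only a sketch under assumptions: it swaps BeautifulSoup for the standard library's html.parser so that it has no dependencies, and the names WordCounter and count_line are made up for this illustration.

```python
from html.parser import HTMLParser


class WordCounter(HTMLParser):
    """Hypothetical helper: counts words inside and outside <a> separately."""

    def __init__(self):
        super().__init__()
        self.a_depth = 0     # greater than 0 while inside an <a> element
        self.words = 0       # whitespace-separated words outside <a>
        self.link_words = 0  # whitespace-separated words inside <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.a_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.a_depth > 0:
            self.a_depth -= 1

    def handle_data(self, data):
        # same naive tokenization as above: split on whitespace
        n = len(data.split())
        if self.a_depth:
            self.link_words += n
        else:
            self.words += n


def count_line(line):
    """Return (words outside <a>, words inside <a>) for one line of HTML."""
    parser = WordCounter()
    parser.feed(line)
    parser.close()  # flush any text still buffered by the parser
    return parser.words, parser.link_words


sample = ('<p>Arya pindah sebagai Kabag Ops di '
          '<a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a>'
          ' Tanjungpinang.</p>')
for line in sample.split('\n'):
    words, link_words = count_line(line)
    print('WordsCount : {0} LinkWordsCount : {1}'.format(words, link_words))
```

Because each text chunk is split on its own, punctuation that directly follows a link (such as the comma after "Bintan" in the question's HTML) ends up as a separate token and is counted as a word. Filtering out tokens without any letters or digits would be one way to avoid that.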

EDIT

If you want to read the HTML from a file, use something like this:

with open(path_to_html_file, 'r') as f:
    html = f.read()

I would suggest trying regular expressions, which in Python means the re module.

To count link words, use a regex that matches href=, like this one

A regex will also help you find the words that are not inside < >; by splitting them on spaces you get a list, and calling len on it gives you the number of words.

That is the path I would take.
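
A rough sketch of what that could look like, with the caveat that regular expressions are not a real HTML parser and this only holds up for simple, well-formed markup where each <a>...</a> pair sits on one line. Instead of matching href=, it captures the text between <a ...> and </a>. The patterns and the function name count_line_regex are assumptions for illustration, not something taken from the answer.

```python
import re

# text between an opening <a ...> and its closing </a> (non-greedy)
A_TAG = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)
# any remaining tag, e.g. <p>, </strong>
ANY_TAG = re.compile(r"<[^>]+>")


def count_line_regex(line):
    """Return (words outside <a>, words inside <a>) for one line of HTML."""
    # words inside links: strip nested tags from each link body, then split
    link_words = sum(len(ANY_TAG.sub(" ", body).split())
                     for body in A_TAG.findall(line))

    # words outside links: drop whole <a>...</a> spans, then the other tags
    without_links = A_TAG.sub(" ", line)
    words = len(ANY_TAG.sub(" ", without_links).split())
    return words, link_words


line = ('<p>Arya pindah di '
        '<a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a>'
        ' Tanjungpinang.</p>')
print('WordsCount : {0} LinkWordsCount : {1}'.format(*count_line_regex(line)))
```

Tags are replaced with a space rather than removed, so text on either side of a tag is never glued into one word. The trade-off is that punctuation right after a closing tag becomes its own token. HTML entities such as &amp; are not decoded here.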
