I want to perform simple tokenization to count the number of words in HTML line by line, excluding the words inside <a> tags; the words inside <a> tags should be counted separately.
Can NLTK do this, or is there another library that can?
For example, this is the HTML code:
<div class="side-article txt-article">
<p><strong>BATAM.TRIBUNNEWS.COM, BINTAN</strong> - Tradisi pedang pora mewarnai serah terima jabatan pejabat di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> <a href="http://batam.tribunnews.com/tag/bintan/" title="Bintan">Bintan</a>, Senin (3/10/2016).</p>
<p>Empat perwira baru Senin itu diminta cepat bekerja. Tumpukan pekerjaan rumah sudah menanti di meja masing masing.</p>
<p>Para pejabat tersebut yakni AKP Adi Kuasa Tarigan, Kasat Reskrim baru yang menggantikan AKP Arya Tesa Brahmana. Arya pindah sebagai Kabag Ops di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> Tanjungpinang.</p>
And I want the output to be:
WordsCount : 0 LinkWordsCount : 0
WordsCount : 21 LinkWordsCount : 2
WordsCount : 19 LinkWordsCount : 0
WordsCount : 25 LinkWordsCount : 2
WordsCount is the number of words in each line, excluding the text inside <a> tags. If a word appears twice, it is counted twice. LinkWordsCount is the number of words inside <a> tags.
So how can I count the words line by line, excluding the <a> tags, and count the words inside <a> tags separately?
Thank you.
Iterate over each line of the raw HTML and search for links in each line.
In the example below, I am using a very naive way of getting the word count: splitting the line on whitespace (this way, - is counted as a word and BATAM.TRIBUNNEWS.COM counts as a single word).
from bs4 import BeautifulSoup
html = """
<div class="side-article txt-article">
<p><strong>BATAM.TRIBUNNEWS.COM, BINTAN</strong> - Tradisi pedang pora mewarnai serah terima jabatan pejabat di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> <a href="http://batam.tribunnews.com/tag/bintan/" title="Bintan">Bintan</a>, Senin (3/10/2016).</p>
<p>Empat perwira baru Senin itu diminta cepat bekerja. Tumpukan pekerjaan rumah sudah menanti di meja masing masing.</p>
<p>Para pejabat tersebut yakni AKP Adi Kuasa Tarigan, Kasat Reskrim baru yang menggantikan AKP Arya Tesa Brahmana. Arya pindah sebagai Kabag Ops di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> Tanjungpinang.</p>
"""
for line in html.strip().split('\n'):
    link_words = 0

    # parse each line on its own so the counts are per line
    line_soup = BeautifulSoup(line.strip(), 'html.parser')
    for link in line_soup.find_all('a'):
        link_words += len(link.text.split())

    # naive way to get words count: all text in the line, link text included
    words_count = len(line_soup.text.split())

    print('WordsCount : {0} LinkWordsCount : {1}'
          .format(words_count, link_words))
Output:
WordsCount : 0 LinkWordsCount : 0
WordsCount : 16 LinkWordsCount : 2
WordsCount : 17 LinkWordsCount : 0
WordsCount : 25 LinkWordsCount : 1
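Note that the naive count above takes all the text in a line, so link words are also included in WordsCount. If they should be left out, as the question asks, one option is to track whether the parser is currently inside an <a> element. The following is only a sketch under a few assumptions: it uses the standard library's html.parser instead of BeautifulSoup to avoid an extra dependency, the names LinkAwareCounter, count_words and count_line are made up for illustration, and a "word" is taken to be any whitespace-separated token containing at least one letter or digit (so a lone - or , is not counted).

```python
from html.parser import HTMLParser


def count_words(text):
    # Assumption: a word is a whitespace-separated token that contains
    # at least one letter or digit, so stray punctuation is skipped.
    return sum(1 for tok in text.split() if any(c.isalnum() for c in tok))


class LinkAwareCounter(HTMLParser):
    """Hypothetical helper: counts words inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # greater than 0 while inside an <a> element
        self.words = 0       # words outside links
        self.link_words = 0  # words inside links

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'a' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.link_words += count_words(data)
        else:
            self.words += count_words(data)


def count_line(line):
    # Parse one line of HTML and return (words, link_words).
    parser = LinkAwareCounter()
    parser.feed(line)
    parser.close()
    return parser.words, parser.link_words


sample = ('<p>pejabat di <a href="http://batam.tribunnews.com/tag/polres/" '
          'title="Polres">Polres</a> <a href="http://batam.tribunnews.com/tag/bintan/" '
          'title="Bintan">Bintan</a>, Senin (3/10/2016).</p>')
words, link_words = count_line(sample)
print('WordsCount : {0} LinkWordsCount : {1}'.format(words, link_words))
```

To use it on the full document, call count_line(line.strip()) for each line of html.strip().split('\n'). Because the text is counted chunk by chunk between tags, a word that is split by a tag (for example foo<b>bar</b>) would be counted as two words.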
EDIT
If you want to read the HTML from a file, use something like this:
with open(path_to_html_file, 'r') as f:
    html = f.read()
I would suggest trying regular expressions, available in Python as the re module.
To count link words, use a regex that matches href=.
Regex will also help you find the words that are not inside < >; by splitting them on spaces you get a list, and taking its len gives the number of words.
That is the path I would take.
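As a rough illustration of the regex route, here is a minimal sketch. The answer above does not spell out its patterns, so the ones below are assumptions: instead of matching href= they match whole <a>...</a> pairs so that the link text can be counted, and the name count_with_regex is made up for this example. Regular expressions are fragile on real-world HTML (nested or malformed tags, a > inside an attribute value), so treat this as suitable only for simple, well-formed lines.

```python
import re

# Assumed patterns, not taken from the answer above.
LINK_RE = re.compile(r'<a\b[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r'<[^>]+>')


def count_with_regex(line):
    # Text inside each <a>...</a> pair, with any nested tags stripped.
    link_texts = [TAG_RE.sub(' ', text) for text in LINK_RE.findall(line)]
    link_words = sum(len(text.split()) for text in link_texts)

    # Remove whole links first, then the remaining tags, and split on whitespace.
    rest = TAG_RE.sub(' ', LINK_RE.sub(' ', line))
    return len(rest.split()), link_words


line = ('<p>Arya pindah sebagai Kabag Ops di '
        '<a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> '
        'Tanjungpinang.</p>')
words, link_words = count_with_regex(line)
print('WordsCount : {0} LinkWordsCount : {1}'.format(words, link_words))
```

The word count here is the same naive whitespace split used in the first answer, so a standalone - or , still counts as a word.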