Python: how to count the number of words in HTML line by line
I want to perform simple tokenization to count the number of words in HTML line by line, excluding the words between <a> tags; the words between <a> tags should be counted separately.
Can NLTK do this, or is there another library that can?
For example, this is the HTML code:
<div class="side-article txt-article">
<p><strong>BATAM.TRIBUNNEWS.COM, BINTAN</strong> - Tradisi pedang pora mewarnai serah terima jabatan pejabat di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> <a href="http://batam.tribunnews.com/tag/bintan/" title="Bintan">Bintan</a>, Senin (3/10/2016).</p>
<p>Empat perwira baru Senin itu diminta cepat bekerja. Tumpukan pekerjaan rumah sudah menanti di meja masing masing.</p>
<p>Para pejabat tersebut yakni AKP Adi Kuasa Tarigan, Kasat Reskrim baru yang menggantikan AKP Arya Tesa Brahmana. Arya pindah sebagai Kabag Ops di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> Tanjungpinang.</p>
And I want the output to be:
WordsCount : 0 LinkWordsCount : 0
WordsCount : 21 LinkWordsCount : 2
WordsCount : 19 LinkWordsCount : 0
WordsCount : 25 LinkWordsCount : 2
WordsCount is the number of words in each line, excluding the text between <a> tags. If a word appears twice, it is counted twice. LinkWordsCount is the number of words between <a> tags.
So how can I count the words line by line, excluding the <a> tags, while counting the words between <a> tags separately?
Thank you.
Iterate over each line of the raw HTML and search for links in each line.
In the example below, I am using a very naive way of getting the word count: splitting the line on whitespace (this way, - is counted as a word and BATAM.TRIBUNNEWS.COM counts as a single word).
from bs4 import BeautifulSoup
html = """
<div class="side-article txt-article">
<p><strong>BATAM.TRIBUNNEWS.COM, BINTAN</strong> - Tradisi pedang pora mewarnai serah terima jabatan pejabat di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> <a href="http://batam.tribunnews.com/tag/bintan/" title="Bintan">Bintan</a>, Senin (3/10/2016).</p>
<p>Empat perwira baru Senin itu diminta cepat bekerja. Tumpukan pekerjaan rumah sudah menanti di meja masing masing.</p>
<p>Para pejabat tersebut yakni AKP Adi Kuasa Tarigan, Kasat Reskrim baru yang menggantikan AKP Arya Tesa Brahmana. Arya pindah sebagai Kabag Ops di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> Tanjungpinang.</p>
"""
for line in html.strip().split('\n'):
    # Parse each line on its own so the counts are per line
    line_soup = BeautifulSoup(line.strip(), 'html.parser')

    # Count the words inside <a> tags
    link_words = 0
    for link in line_soup.find_all('a'):
        link_words += len(link.text.split())

    # Naive way to get the word count: split the visible text on whitespace.
    # Note that this total still includes the words inside <a> tags.
    words_count = len(line_soup.text.split())
    print('WordsCount : {0} LinkWordsCount : {1}'
          .format(words_count, link_words))
Output:
WordsCount : 0 LinkWordsCount : 0
WordsCount : 16 LinkWordsCount : 2
WordsCount : 17 LinkWordsCount : 0
WordsCount : 25 LinkWordsCount : 1
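The WordsCount values above still include the words inside <a> tags, whereas the question asks for them to be counted separately. A minimal sketch of one way to separate the two counts follows. It swaps BeautifulSoup for the standard-library html.parser module so it runs without third-party packages. The names LinkAwareCounter and count_line are hypothetical, and the rule that a word must contain at least one letter or digit is an assumption, not something the question specifies.

```python
from html.parser import HTMLParser


class LinkAwareCounter(HTMLParser):
    """Counts words inside and outside <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.a_depth = 0      # > 0 while we are inside an <a> element
        self.words = 0        # words outside <a>
        self.link_words = 0   # words inside <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.a_depth += 1

    def handle_endtag(self, tag):
        if tag == 'a' and self.a_depth > 0:
            self.a_depth -= 1

    def handle_data(self, data):
        # Assumption: a "word" is a whitespace-separated token with at
        # least one letter or digit, so a lone "-" or "," is skipped.
        n = sum(1 for tok in data.split() if any(c.isalnum() for c in tok))
        if self.a_depth:
            self.link_words += n
        else:
            self.words += n


def count_line(line):
    """Return (words outside <a>, words inside <a>) for one line of HTML."""
    parser = LinkAwareCounter()
    parser.feed(line)
    parser.close()
    return parser.words, parser.link_words


line = '<p>Kabag Ops di <a href="#">Polres</a> Tanjungpinang.</p>'
print('WordsCount : {0} LinkWordsCount : {1}'.format(*count_line(line)))
# prints: WordsCount : 4 LinkWordsCount : 1
```

Because text is counted chunk by chunk between tags, a comma that directly follows a closing </a> arrives as its own chunk; the letter-or-digit rule keeps it from being counted as a word.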
EDIT
If you want to read the HTML from a file, use something like this:
with open(path_to_html_file, 'r') as f:
    html = f.read()
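Since the counting is done per line, the file can also be consumed one line at a time instead of being read in full. The sketch below is one possible way to do that. The helper name iter_html_lines is hypothetical, and both the UTF-8 default and the decision to skip blank lines are assumptions.

```python
def iter_html_lines(path, encoding='utf-8'):
    """Yield each non-blank line of the file with surrounding whitespace removed."""
    with open(path, 'r', encoding=encoding) as f:
        for raw in f:
            line = raw.strip()
            if line:  # assumption: blank lines are not worth reporting
                yield line
```

Each yielded line can then be passed to whatever per-line counting code is in use.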
I would suggest trying regular expressions, which in Python means the re module.
To count the link words, use a regex that matches href=.
A regex will also help you find the words that are not inside < >; by splitting them on spaces you get a list, and calling len on it gives the number of words.
That is the path I would take.
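As a rough illustration of the regex route, here is a minimal sketch that uses only the re module. The patterns and the function name count_with_regex are assumptions made for illustration, not code from this answer. Regular expressions are fragile on real-world HTML (nested tags, or a ">" inside an attribute value, will confuse them), so this is probably only suitable for simple, well-formed lines.

```python
import re

# Assumed patterns: an <a ...>...</a> element, and any single tag
A_TAG = re.compile(r'<a\b[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r'<[^>]+>')


def count_with_regex(line):
    """Return (words outside <a>, words inside <a>) for one line of HTML."""
    # Text of every link on the line, with any inner tags stripped
    link_texts = A_TAG.findall(line)
    link_words = sum(len(ANY_TAG.sub(' ', text).split()) for text in link_texts)

    # Drop whole <a> elements, then the remaining tags, then split on whitespace
    without_links = A_TAG.sub(' ', line)
    words = len(ANY_TAG.sub(' ', without_links).split())
    return words, link_words


print(count_with_regex('<p>one two <a href="x">three four</a> five</p>'))
# prints: (3, 2)
```

This version splits purely on whitespace, so punctuation-only tokens such as "-" are counted as words.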