How to get a specific word from an HTML page using Beautiful Soup in Python

I have to extract specific words from an HTML page and count how many times each word occurs. How do I do this using Beautiful Soup in Python? How do I pass the URL to the soup and then count the words?

This is my code so far. I have no idea what to do next.

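import bs4 as bs
import urllib.request

# Download the raw HTML of the page.
source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()

# Parse it with the lxml parser (requires the lxml package).
soup = bs.BeautifulSoup(source, 'lxml')

# Print the text of every <p> element.
for paragraph in soup.find_all('p'):
    print(paragraph.string)  # None when the tag has several child nodes
    print(paragraph.text)    # all text inside the tag, including nested tags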

You can get all the text in the page using:

soup.get_text()

After assigning that to a variable, you can use the .count() method to find the number of times a certain string appears in the HTML page. For example:

text = soup.get_text()
print(text.count('word'))

To make sure you aren't counting words inside other words, you could split the text on whitespace and then compare each item in the resulting list with the word you want. This fixes cases such as 'house' being counted inside 'houses'.
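The whole-word idea can be sketched as follows. This is only an illustration under a few assumptions: SAMPLE_HTML and count_word are hypothetical names made up for the example, the built-in "html.parser" stands in for lxml so nothing extra needs installing, and re.findall is used in place of a plain split() because splitting on spaces alone would leave punctuation attached (for example 'house,'). Matching is also made case-insensitive here, which the answer does not ask for.

```python
import re
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of a live URL.
SAMPLE_HTML = """
<html><body>
<p>The house is old. Two houses stand nearby.</p>
<p>Another House, and a greenhouse.</p>
</body></html>
"""


def count_word(html, word):
    """Count whole-word, case-insensitive occurrences of `word` in the page text."""
    # "html.parser" ships with Python; the question's code uses "lxml" instead.
    soup = BeautifulSoup(html, "html.parser")
    # A space separator keeps text from adjacent tags from running together.
    text = soup.get_text(separator=" ")
    # Pull out runs of word characters, lowercased, so that punctuation
    # such as "House," does not hide a match.
    words = re.findall(r"\w+", text.lower())
    return Counter(words)[word.lower()]


# Plain substring counting also matches "houses" and "greenhouse".
substring_count = BeautifulSoup(SAMPLE_HTML, "html.parser").get_text().count("house")
whole_word_count = count_word(SAMPLE_HTML, "house")
print(substring_count, whole_word_count)  # prints: 3 2
```

With a real page, the bytes returned by urllib.request.urlopen(...).read() in the question's code could be passed to count_word in place of SAMPLE_HTML.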
