How to get specific words from an HTML page using Beautiful Soup in Python

Question

I have to extract specific words from an HTML page and count how many times each word occurs. How do I do this using Beautiful Soup in Python? How do I pass the URL to the soup and then count the words?

This is my code so far. I have no idea what to do next.

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()

soup = bs.BeautifulSoup(source,'lxml')

for paragraph in soup.find_all('p'):
    print(paragraph.string)
    print(str(paragraph.text))

Answer 1

You could get all the text in the page using:

soup.get_text()
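As a minimal sketch of what that call returns, the snippet below parses a small made-up HTML string (an assumption standing in for the downloaded page) with the built-in "html.parser". It also shows why passing a separator can matter before counting words: without one, text from adjacent tags is joined with nothing in between.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (assumption for illustration)
html = "<html><body><p>The house is near other houses.</p><p>A house.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")  # built-in parser, so lxml is not required

# Default call: strings from neighbouring tags are concatenated directly
glued = soup.get_text()
# With a separator and strip=True, each piece is trimmed and joined by a space
spaced = soup.get_text(separator=" ", strip=True)

print(glued)   # The house is near other houses.A house.
print(spaced)  # The house is near other houses. A house.
```
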

After assigning that to a variable, you can use the .count() method to find how many times a certain string appears in the page text. For example:

text = soup.get_text()
print(text.count('word'))
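Two properties of str.count() are worth keeping in mind: it matches substrings, and it is case-sensitive. The sketch below uses a made-up sentence as a stand-in for the string returned by soup.get_text().

```python
# Stand-in for the string returned by soup.get_text(); the sentence is made up
text = "The House is near other houses. A house."

# Case-sensitive substring match: hits "houses" and "house", skips "House"
print(text.count("house"))          # 2

# Lowercasing first also catches "House", but "houses" is still counted
print(text.lower().count("house"))  # 3
```
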

To make sure you aren't matching words inside other words, you could split the text on spaces and then compare each item of the resulting list against the word. For example, this avoids counting 'house' when it only appears inside 'houses'.
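A rough sketch of whole-word counting follows. The helper name count_word and the sample sentence are hypothetical, and it swaps the plain split on spaces for a regular expression, because a plain split leaves punctuation attached (so "house." would not equal "house").

```python
import re
from collections import Counter

def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # \w+ grabs runs of letters, digits, and underscores, dropping punctuation
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)[word.lower()]

# Stand-in for soup.get_text(); the sentence is made up
text = "The House is near other houses. A house."

print(count_word(text, "house"))    # 2 ("House" and "house", not "houses")
print(text.split().count("house"))  # 0, since split() yields "house." with the period
```

Note that \w+ also breaks contractions such as "don't" into two tokens, so the pattern may need adjusting depending on the words being counted.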

Answer 1
0 2017-11-05 11:13:44
