
How to get a specific word from an HTML page using Beautiful Soup in Python

I have to extract specific words from an HTML page and count the number of times each word is repeated. How do I do this using Beautiful Soup in Python? How do I pass the URL to the soup and then count the words?

This is my code so far. I have no idea what to do next.

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()

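soup = bs.BeautifulSoup(source, 'lxml')

for paragraph in soup.find_all('p'):
    # .string is None when a <p> contains more than one child node
    print(paragraph.string)
    # .text always returns the concatenated text of the tag
    print(paragraph.text)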

You could get all the text in the page using

soup.get_text()

After assigning that to a variable, you can use the string's .count() method to find the number of times a certain string appears in the page text, e.g.

text = soup.get_text()
print(text.count('word'))

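Note that .count() matches substrings and is case-sensitive. To make sure you aren't counting words inside other words, you could split the text on whitespace and then compare each item in the resulting list against the word. For example, this prevents 'house' from being counted when it appears inside 'houses'.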
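As a rough sketch of the whole-word idea, the snippet below uses a regular expression with word boundaries instead of splitting on spaces, since a plain split leaves punctuation attached (e.g. 'house.'). The helper name count_word and the sample sentence are made up for illustration; they are not part of the original question or of Beautiful Soup.

```python
import re

def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`.

    Hypothetical helper for illustration only.
    """
    # \b matches a word boundary, so 'house' does not match inside 'houses'.
    # re.escape guards against special regex characters in `word`.
    pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    return len(pattern.findall(text))

# Made-up sample text standing in for the output of soup.get_text()
sample = ("The house is big. Many houses and farmhouses are expensive, "
          "but this house is cheap.")

print(sample.count('house'))        # substring count: 4
print(count_word(sample, 'house'))  # whole-word count: 2
```

With the page from the question, this would be called as count_word(soup.get_text(), 'word'). Drop re.IGNORECASE if the match should be case-sensitive.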


 