I'm just beginning to dabble with Python, and as many have done, I am starting with a web-scraping example to try the language. What I am attempting is to gather every element of a certain tag type and return them as a list. For this I am using BeautifulSoup and requests. The site being used for this test is the blog for a small game called 'Staxel'.
I can get my code to output the first occurrence of the tag using soup.find and print, but when I change the code to the version below I get warnings about printing a list as a fixed variable.
Can someone please indicate what I should be using for this?
# import libraries
import requests
import ssl
from bs4 import BeautifulSoup
# set the URL string
quote_page = 'https://blog.playstaxel.com'
# query the website and return the html to give us a 'page' variable
page = requests.get(quote_page)
# parse the html using beautiful soup and store in a variable ... 'soup'
soup = BeautifulSoup(page.content, 'lxml')
# find the 'h1' elements with class 'entry-title' and get their text
name_box = soup.find_all('h1',attrs={'class':'entry-title'})
name = name_box.text.strip() # strip() is used to remove leading and trailing whitespace
print ("Title {}".format(name))
By using .find_all(), you're creating a list of all occurrences of h1. You simply need to wrap your print statement in a for loop. Your code with that structure looks like:
# import libraries
import requests
from bs4 import BeautifulSoup
# set the URL string
quote_page = 'https://blog.playstaxel.com'
# query the website and return the html to give us a 'page' variable
page = requests.get(quote_page)
# parse the html using beautiful soup and store in a variable ... 'soup'
soup = BeautifulSoup(page.content, 'lxml')
# find every 'h1' element with class 'entry-title'
name_box = soup.find_all('h1', attrs={'class': 'entry-title'})
# find_all returns a list of tags, so print each title in turn
for name in name_box:
    print("Title {}".format(name.text.strip()))
Output:
Title Magic update – feature preview
Title New Years
Title Staxel Changelog for 1.3.52
Title Staxel Changelog for 1.3.49
Title Staxel Changelog for 1.3.48
Title Halloween Update & GOG
Title Staxel Changelog for 1.3.44
Title Staxel Changelog for 1.3.42
Title Staxel Changelog for 1.3.40
Title Staxel Changelog for 1.3.34 to 1.3.39
That's because soup.find_all returns a list of tags, not a single tag like soup.find, so .text cannot be called on the result directly.
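To make the difference concrete, here is a minimal sketch that runs against a small inline HTML string instead of the live blog. The markup and the post titles are made up for illustration, and the built-in 'html.parser' is used only so the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the blog page, so no network request is needed.
html = """
<h1 class="entry-title"> First post </h1>
<h1 class="entry-title"> Second post </h1>
<h1 class="other">Not a post title</h1>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if nothing matches).
first = soup.find("h1", attrs={"class": "entry-title"})

# find_all() returns a list-like collection of every matching tag.
every = soup.find_all("h1", attrs={"class": "entry-title"})

# A single tag has .text, so this works.
print(first.text.strip())

# The collection has no .text of its own; read it from each tag instead.
titles = [tag.text.strip() for tag in every]
print(titles)
```

Calling every.text here would fail in the same way as the code in the question, because the attribute belongs to the individual tags rather than to the collection.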
The snippets below should avoid the error and print any titles found in Python 2.7 and 3.*:
Python 3.*:
name_box = soup.find_all('h1',attrs={'class':'entry-title'})
titles = [name.text.strip() for name in name_box]  # loop over results and strip extra whitespace
for title in titles:  # loop over titles and print
    print("Title {}".format(title))
Python 2.7:
name_box = soup.find_all('h1',attrs={'class':'entry-title'})
titles = [name.text.strip() for name in name_box]  # loop over results and strip extra whitespace
for title in titles:  # loop over titles and print
    print("Title {}".format(title.encode('utf-8')))
As mentioned in the comments by @Vantagilt, the output for him had a 'b' added before the string. This is due to differences in the way strings are handled between Python 2.7 and Python 3. Here's a good blog on the subject.
The main point is that strings are Unicode by default in Python 3, so the encode part can be dropped. In Python 2.7 the default string type holds bytes, so the Unicode title needs to be encoded explicitly, or else we will see errors like:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 13: ordinal not in range(128)
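The 'b' prefix is easy to reproduce on Python 3 without any scraping. The sketch below uses the first title from the output above, which contains an en dash (U+2013) at position 13; the variable names are only illustrative.

```python
# Python 3: str is Unicode text, and .encode() turns it into a bytes object.
title = "Magic update \u2013 feature preview"  # contains an en dash (U+2013)

encoded = title.encode("utf-8")

# Formatting the str prints the title as expected.
print("Title {}".format(title))

# Formatting the bytes object prints its repr, which starts with b'...'.
print("Title {}".format(encoded))
```

So on Python 3 the title can be formatted directly, and calling .encode('utf-8') first is what introduces the b'...' in the output.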
Instead of using attrs, you can use the class_ keyword argument.
As find_all will return a list, you have to loop over it and format each value.
Python 2.7
name_box = soup.find_all('h1', class_='entry-title')
# name_box is a list containing every `h1` tag with the given class
for name in name_box:
    title = name.text.strip()
    print("Title {}".format(title.encode('utf-8')))
Python 3.*
name_box = soup.find_all('h1', class_='entry-title')
# name_box is a list containing every `h1` tag with the given class
for name in name_box:
    title = name.text.strip()
    print("Title {}".format(title))
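As a quick check that the two spellings select the same elements, the following sketch compares attrs and class_ on a made-up HTML fragment (the markup is an assumption, not the real blog page). It also includes a tag with two classes, since Beautiful Soup matches a class search against any one of a tag's classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<h1 class="entry-title">First post</h1>
<h1 class="entry-title featured">Second post</h1>
<h2 class="entry-title">Not an h1</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# Two ways of asking for the same thing.
by_attrs = soup.find_all("h1", attrs={"class": "entry-title"})
by_class = soup.find_all("h1", class_="entry-title")

# get_text(strip=True) returns the tag's text with surrounding whitespace removed.
titles = [tag.get_text(strip=True) for tag in by_class]

for title in titles:
    print("Title {}".format(title))
```

The trailing underscore in class_ is needed because class is a reserved word in Python and cannot be used as a keyword argument name.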