
Python website scraping for all occurrences of a tag using 'soup.find_all'

I'm just beginning to dabble with Python and, as many have done, I am starting with a web-scraping example to try the language. What I am attempting is to gather every element of a certain tag type and return them as a list. For this I am using BeautifulSoup and requests. The site being used for this test is the blog for a small game called 'Staxel'.

I can get my code to output the first occurrence of the tag using [soup.find] and [print], but when I change the code to the below I get warnings about printing a list as a fixed variable.

Can someone please indicate what I should be using for this?

# import libraries
import requests
import ssl
from bs4 import BeautifulSoup

# set the URL string
quote_page = 'https://blog.playstaxel.com'

# query the website and return the html to give us a 'page' variable
page = requests.get(quote_page)


# parse the html using beautiful soup and store in a variable ... 'soup'
soup = BeautifulSoup(page.content, 'lxml')

# Remove the 'div' of name and get it's value
name_box = soup.find_all('h1',attrs={'class':'entry-title'})
name = name_box.text.strip() #strip() is used to remove the starting and trailing
print ("Title {}".format(name))

By using .find_all(), you're creating a list of all occurrences of h1 tags with that class. A list has no .text attribute, so you simply need to wrap your print statement in a for loop and read .text from each item. Your code with that structure looks like:

# import libraries (the unused 'ssl' import has been dropped)
import requests
from bs4 import BeautifulSoup

# set the URL string
quote_page = 'https://blog.playstaxel.com'

# query the website and return the html to give us a 'page' variable
page = requests.get(quote_page)


# parse the html using beautiful soup and store in a variable ... 'soup'
soup = BeautifulSoup(page.content, 'lxml')

# Find every 'h1' tag with class 'entry-title' and print its stripped text
name_box = soup.find_all('h1',attrs={'class':'entry-title'})
for name in name_box:
    print ("Title {}".format(name.text.strip()))

Output:

Title Magic update – feature preview
Title New Years
Title Staxel Changelog for 1.3.52
Title Staxel Changelog for 1.3.49
Title Staxel Changelog for 1.3.48
Title Halloween Update & GOG
Title Staxel Changelog for 1.3.44
Title Staxel Changelog for 1.3.42
Title Staxel Changelog for 1.3.40
Title Staxel Changelog for 1.3.34 to 1.3.39

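That's because soup.find_all returns a list of tags (a ResultSet), not a single tag like soup.find does, so .text has to be read from each item rather than from the list itself.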
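To make the difference concrete, here is a minimal sketch that compares the two calls. The HTML string is made up for illustration (it is not the real blog content), and it uses the built-in 'html.parser' instead of 'lxml' only so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, not the real blog markup
html = """
<h1 class="entry-title"> First post </h1>
<h1 class="entry-title">Second post</h1>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("h1", attrs={"class": "entry-title"})      # one tag
every = soup.find_all("h1", attrs={"class": "entry-title"})  # list of tags

print(type(first).__name__)
print(type(every).__name__)

# A single tag has .text, so this works
print(first.text.strip())

# The list itself has no .text, which is what the question ran into
try:
    every.text
except AttributeError as exc:
    print("find_all result has no .text:", type(exc).__name__)

# Read .text from each tag in the list instead
titles = [tag.text.strip() for tag in every]
print(titles)
```

The list comprehension at the end is the same idea as the for loop in the answer above, just collecting the titles instead of printing them one by one.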

The snippets below should avoid the error and print any titles found, in Python 2.7 and 3.*:

Python 3.*:

name_box = soup.find_all('h1',attrs={'class':'entry-title'})
titles = [name.text.strip() for name in name_box]  # loop over results and strip extra whitespace
for title in titles:  # loop over titles and print
    print ("Title {}".format(title))

Python 2.7:

name_box = soup.find_all('h1',attrs={'class':'entry-title'})
titles = [name.text.strip() for name in name_box]  # loop over results and strip extra whitespace
for title in titles:  # loop over titles and print
    print ("Title {}".format(title.encode('utf-8')))

As @Vantagilt mentioned in the comments, his output had a 'b' in front of each string. This is due to differences in the way strings are handled in Python 2.7 and Python 3. Here's a good blog on the subject.

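The main point is that strings are Unicode by default in Python 3, so the encode part can be dropped (calling encode there produces a bytes object, which is what prints with the 'b' prefix). In Python 2.7 the default string type holds bytes, so the Unicode text coming back from BeautifulSoup needs to be encoded explicitly, or else we will see errors like: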

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 13: ordinal not in range(128)
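For Python 3, the following sketch shows where the 'b' prefix comes from. The title string is copied from the output above, with the en dash written as an escape; the variable names are only illustrative.

```python
# Python 3 only: str is Unicode text, bytes is a separate type
title = "Magic update \u2013 feature preview"

# The en dash sits at index 13, the position named in the Python 2 error
print(title.index("\u2013"))

# Formatting the text directly gives the readable title
as_text = "Title {}".format(title)
print(as_text)

# Encoding first yields bytes; format() then inserts their repr, b'...'
encoded = title.encode("utf-8")
as_bytes = "Title {}".format(encoded)
print(as_bytes)

# Decoding the bytes gets the original text back
print(encoded.decode("utf-8") == title)
```

So under Python 3 the simplest route is to leave out .encode('utf-8') when the goal is just to print the title.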

Instead of using attrs, you can use the class_ keyword argument (with a trailing underscore, because class is a reserved word in Python).

As find_all returns a list, you have to loop over it and format each value.

Python 2.7

name_box = soup.find_all('h1', class_='entry-title')
# name_box is a list containing every `h1` tag with the given class

for name in name_box:
    title = name.text.strip()
    print ("Title {}".format(title.encode('utf-8')))

Python 3.*

name_box = soup.find_all('h1', class_='entry-title')
# name_box is a list containing every `h1` tag with the given class

for name in name_box:
    title = name.text.strip()
    print ("Title {}".format(title))
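As a quick check that the two spellings select the same tags, here is a small sketch for Python 3. The HTML is invented for the example, and 'html.parser' is used instead of 'lxml' only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two matching headings and one with a different class
html = """
<h1 class="entry-title">Alpha</h1>
<h1 class="other">Beta</h1>
<h1 class="entry-title">Gamma</h1>
"""
soup = BeautifulSoup(html, "html.parser")

# Same filter written two ways
by_attrs = soup.find_all("h1", attrs={"class": "entry-title"})
by_class = soup.find_all("h1", class_="entry-title")

titles_attrs = [tag.text.strip() for tag in by_attrs]
titles_class = [tag.text.strip() for tag in by_class]

print(titles_attrs)
print(titles_class)
print(titles_attrs == titles_class)
```

Which form to use is mostly a matter of taste; the attrs dictionary is handy when filtering on several attributes at once.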
