
How Do You Scrape (with Python) a Website That Is Not Formatted Like A Table?

I am looking to scrape the website Different Types of Beer using BeautifulSoup and return each style of beer, its ABV, and its 'Pairs With' entry. Once scraped, I want to put those values into a table that I can filter on user input, so that it returns beer recommendations based on the user's cuisine and ABV preferences.

I have been trying many different approaches but can't figure it out at all. So far I can only get the following:

import requests
from bs4 import BeautifulSoup 
import pandas as pd
import csv

r = requests.get("https://www.webstaurantstore.com/article/27/different-types-of-beers.html")
soup = BeautifulSoup(r.text, "html.parser")
beer_titles = soup.find_all('h3')[3:-1]
beer_titles_list = []
for b in beer_titles:
    result = b.text.strip()
    beer_titles_list.append(result)
beer_titles_list

This correctly locates the beer titles, but I am unable to locate the ABV and "Pairs with" values.

I am not necessarily looking for the exact answer, because I understand that there is a lot more work to be done. I am mainly looking for tips or ways to alter or extend my code that will guide me toward my goal.

You need to have a look at the HTML and decide how to locate the elements you need. In the case of this website, you have found how to locate the titles using the H3 tag. The other elements are nearby.

You could iterate through your H3 elements and use each one as a starting point to locate the adjacent ABV and Pairs with elements. These are found in <p> elements which also contain unwanted <b> tags. Once you have an H3 element, you can search onward from it, sideways or up and down the tree, as needed.

BeautifulSoup also lets you locate all of the text nodes inside an element by searching for NavigableString instances. Taking just the second one skips over the text inside the <b> tag, so there is no need for a regular expression.
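To see why the second text node is the one to keep, here is a minimal sketch that runs against a small inline snippet instead of the live page. The markup in SAMPLE_HTML is an assumption that only imitates the structure described above (a heading followed by <p> elements with a <b> tag in front of the value), and text_nodes is a hypothetical helper name; the real page may differ.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup imitating the structure described above.
# The real page may be laid out differently.
SAMPLE_HTML = """
<h3>American Lager</h3>
<p>A short description of the style.</p>
<p><b>ABV:</b> 3.2-4.0%</p>
<p><b>IBU:</b> 5-15</p>
<p><b>Pairs With:</b> American cuisine, spicy food</p>
"""


def text_nodes(tag):
    """Return every text node under tag, in document order."""
    return tag.find_all(string=lambda s: isinstance(s, NavigableString))


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_title = sample_soup.find("h3")

# All <p> elements that come after the heading, capped at five.
sample_paragraphs = sample_title.find_all_next("p", limit=5)

abv_nodes = text_nodes(sample_paragraphs[1])
pairs_nodes = text_nodes(sample_paragraphs[3])

# The first text node is the one inside <b>; the second is the value.
print([str(n) for n in abv_nodes])
print(abv_nodes[1].strip())
print(pairs_nodes[1].strip())
```

Printing the full list of text nodes for one paragraph is a quick way to check which index holds the value before relying on it for every heading.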

For example:

import requests
from bs4 import BeautifulSoup, NavigableString
import csv

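def get_text(e):
    # Collect every text node inside the element, in document order.
    # (string= is the current name for the older text= argument.)
    text_elements = e.find_all(string=lambda x: isinstance(x, NavigableString))

    # The first text node sits inside the <b> tag; the second is the value.
    if len(text_elements) > 1:
        return text_elements[1].strip()
    else:
        return ''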

r = requests.get("https://www.webstaurantstore.com/article/27/different-types-of-beers.html")
soup = BeautifulSoup(r.text, "html.parser")
beer_titles_list = []

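for beer_title in soup.find_all('h3')[3:-1]:
    result = beer_title.text.strip()
    # Take the next few <p> elements that follow this heading in the document.
    p = beer_title.find_all_next('p', limit=5)
    abv = get_text(p[1])         # second paragraph holds the ABV
    pairs_with = get_text(p[3])  # fourth paragraph holds the "Pairs with" text

    beer_titles_list.append([result, abv, pairs_with])
    print([result, abv, pairs_with])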

This would give you output starting:

['American Lager', '3.2-4.0%', 'American cuisine, spicy food']
['German Helles', '4.8-5.6%', 'German cuisine, pork, brie']
['German Pilsner', '4.6-5.3%', 'German cuisine, poultry, fish, spicy cheese']
['Czech or Bohemian Pilsner', '4.1-5.1%', 'Spicy food, Asian cuisine, sharp cheddar cheese']
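
From here, the filtering step mentioned in the question could look something like the sketch below. It uses only the standard library and the four rows shown above. The helper names parse_abv and recommend are hypothetical, and comparing against the upper end of each ABV range is an assumption about what an "ABV preference" means.

```python
import re

# Rows in the same [style, abv, pairs_with] shape that the scraper prints.
rows = [
    ['American Lager', '3.2-4.0%', 'American cuisine, spicy food'],
    ['German Helles', '4.8-5.6%', 'German cuisine, pork, brie'],
    ['German Pilsner', '4.6-5.3%', 'German cuisine, poultry, fish, spicy cheese'],
    ['Czech or Bohemian Pilsner', '4.1-5.1%', 'Spicy food, Asian cuisine, sharp cheddar cheese'],
]


def parse_abv(abv_text):
    """Turn a string such as '3.2-4.0%' into (3.2, 4.0); return None if it has no numbers."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", abv_text)]
    if not numbers:
        return None
    return min(numbers), max(numbers)


def recommend(rows, cuisine, max_abv):
    """Return styles whose pairing text mentions cuisine and whose upper ABV is at most max_abv."""
    matches = []
    for style, abv_text, pairs_with in rows:
        abv_range = parse_abv(abv_text)
        if abv_range is None:
            continue  # skip rows where the ABV could not be parsed
        if cuisine.lower() in pairs_with.lower() and abv_range[1] <= max_abv:
            matches.append(style)
    return matches


print(recommend(rows, "german", 5.5))
print(recommend(rows, "spicy", 6.0))
```

The cuisine match is a plain case-insensitive substring test, so "spicy" matches both "spicy food" and "spicy cheese". If that is too loose, splitting the pairing text on commas and comparing whole entries would be one alternative.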
