
How can I get data from a website using BeautifulSoup and requests?

I am a beginner in web scraping, and I need help with this problem. allrecipes.com is a website where you can find recipes based on a search query, which in this case is 'pie':

Link to the HTML source: 'view-source:https://www.allrecipes.com/search/results/?wt=pie&sort=re' (right click -> View Page Source)

I want to create a program that takes an input, searches it on allrecipes, and returns a list of tuples for the first five recipes, with data such as the time it takes to make, serving yield, ingredients, and more. This is my program so far:

import requests
from bs4 import BeautifulSoup

def searchdata():
    inp = input('What recipe would you like to search? ')
    url = 'http://www.allrecipes.com/search/results/?wt=' + str(inp) + '&sort=re'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    links = []

    # fill in code for finding top 3 or 5 links

    for i in range(3):
        a = requests.get(links[i])
        soupa = BeautifulSoup(a.text, 'html.parser')

        # fill in code to find name, ingredients, time, and serving size with data from soupa

        names = []
        time = []
        servings = []
        ratings = []
        ingredients = []


searchdata()

Yes, I know my code is very messy, but what should I fill in in the two code fill-in areas? Thanks.

After searching for the recipe, you have to get the link of each recipe and then send another request for each of those links, because the information you're looking for is not available on the search results page. That would not look clean without OOP, so here's a class that does what you want.

import requests
from time import sleep
from bs4 import BeautifulSoup


class Scraper:
    links = []
    names = []

    def get_url(self, url):
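        # fetch the page and keep the parsed HTML for the other methods to query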
        url = requests.get(url)
        self.soup = BeautifulSoup(url.content, 'html.parser')

    def print_info(self, name):
        self.get_url(f'https://www.allrecipes.com/search/results/?wt={name}&sort=re')
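        # the 'subtext' span appears to hold the result count; a leading '0' is treated as no matches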
        if self.soup.find('span', class_='subtext').text.strip()[0] == '0':
            print(f'No recipes found for {name}')
            return
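        # each result card is an <article> inside the section with id 'fixedGridSection'; keep the first five card titles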
        results = self.soup.find('section', id='fixedGridSection')
        articles = results.find_all('article')
        texts = []
        for article in articles:
            txt = article.find('h3', class_='fixed-recipe-card__h3')
            if txt:
                if len(texts) < 5:
                    texts.append(txt)
                else:
                    break
        self.links = [txt.a['href'] for txt in texts]
        self.names = [txt.a.span.text for txt in texts]
        self.get_data()

    def get_data(self):
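        # visit each recipe page and print its meta items and ingredient list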
        for i, link in enumerate(self.links):
            self.get_url(link)
            print('-' * 4 + self.names[i] + '-' * 4)
            info_names = [div.text.strip() for div in self.soup.find_all(
                'div', class_='recipe-meta-item-header')]
            ingredient_spans = self.soup.find_all('span', class_='ingredients-item-name')
            ingredients = [span.text.strip() for span in ingredient_spans]
            for j, div in enumerate(self.soup.find_all('div', class_='recipe-meta-item-body')):
                print(info_names[j].capitalize(), div.text.strip())
            print()
            print('Ingredients'.center(len(ingredients[0]), ' '))
            print('\n'.join(ingredients))
            print()
            print('*' * 50, end='\n\n')


chrome = Scraper()
chrome.print_info(input('What recipe would you like to search: '))
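
If you would rather keep your original searchdata() layout instead of a class, here is a minimal sketch of what the two fill-in areas could look like. It assumes the same CSS classes the class above relies on (fixed-recipe-card__h3 for the result card titles, recipe-meta-item-header / recipe-meta-item-body for the metadata, and ingredients-item-name for the ingredients); those selectors may break if the site's markup changes.

import requests
from bs4 import BeautifulSoup

def searchdata():
    inp = input('What recipe would you like to search? ')
    url = 'https://www.allrecipes.com/search/results/?wt=' + inp + '&sort=re'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')

    # top five links: every result card title is an <h3 class="fixed-recipe-card__h3"> wrapping an <a>
    cards = soup.find_all('h3', class_='fixed-recipe-card__h3')[:5]
    links = [card.a['href'] for card in cards]
    names = [card.a.span.text.strip() for card in cards]

    recipes = []
    for name, link in zip(names, links):
        soupa = BeautifulSoup(requests.get(link).text, 'html.parser')

        # metadata such as cook time and servings comes as header/body pairs
        headers = [d.text.strip() for d in soupa.find_all('div', class_='recipe-meta-item-header')]
        bodies = [d.text.strip() for d in soupa.find_all('div', class_='recipe-meta-item-body')]
        meta = dict(zip(headers, bodies))

        ingredients = [s.text.strip() for s in soupa.find_all('span', class_='ingredients-item-name')]
        recipes.append((name, meta, ingredients))

    return recipes

Each tuple then holds the recipe name, a dict of the meta fields, and the ingredient list, which matches the "list with tuples of the first five recipes" described in the question.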
