
How can I get data from a website using BeautifulSoup and requests?

I am a beginner in web scraping, and I need help with this problem. allrecipes.com is a website where you can find recipes based on a search query, which in this case is 'pie':

Link to the HTML source: 'view-source:https://www.allrecipes.com/search/results/?wt=pie&sort=re' (right click -> View Page Source)

I want to create a program that takes an input, searches it on allrecipes, and returns a list of tuples for the first five recipes, with data such as the time it takes to make, serving yield, ingredients, and more. This is my program so far:

import requests
from bs4 import BeautifulSoup

def searchdata():
    inp = input('What recipe would you like to search? ')
    url = 'http://www.allrecipes.com/search/results/?wt=' + str(inp) + '&sort=re'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    links = []

    # fill in code for finding top 3 or 5 links

    for i in range(3):
        a = requests.get(links[i])
        soupa = BeautifulSoup(a.text, 'html.parser')

        # fill in code to find name, ingredients, time, and serving size with data from soupa

        names = []
        time = []
        servings = []
        ratings = []
        ingredients = []


searchdata()

Yes, I know my code is very messy, but what should I fill in in the two code fill-in areas? Thanks.

After searching for the recipe, you have to get the link of each recipe and then send another request for each of those links, because the information you're looking for is not available on the search results page. That would not look clean without OOP, so here's a class that does what you want.

import requests
from time import sleep
from bs4 import BeautifulSoup


class Scraper:
    links = []
    names = []

    def get_url(self, url):
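        # fetch the page and keep the parsed HTML for the other methods to query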
        url = requests.get(url)
        self.soup = BeautifulSoup(url.content, 'html.parser')

    def print_info(self, name):
        self.get_url(f'https://www.allrecipes.com/search/results/?wt={name}&sort=re')
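        # the 'subtext' span appears to hold the result count; a leading '0' is treated as no matches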
        if self.soup.find('span', class_='subtext').text.strip()[0] == '0':
            print(f'No recipes found for {name}')
            return
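        # each result card is an <article> inside the section with id 'fixedGridSection'; keep the first five card titles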
        results = self.soup.find('section', id='fixedGridSection')
        articles = results.find_all('article')
        texts = []
        for article in articles:
            txt = article.find('h3', class_='fixed-recipe-card__h3')
            if txt:
                if len(texts) < 5:
                    texts.append(txt)
                else:
                    break
        self.links = [txt.a['href'] for txt in texts]
        self.names = [txt.a.span.text for txt in texts]
        self.get_data()

    def get_data(self):
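        # visit each recipe page and print its meta items and ingredient list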
        for i, link in enumerate(self.links):
            self.get_url(link)
            print('-' * 4 + self.names[i] + '-' * 4)
            info_names = [div.text.strip() for div in self.soup.find_all(
                'div', class_='recipe-meta-item-header')]
            ingredient_spans = self.soup.find_all('span', class_='ingredients-item-name')
            ingredients = [span.text.strip() for span in ingredient_spans]
            for j, div in enumerate(self.soup.find_all('div', class_='recipe-meta-item-body')):
                print(info_names[j].capitalize(), div.text.strip())
            print()
            print('Ingredients'.center(len(ingredients[0]), ' '))
            print('\n'.join(ingredients))
            print()
            print('*' * 50, end='\n\n')


chrome = Scraper()
chrome.print_info(input('What recipe would you like to search: '))
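
If you would rather keep your original searchdata() layout instead of a class, here is a minimal sketch of what the two fill-in areas could look like. It assumes the same CSS classes the class above relies on (fixed-recipe-card__h3 for the result card titles, recipe-meta-item-header / recipe-meta-item-body for the metadata, and ingredients-item-name for the ingredients); those selectors may break if the site's markup changes.

import requests
from bs4 import BeautifulSoup

def searchdata():
    inp = input('What recipe would you like to search? ')
    url = 'https://www.allrecipes.com/search/results/?wt=' + inp + '&sort=re'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')

    # top five links: every result card title is an <h3 class="fixed-recipe-card__h3"> wrapping an <a>
    cards = soup.find_all('h3', class_='fixed-recipe-card__h3')[:5]
    links = [card.a['href'] for card in cards]
    names = [card.a.span.text.strip() for card in cards]

    recipes = []
    for name, link in zip(names, links):
        soupa = BeautifulSoup(requests.get(link).text, 'html.parser')

        # metadata such as cook time and servings comes as header/body pairs
        headers = [d.text.strip() for d in soupa.find_all('div', class_='recipe-meta-item-header')]
        bodies = [d.text.strip() for d in soupa.find_all('div', class_='recipe-meta-item-body')]
        meta = dict(zip(headers, bodies))

        ingredients = [s.text.strip() for s in soupa.find_all('span', class_='ingredients-item-name')]
        recipes.append((name, meta, ingredients))

    return recipes

Each tuple then holds the recipe name, a dict of the meta fields, and the ingredient list, which matches the "list with tuples of the first five recipes" described in the question.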
