
Requests / Beautiful Soup / Python

I am studying web scraping and I need some help. The page that is returned to me has a strange encoding. How can I fix this? Why doesn't my page display like the link below? https://www.saraiva.com.br/redes-de-computadores-ii-niveis-de-transporte-e-rede-serie-tekne-6194354/p


import requests 
from bs4 import BeautifulSoup

def get_http(url, nome_livro):

    # Percent-encode the spaces by hand and build the search URL.
    nome_livro = nome_livro.replace(' ', '%20')
    url = '{0}?q={1}'.format(url, nome_livro)

    try:
        return requests.get(url)
    except requests.exceptions.RequestException as e:
        # RequestException is the base class of HTTPError, ConnectionError
        # and Timeout, so one clause covers all of them.
        print(str(e))
        return None

def get_products(content):

    soup = BeautifulSoup(content, 'lxml')
    products = soup.find_all('div', {'class': 'nm-product-img-container'}, limit = 10)

    list_products = []
    for product in products:
        info_product = [product.a.get('href').replace("//", "http://"), product.a.string]
        list_products.append(info_product)

    return list_products 

def get_http_page_product(list_products):
    
    for product in list_products:

        try:
            r = requests.get(product[0])
            print(r.url)
            print(r.encoding)
            # r.encoding = 'ISO-8859-1'
        except requests.exceptions.RequestException as e:
            print(str(e))
            r = None

        # Skip this product if the request failed.
        if r is None:
            continue

        print(product[0])
        print(product[1])
        parse_page_product(r.text, product[0], product[1])
        break

def parse_page_product(content, url_product, title):

    soup = BeautifulSoup(content, 'lxml')
    # Write with an explicit encoding so the file matches the decoded text.
    with open('result.html', 'w', encoding='utf-8') as f:
        f.write(content)

if __name__=='__main__':
    url = 'http://busca.saraiva.com.br/busca'         
    nome_livro = 'redes de computadores'

    r = get_http(url, nome_livro)

    if r:
        list_products = get_products(r.text)
        get_http_page_product(list_products)
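As a side note on the query building in `get_http`: requests can percent-encode the query string itself via the `params` argument, so the manual `%20` replacement is unnecessary. A small sketch (the URL is the one from the question):

```python
import requests

# requests url-encodes params itself; spaces become '+' in the query string.
url = "http://busca.saraiva.com.br/busca"
req = requests.Request("GET", url, params={"q": "redes de computadores"}).prepare()
print(req.url)  # http://busca.saraiva.com.br/busca?q=redes+de+computadores
```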

The reason you don't see the page in code the way you do in a browser is that it is loaded dynamically. If you understand how requests work in the browser, you can find the request made to the API. Using the link you chose as an example, I will show you how to get the title, description, pictures, sellers and prices; you can then pick the fields you need. To do this, we only need to know the product SKU. First of all, we import the necessary libraries:

import requests
from bs4 import BeautifulSoup
import json

Now let's prepare the function that fetches the SKU:

def get_sku(url):
    response = requests.request("GET", url)
    soup = BeautifulSoup(response.text, features='lxml')
    return soup.find('meta', {"itemprop": "sku"}).get('content')

Now that we have the SKU, we can substitute it into the API URL:

def get_product_details(sku):
    url = f"https://www.saraiva.com.br/api/catalog_system/pub/products/search?fq=skuId:{sku}"
    headers = {
        'accept': '*/*'
    }
    response = requests.request("GET", url, headers=headers)
    json_obj = json.loads(response.text)
    print(json_obj[0]['productTitle'])
    print(json_obj[0]['description'])
    for item in json_obj[0]['items']:
        for images in item['images']:
            print(images['imageUrl'])
        for seller in item['sellers']:
            print(seller['sellerName'], seller['commertialOffer']['Price'])

Now we just pass the product URL to our functions and look at the result:

get_product_details(get_sku("https://www.saraiva.com.br/redes-de-computadores-ii-niveis-de-transporte-e-rede-serie-tekne-6194354/p"))

Output:

Redes de Computadores II - Níveis de Transporte e Rede - Série Tekne
Com o intuito de oferecer os subsídios necessários para a formação qualificada na área, esta obra aborda a conceituação e aplicação dos protocolos de redes e dos equipamentos de comunicação de dados, especificamente na camada de transporte e rede. Este livro é uma parceria das editoras do Grupo A Educação com o IFRS (Instituto Federal de Educação, Ciência e Tecnologia do Rio Grande do Sul).
https://lojasaraiva.vteximg.com.br/arquivos/ids/9206261/1008454251.jpg?v=637103785666070000
Saraiva 77.0
